Speakers: Erik Ketzan and Nicolas Werner
Affiliation: Centre for Digital Humanities, Trinity College Dublin, Dublin, Ireland & University of Cologne, Cologne, Germany
Title: ‘Entrez!’ she called: Evaluating Language Identification Tools in English Literary Texts
Abstract: This short paper presents work in progress on evaluating current language identification (LI) tools for identifying foreign-language n-grams in English-language literary texts, for instance, “‘Entrez!’ she called”. We first manually annotated the French and Spanish words in 12,000-word text samples by F. Scott Fitzgerald and Ernest Hemingway using a TEI tag. We then split the tagged sample texts into four groups of n-grams, from unigrams to tetragrams, and compared how accurately five LI packages identified the language of the tagged foreign-language snippets. We report that, of the packages tested, Fasttext proved the most accurate for this task overall, although methodological questions remain open and further work is needed.
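The evaluation pipeline described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: that the TEI tag is `<foreign xml:lang="...">`, that an n-gram is a sliding window of whitespace-separated tokens containing at least one tagged token, and that accuracy is the share of such n-grams whose predicted language matches the annotation. The function names are hypothetical, and `toy_identify` is only a placeholder where a real LI package would be called.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tagged_tokens(tei_fragment):
    """Return (token, language) pairs; untagged tokens are assumed to be 'en'."""
    root = ET.fromstring(tei_fragment)
    pairs = [(tok, "en") for tok in (root.text or "").split()]
    for child in root:
        # Assumption: foreign words are marked with <foreign xml:lang="...">.
        lang = child.get(XML_LANG, "en") if child.tag == "foreign" else "en"
        pairs += [(tok, lang) for tok in (child.text or "").split()]
        pairs += [(tok, "en") for tok in (child.tail or "").split()]
    return pairs


def foreign_ngrams(pairs, n):
    """Sliding windows of n tokens that contain exactly one foreign language."""
    result = []
    for i in range(len(pairs) - n + 1):
        window = pairs[i:i + n]
        langs = {lang for _, lang in window if lang != "en"}
        if len(langs) == 1:
            text = " ".join(tok for tok, _ in window)
            result.append((text, langs.pop()))
    return result


def accuracy(ngrams, identify):
    """Fraction of n-grams for which identify(text) equals the gold label."""
    if not ngrams:
        return 0.0
    hits = sum(identify(text) == gold for text, gold in ngrams)
    return hits / len(ngrams)


def toy_identify(text):
    """Placeholder classifier; a real LI package would be called here instead."""
    french = {"entrez!", "merci"}
    return "fr" if any(tok.lower() in french for tok in text.split()) else "en"


sample = '<p><foreign xml:lang="fr">Entrez!</foreign> she called from the hall.</p>'
pairs = tagged_tokens(sample)
for n in range(1, 5):  # unigram to tetragram
    grams = foreign_ngrams(pairs, n)
    print(n, grams, accuracy(grams, toy_identify))
```

Passing the classifier in as a function keeps the n-gram construction and scoring identical across packages, so only the `identify` argument would change from one LI tool to the next.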