‘Entrez!’ she called: Evaluating Language Identification Tools in English Literary Texts

:speech_balloon: Speakers: Erik Ketzan and Nicolas Werner

:classical_building: Affiliation: Centre for Digital Humanities, Trinity College Dublin, Dublin, Ireland & University of Cologne, Cologne, Germany

Title: ‘Entrez!’ she called: Evaluating Language Identification Tools in English Literary Texts

Abstract: This short paper presents work in progress on the evaluation of current language identification (LI) tools for identifying foreign language n-grams in English-language literary texts, for instance, “‘Entrez!’ she called”. We first manually annotated French and Spanish words appearing in 12,000-word text samples by F. Scott Fitzgerald and Ernest Hemingway using a TEI tag. We then split the tagged sample texts into four groups of n-grams, from unigram to tetragram, and compared the accuracy of five LI packages on correctly identifying the language of the tagged foreign-language snippets. We report that, of the packages tested, Fasttext proved most accurate for this task overall, but that methodological questions and future work remain.
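
The evaluation described in the abstract (splitting tagged snippets into unigrams through tetragrams, then scoring an LI tool per n-gram size) can be illustrated with a minimal Python sketch. This is not the authors' code. The names `word_ngrams`, `ngram_accuracy`, and `toy_identify` are hypothetical, the sample snippets are invented, and the sketch assumes that each n-gram inherits the language of the tagged snippet it comes from, which the abstract does not specify. `toy_identify` is a stand-in word-list lookup; a real LI package would be plugged in at that point.

```python
from typing import Callable, Optional


def word_ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every contiguous word n-gram of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_accuracy(
    snippets: list[tuple[str, str]],
    identify: Callable[[str], str],
    n: int,
) -> Optional[float]:
    """Share of n-grams whose predicted language matches the gold tag.

    Assumption: each n-gram inherits the gold language of its snippet.
    Returns None when no snippet is long enough to yield an n-gram.
    """
    total = 0
    correct = 0
    for text, gold in snippets:
        for gram in word_ngrams(text.split(), n):
            total += 1
            if identify(" ".join(gram)) == gold:
                correct += 1
    return correct / total if total else None


def toy_identify(text: str) -> str:
    """Hypothetical stand-in for an LI package: a tiny word-list lookup."""
    french = {"entrez", "merci", "monsieur"}
    words = {w.strip("!?.,").lower() for w in text.split()}
    return "fr" if words & french else "es"


# Invented gold-tagged snippets as (text, language code) pairs.
snippets = [("merci monsieur", "fr"), ("buenas noches señor", "es")]

# Score the identifier separately for unigrams through tetragrams.
scores = {n: ngram_accuracy(snippets, toy_identify, n) for n in range(1, 5)}
```

Scoring each n-gram size separately shows how accuracy changes as the tool sees more context, which is the comparison the abstract describes across the five packages.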

:newspaper: Link to paper
