Speaker: Jacob Eisenstein
Title: Uncertainty and Underspecification in Humanities Applications of Natural Language Processing
Abstract: Natural language processing (NLP) and other applications of machine learning are playing an increasingly large role in the humanities and social sciences. These technologies make it possible to automate labeling tasks that would otherwise be too time-consuming or tedious to perform by hand. However, NLP systems do not achieve perfect accuracy. More concerning, their errors are not uniformly distributed; rather, they can encode and propagate bias from the training data. In this talk, I will survey the validity threats that arise from various types of learning-based NLP systems, ranging from classical models like logistic regression to contemporary architectures that combine pretraining, fine-tuning, and domain adaptation. Of particular interest is the problem of underspecification, highlighted by D’Amour et al. (2020): even when overall test set accuracy can be estimated with high precision, performance on other metrics may vary dramatically with seemingly irrelevant details such as the seed value used to randomly initialize the model’s parameters. After this survey, I will describe how we addressed validity in our work on Abolitionist Networks (Soni et al., 2021), which combines topic models and network analysis. I will conclude by offering a set of recommendations for practitioners in the digital humanities.
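
The underspecification phenomenon the abstract describes can be illustrated with a minimal sketch (not taken from the talk or from D’Amour et al.; the model, data, and seed values below are illustrative assumptions): two classifiers trained on identical data, differing only in their random initialization seed, can reach nearly identical test accuracy while still disagreeing on individual predictions.

```python
# Illustrative sketch: same data, same architecture, different random seeds.
# Overall accuracy is nearly identical, but the two models can disagree on
# individual test points -- the kind of variation underspecification allows.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for an annotation task (labels are hypothetical).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Train two copies that differ only in the seed for weight initialization.
models = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    .fit(X_tr, y_tr)
    for seed in (1, 2)
]
preds = [m.predict(X_te) for m in models]
accs = [(p == y_te).mean() for p in preds]
disagree = (preds[0] != preds[1]).mean()

print(f"accuracies: {accs[0]:.3f} vs {accs[1]:.3f}")
print(f"fraction of test points where the models disagree: {disagree:.3f}")
```

If the disagreements concentrate in a particular subgroup of the data, aggregate accuracy alone will not reveal the problem, which is why the talk treats seed sensitivity as a validity threat for downstream humanities analyses.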