Speaker: Jacob Eisenstein
Affiliation: Google
Title: Uncertainty and Underspecification in Humanities Applications of Natural Language Processing
Abstract: Natural language processing (NLP) and other applications of machine learning are playing an increasingly large role in the humanities and social sciences. These technologies make it possible to automate labeling tasks that would otherwise be too time-consuming or tedious to perform by hand. However, NLP systems do not achieve perfect accuracy. More concerning, their errors are not uniformly distributed; rather, they can encode and propagate bias from the training data. In this talk, I will survey the validity threats that arise from various types of learning-based NLP systems, ranging from classical models like logistic regression to contemporary architectures that combine pretraining, fine-tuning, and domain adaptation. Of particular interest is the problem of underspecification, highlighted by D'Amour et al. (2020): even when overall test-set accuracy can be estimated with high precision, performance on other metrics may vary dramatically with seemingly irrelevant details, such as the seed value used to randomly initialize the model's parameters. After this survey, I will describe our approach to validity in our work on Abolitionist Networks (Soni et al., 2021), which combines topic models and network analysis. I will conclude by offering a set of recommendations for practitioners in the digital humanities.
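As a concrete illustration of the underspecification phenomenon (this sketch is not from the talk itself), one can train the same model several times, changing only the random seed, and compare overall test accuracy against accuracy on some subgroup of interest. The data, model, and subgroup definition below are placeholders; the point is only the experimental setup.

```python
# Illustrative sketch (not from the talk): same model, same data, different seeds.
# Overall accuracy tends to be stable, while subgroup accuracy can vary more.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
group = X[:, 0] > 1.0  # stand-in for a demographic or domain subgroup

X_tr, X_te, y_tr, y_te, _, g_te = train_test_split(
    X, y, group, test_size=0.3, random_state=0
)

for seed in range(5):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=seed)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    overall = accuracy_score(y_te, pred)
    subgroup = accuracy_score(y_te[g_te], pred[g_te])
    print(f"seed={seed}  overall={overall:.3f}  subgroup={subgroup:.3f}")
```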
One last question about BERT vs. word2vec when looking at the 'leadership' on semantic change wielded by various newspapers/subcorpora. I wonder if word2vec and type-based embeddings force a linear model of 'change', from 'old' to 'new' meanings, which in turn forces a 'leadership' model whereby one venue 'leads' and another 'follows' in its usage of a word. First, I wonder if that model imposes 'leadership' relations too liberally, given that every venue will fall somewhere along the temporal spectrum from 'old' to 'new'. Second, what I find exciting about token-based embeddings like BERT is that they might allow us to think about semantic history in more multidimensional ways: there may not be an old and a new meaning of a word like 'equality' so much as a variety of meanings (mathematical, political, etc.) which reconfigure in various ways. In that sense a particular venue/magazine may not 'lead' a semantic change so much as experiment in semantic developments which never catch on more broadly, or do so only partially.

That said, I haven't been able to experiment with BERT much yet, due to its high computational cost and due to nervousness about its having been trained on contemporary language. @fotis also gave a great presentation showing how token-based embeddings do not perform better on semantic tasks. So I'm not sure where the frontier is in thinking about semantic change, particularly in the ways you were thinking about it here with respect to locating specific venues/contexts as driving or leading change (which imo is underexplored). In any case, thanks for the very interesting talk!
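For concreteness, here is a minimal sketch of what I mean by token-based embeddings, using the Hugging Face transformers API (the model name and sentences are just placeholders): each occurrence of a word gets its own context-dependent vector, rather than one vector per word type.

```python
# Illustrative sketch: one contextual vector per occurrence of a target word.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "All men are created equal, and equality is a political ideal.",
    "The equality of the two expressions follows from the axioms.",
]

vectors = []
for sent in sentences:
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    for tok, vec in zip(tokens, hidden):
        if tok == "equality":
            vectors.append(vec)  # one vector per occurrence, not per type
```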
I think there are some good reasons to prefer contextualized embeddings for understanding semantic changes. As you suggest with the intuitions about ‘equality’, many changes can be thought of as the appearance/disappearance of senses, which would be better handled by something like clustering on token embeddings rather than dynamic type embeddings. The clusters could then be treated as words, and techniques for identifying leaders in event cascades could be applied. There are some details that would need to be figured out to get it to work, but it seems like a promising direction.
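To make the clustering idea a bit more concrete, here is a rough sketch (not from our paper): given an array of token embeddings for a word, with a year attached to each occurrence, cluster the tokens into induced "senses" and track each sense's relative frequency over time. The clustering method, the number of senses, and all variable names below are placeholders.

```python
# Illustrative sketch: cluster token embeddings into induced "senses", then
# treat each sense's frequency over time as a time series of its own.
# `token_vectors` is assumed to be an (n_occurrences, dim) array, and `years`
# an integer array with one entry per occurrence; both are hypothetical inputs.
import numpy as np
from sklearn.cluster import KMeans

def induce_senses(token_vectors, n_senses=3, seed=0):
    # Assign each occurrence of the word to one of n_senses clusters.
    km = KMeans(n_clusters=n_senses, random_state=seed, n_init=10)
    return km.fit_predict(token_vectors)

def sense_frequencies(sense_labels, years, n_senses=3):
    # Relative frequency of each induced sense per year; these time series
    # could then feed whatever leader/follower analysis one prefers.
    freqs = {}
    for year in sorted(set(years)):
        mask = years == year
        counts = np.bincount(sense_labels[mask], minlength=n_senses)
        freqs[year] = counts / counts.sum()
    return freqs
```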
But I disagree that our approach implies that every venue will be somewhere on a 1D spectrum from old to new. We do identify changes by estimating temporal embeddings for each word, but to find leaders and followers, we work with newspaper-specific temporal embeddings. It could be (and often is) the case that each newspaper’s embedding of a word is evolving more or less independently, in which case we would not have any leader or follower. We use randomization to threshold the autocorrelation scores: for each word and candidate leader/follower dyad, the autocorrelation score has to be significantly higher than the scores that arise from chance in a model in which articles are assigned randomly to newspapers.
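Schematically, the randomization threshold looks something like the sketch below. The scoring function in the paper is an autocorrelation-style statistic over newspaper-specific embedding series; here a simple lagged correlation between two newspapers' time series stands in for it, and `build_series` is a hypothetical helper that turns an article-to-newspaper assignment into those series.

```python
# Schematic of the randomization threshold: the observed leader/follower score
# for a dyad must exceed what arises when articles are randomly reassigned to
# newspapers. The scoring function here is a placeholder, not the paper's exact
# statistic.
import numpy as np

def dyad_score(series_a, series_b, lag=1):
    # Correlation between newspaper A at time t and newspaper B at time t+lag
    # (i.e., A leading B by `lag` steps).
    return np.corrcoef(series_a[:-lag], series_b[lag:])[0, 1]

def randomization_pvalue(articles, labels, build_series, lag=1, n_perm=1000, seed=0):
    """articles: per-article values; labels: newspaper of each article;
    build_series(articles, labels) -> (series_a, series_b) time series."""
    rng = np.random.default_rng(seed)
    observed = dyad_score(*build_series(articles, labels), lag=lag)
    null_scores = []
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)  # random article-to-newspaper map
        null_scores.append(dyad_score(*build_series(articles, shuffled), lag=lag))
    # One-sided p-value: fraction of random assignments scoring at least as high.
    return observed, float(np.mean(np.array(null_scores) >= observed))
```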