Distributional Semantic Models: Predict or Count?

This post is an edited version of the call for a workshop that was scheduled to take place in June 2020 at Leiden University. If you are interested in the discussion, feel free to contact me (Lauren) to be updated on the rescheduled workshop date.

The recent development and release of neural language models that create so-called ‘contextualized token-embeddings’ has been considered revolutionary in Computer Science/NLP and in industry. Yet, far out in the uncharted backwaters of the unfashionable end of the galaxy, there is a small but steadily growing group of people who have started to wonder whether the application of these models could also support and advance text-based humanities research, and, in particular, linguistic research.

First, and most obviously perhaps, these models can make the life of researchers much, much easier when used as access/retrieval tools. At present, a large number of texts have been digitized at the level of form, enabling retrieval of words/phrases/sentences, which can subsequently be interpreted/annotated manually. With the latest neural language models (e.g. ULMFiT, ELMo, BERT), however, it is also possible to ‘digitize’ the subtle and complex denotations and connotations of word tokens and even abstract sentence structures by capturing them in contextualized word/phrase/sentence/paragraph-token embeddings (thus resolving the issues that type-based models such as Word2Vec and GloVe face with homonymy and polysemy, cf. Desagulier 2019).

Importantly, these contextualised embeddings enable semantic (‘onomasiological’) searches within textual material: after setting out a semantic domain, researchers can retrieve and explore all words and other linguistic forms in that domain (e.g. ‘Which words/phrases express [facets of conservative nationalism]/[that an event takes place in the future]?’) without necessarily making an a priori selection of forms. Furthermore, the embeddings also facilitate semasiological tasks, as they perform excellently in sense disambiguation, semantic role labelling, and recognising abstract grammatical patterns that are notoriously difficult to retrieve with structural searches alone (e.g. cleft-structures).
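
To give a rough idea of what such an onomasiological query could look like in practice, here is a minimal sketch using a generic pre-trained BERT via the Hugging Face transformers library. The seed words, carrier sentence and mini-corpus below are placeholders rather than a recommended setup; in actual research one would work with token embeddings extracted from the corpus under study.

```python
# Minimal sketch: retrieve corpus tokens whose contextual embeddings lie close
# to a small seed set defining a semantic domain. The model name, seed words
# and the tiny 'corpus' are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_embeddings(sentence):
    """Return (token, vector) pairs for one sentence."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return list(zip(tokens, hidden))

# Embed the seed words in a neutral carrier sentence to get a domain centroid
# (here: futurity, purely for illustration).
seeds = ["tomorrow", "soon", "later"]
seed_vecs = [vec for s in seeds
             for t, vec in token_embeddings(f"It will happen {s}.") if t == s]
centroid = torch.stack(seed_vecs).mean(dim=0)

# Rank corpus tokens by cosine similarity to the domain centroid.
corpus = ["The meeting is scheduled for next week.",
          "She walked home yesterday."]
hits = []
for sent in corpus:
    for t, vec in token_embeddings(sent):
        sim = torch.cosine_similarity(vec, centroid, dim=0).item()
        hits.append((sim, t, sent))
for sim, t, sent in sorted(hits, reverse=True)[:10]:
    print(f"{sim:.3f}  {t:12}  {sent}")
```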

At the same time, some recent overview articles have also suggested that computational distributional semantics could be of particular relevance for linguistic analysis and for furthering linguistic theory (e.g. Boleda 2020). Yet, while there is no question that distributional semantic models of all types can and do serve as invaluable retrieval tools, their value as analytic tools may not be as unambiguously accepted.

I would like to open a discussion between researchers and research teams (within text-based Humanities disciplines as well as Computer Science) who have recently worked with contextualized word embeddings or ‘token-vectors’. In particular, this forum invites thoughts on:

  • the amount of data necessary to pre-train different neural language models;
  • the specifics of balancing (i.e. removing bias from) and pre-processing the training material;
  • the computational resources needed to successfully pre-train these models;
  • if experienced with diachronic/historical data: issues of periodization and spelling normalization;
  • issues of transparency, reliability, interpretability, and theoretical soundness of neural language models (in particular as compared to the ‘count models’ described in Baroni et al. 2014; for an interesting, recent dissertation on the use of type- and token-based count-vectors, see De Pascale 2019, and the output of the Nephological Semantics research team).

In sum, I hope this forum will grow into a reference point – or perhaps even a kind of annotated bibliography – for (Computational) Humanities researchers interested in reading up on the possibilities and problems of employing predictive or count distributional semantic models in their research. Please feel free to add your thoughts, summarise your achievements, and refer to your (favourite) publications.

Note:
Operationalizing research involving interpretation and analysis of textual material can be considered particularly challenging in historical research because, if the interpretation of linguistic/textual material is done exclusively introspectively, it will inevitably be done from a modern/present-day point of view. That such anachronistic interpretations are a non-trivial issue becomes evident when considering that the past is a foreign country, and they do things differently there:

  1. there are substantial differences between the way in which concepts such as class, gender, norms and prestige function in different cultures and time periods;
  2. there is a growing amount of criticism regarding the extent to which the denotations and connotations of words and phrases in historical texts can successfully be derived through introspective judgements by present-day language users.

Given these concerns, it has become essential to explore/develop new methodologies and tools that aid in the semantic analysis of corpus data and allow us to approach the study of historical semantics in a more unbiased, data-driven way; it is precisely in this field that the use of distributional semantic models becomes particularly interesting (e.g. Sagi et al. 2011; Jenset 2013; Hilpert & Correia Saavedra 2018; and the many studies conducted by Budts as part of her PhD research). Yet, it is also precisely in this field that issues of data availability and model transparency are particularly prominent, which may complicate reliance on context-predicting, neural models. I particularly encourage discussions on the use of distributional semantic models in historical text-based research, and research into language change.

And, of course, before I forget:
References
Baroni, M., Dinu, G. & G. Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 238-247.
Boleda, G. 2020. Distributional semantics and linguistic theory. Annual Review of Linguistics 6, 1-22.
De Pascale, S. 2019. Token-based vector space models as semantic control in lexical lectometry. PhD Dissertation, KU Leuven.
Desagulier, G. 2019. Can word vectors help corpus linguists? Studia Neophilologica.
Hilpert, M. & D. Correia Saavedra. 2018. Using token-based semantic vector spaces for corpus-linguistic analyses: From practical applications to tests of theoretical claims. Corpus Linguistics and Linguistic Theory 22(3), 357-380.
Jenset, G. B. 2013. Mapping meaning with distributional methods. A diachronic corpus-based study of existential there. Journal of Historical Linguistics 3(2). 272–306.
Sagi, E., Kaufmann, S. & B. Clark. 2011. Tracing semantic change with Latent Semantic Analysis. In K. Allan & J. Robinson (eds.), Current methods in historical linguistics, 161-183. Berlin: Mouton De Gruyter.

Model papers
Howard, J. & S. Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. https://arxiv.org/abs/1801.06146
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. & L. Zettlemoyer. 2018. ELMo: Deep Contextualized Word Representations. https://allennlp.org/elmo
Devlin, J., Chang, M.-W., Lee, K. & K. Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/pdf/1810.04805.pdf

6 Likes

Welcome to the forum, @l.fonteyn! The use of Neural Language Models is interesting for most people here, so thanks for addressing it! Like you, I’m also very much interested in the application of these new Neural Language Models to historical material. I guess one of the biggest stumbling blocks in applying these models to historical material is, as you yourself indicate, the great diversity in textual material, and in particular the amount of spelling variation. If I recall correctly, the RoBERTa architecture uses some kind of ‘Byte Pair Encoding’ (BPE), which might be a way to deal with spelling variation, but I have no experience with it and no idea how successful it is. At least for Dutch, this tokenisation strategy seems to be pretty successful.
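
For what it’s worth, the basic idea is easy to see in a toy example (nothing RoBERTa-specific here; I’m just using a generic BPE tokenizer from the transformers library, and the spelling variants are invented):

```python
# Toy illustration: a subword (BPE) tokenizer breaks spelling variants into
# shared pieces instead of treating them as unrelated unknown words.
# The variants below are invented; any BPE-based model could be substituted.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")

for form in ["government", "gouvernement", "gouernement"]:
    print(f"{form:15} -> {tok.tokenize(form)}")

# A variant that is not in the vocabulary still decomposes into subword units,
# several of which it shares with the modern spelling; that overlap is what
# lets the model relate the two forms at all.
```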

1 Like

A related thought. I don’t have a lot of experience working with BERT, but when I used word embeddings, I usually used them not to capture generic language use, but specific uses/biases in language related to gender or political communities. In computational linguistics, people often try to remove such biases introduced by particular corpora. From a historian’s perspective, it’s exactly these biases that are of interest. Word2vec, however, is not fine-grained enough to capture biases expressed in contextual differences.

I wonder if BERT, which is now often used to create models that perform well on a range of language tasks, would be able to deal with specific instances of language use.

Let’s say we take an existing model and fine-tune it with new data: a corpus of Marxist texts from the 1920s. If I’m then interested in understanding specific, deeply contextual uses of words, and how these differ from more generic uses, how would one go about doing this with BERT?
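
To make the question a bit more concrete, this is roughly the pipeline I have in mind (the file name, hyperparameters and probe sentence are all placeholders, and I’m not at all sure this is the right way to go about it):

```python
# Rough sketch: continue masked-LM training of an existing BERT on a small
# domain corpus ("marxist_1920s.txt" is a made-up file name), then compare the
# contextual embedding of one word in the adapted vs. the generic model.
import torch
from datasets import load_dataset
from transformers import (AutoModel, AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
mlm = AutoModelForMaskedLM.from_pretrained(name)

data = load_dataset("text", data_files={"train": "marxist_1920s.txt"})
data = data.map(lambda batch: tok(batch["text"], truncation=True, max_length=128),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=mlm,
    args=TrainingArguments(output_dir="bert-marxist", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("bert-marxist")

# Compare how the generic and the adapted model represent "value" in context.
def embed(model_name, sentence, word):
    model = AutoModel.from_pretrained(model_name)
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    idx = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist()).index(word)
    return hidden[idx]

sent = "The value of a commodity is determined by labour."
generic = embed(name, sent, "value")
adapted = embed("bert-marxist", sent, "value")
print(torch.cosine_similarity(generic, adapted, dim=0).item())
```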

Will these specific uses even be represented in the model, or will the model be too attuned to these texts (basically the bias-variance trade-off)? I can imagine that you could encode a specific vector that represents the Marxist corpus. Again, I’m curious to hear how others would approach this.

1 Like

You probably know this paper: https://nlp.stanford.edu/projects/histwords/
Maybe the Procrustes approach could be applied in the same way to sentence embeddings (instead of word embeddings)?
(Ryan Heuser has a very nice gist implementing this kind of alignment.)
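
Not Ryan’s gist verbatim, but the core alignment step is small enough to sketch here; it assumes two row-aligned embedding matrices over a shared vocabulary, and in principle nothing stops you from feeding it sentence embeddings instead of word embeddings:

```python
# Core of the histwords-style alignment: find the orthogonal matrix that maps
# one embedding space onto another over a shared, row-aligned vocabulary.
# A and B stand in for (n_items, dim) matrices from two periods.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 300))   # e.g. embeddings trained on period 1
B = rng.normal(size=(1000, 300))   # e.g. embeddings trained on period 2

# Mean-centre (and, if you like, length-normalise) before solving.
A -= A.mean(axis=0)
B -= B.mean(axis=0)

R, _ = orthogonal_procrustes(A, B)   # R minimises ||A @ R - B||_F
A_aligned = A @ R

# After alignment, the cosine distance between an item's row in A_aligned and
# its row in B can be read as an (approximate) change signal.
```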

3 Likes

Exactly. I’ve worked with this technique, as well as the one proposed by Kim et al., which uses an existing model as a starting point for new training with small learning rates. I was indeed wondering to what extent this could be extended to sentence embeddings.
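
In gensim terms, my understanding of the Kim et al. setup is roughly the following (the toy corpora and learning rates are just placeholders):

```python
# Sketch of the Kim et al. idea in gensim terms: train on period 1, then keep
# the same model (so the spaces stay comparable) and continue training on
# period 2 with small learning rates.
from gensim.models import Word2Vec

period1 = [["the", "king", "rules"], ["the", "queen", "rules"]]
period2 = [["the", "party", "rules"], ["the", "state", "decides"]]

model = Word2Vec(sentences=period1, vector_size=100, window=5,
                 min_count=1, epochs=5, seed=42)

# Continue training on the later period, starting from the earlier weights.
model.build_vocab(period2, update=True)                  # add any new words
model.train(period2, total_examples=len(period2),
            epochs=5, start_alpha=0.005, end_alpha=0.001)
```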

1 Like

About Kim et al.: it’s been shown to be very noisy, at least for type embeddings in German (newspapers, link) and English (Twitter, link). We’ve also shown recently (on COHA) that Temporal Referencing is less noisy than Hamilton’s orthogonal-Procrustes (OP) method. It’s a bit of a pain to train in its current implementation, though.

We’re currently working on a relatively large comparison of different user-submitted methods as part of our SemEval task summary paper, with the caveats that it covers only two time periods and focuses on words, not sentences. But it is human-annotated, and in four languages.

5 Likes

Someone tweeted this at me today; I suppose that answers your question!

Analysing Lexical Semantic Change with Contextualised Word Representations
Mario Giulianelli, Marco Del Tredici & Raquel Fernández
https://arxiv.org/pdf/2004.14118.pdf

I think this is really cool stuff, but the study uses pre-trained BERT (and only considers recent change, 1960 to present, in COHA; but perhaps I misread that). Using the PDE-trained (present-day English) version of BERT on (much) older data hasn’t been a success in my experience (but we’re cooking up something to sort that! :slightly_smiling_face:)

The same authors have submitted something to the SemEval task (is this the one Simon mentioned?):

https://arxiv.org/pdf/2005.00050.pdf

So in response to the fine-tuning question the answer appears to be: yes!

I’ve also come across these studies by Pia Sommerauer and Antske Fokkens:
https://www.aclweb.org/anthology/W19-4728/
https://research.vu.nl/en/publications/firearms-and-tigers-are-dangerous-kitchen-knives-and-zebras-are-n

The second paper really intrigues me, and I’ve run into related observations studying the spatial configurations of prepositions: if we are going to use distributional semantics to talk about pathways of change, it should be taken into account that these models are great at capturing linguistic representations, but they have no perceptual experience. Looking at this wearing my Cognitive Linguist hat, I’d say this could mean that these models are good at finding semantic differences (they are, for instance, great at recognising metaphorical extension), but they may not tell you anything about the how and why of that metaphorical extension (which will often be grounded in perceptual similarities between concepts that are not always straightforwardly captured in the linguistic context).

1 Like

Thanks! I’ll definitely check out the first paper :smile:

On the second paper, I think the authors raise valuable points about working with embeddings, but you’re right about the hows and whys. Even when contrasting changes found in models with external data, it remains difficult to gauge correlation, let alone causality (if that’s even possible).

If I put on my very scruffy, old cognitive-psychologist hat, I wonder whether we can apply formal cognitive models to changes captured in distributional models.

That is an interesting point (perhaps one for a new topic!).

I suppose this is somewhat related: I’ve read some very interesting reflections on what meaning representation is, and it is certainly possible that distributional models only capture one (or a few) dimensions of a complex conglomerate of phenomena. I’ve been working a little bit on mapping the output of theoretical models of meaning representation onto the output of distributional models (with respect to grammatical meaning). I think it’s essential that we explore such things further. I’ve intuitively understood that the method itself merely approximates (and does not equate to) meaning (unlike what Firth intended), but the fact that it does not (always) straightforwardly approximate plausible pathways of change (in the sense of: what is the most likely source of metaphorical extension X) is something that may not be immediately evident.

1 Like

Cool, thanks for posting this! And good luck with (reviewing) the SemEval task!

1 Like

Has anyone seen a convincing example of how to evaluate a word embedding model trained on historical texts? The usual sources of external evaluation are obviously not there, so how can we decide which of two models representing, let’s say, printed German of the second half of the 18th century is better? I know people have trained models and shown specific examples which look convincing, but how could we do a systematic evaluation?

1 Like

This is a great question!
I can only answer this from a linguist’s perspective, but yeah: there are no native speakers of 18th-century German to help you evaluate a model, so I suppose the next best thing is a (bunch of) expert scholars proposing a standard that should be met. If you’re just using a model to help you with large-scale retrieval (say, of a particular subsense of a word, or a subordinate clause pattern that cannot be found with simple form searches), this shouldn’t be a problem, as you’re simply using such models for a task you would otherwise have done manually. I’ve used embeddings in this way before, but it does bug me a little bit that this doesn’t entirely take the analyst out of the equation. A more intrinsic form of evaluating models (and the only one I’m aware of) would be studying their perplexity values.
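For what that could look like in practice, here is a minimal sketch with a causal language model from the transformers library; the model names and the held-out file are placeholders, and for masked models like BERT you would need a pseudo-perplexity instead:

```python
# Minimal sketch: compare language models by their perplexity on the same
# held-out historical text. Model names and the text file are placeholders;
# this works for causal LMs only (BERT-style masked models would need a
# pseudo-perplexity instead).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name, text):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    # For real data you would chunk the text instead of truncating it.
    enc = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss   # mean NLL per token
    return math.exp(loss.item())

heldout = open("heldout_18c_german.txt").read()             # placeholder file
for name in ["model_A", "model_B"]:                         # placeholder models
    print(name, perplexity(name, heldout))
```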
Does anyone else have any thoughts on this?

1 Like


From what you describe it’s not a huge corpus. You could maybe do something à la Antoniak and Mimno (https://mimno.infosci.cornell.edu/papers/antoniak-stability.pdf): generate some interesting words with an LDA topic model, train a bunch of word embedding models, query the resulting models for those words, calculate deviation. It would not tell you which of your models is the best, but if you have a low deviation that’s a good hint that at least your models are properly trained.

Another solution would be to look at words you are more or less certain of: here we look at cardinal points and see if the nearest neighbours make sense (ideally the NNs for “East” should always be North, West and South). Obviously this is a simple example and what @l.fonteyn wrote above is better (if I understand correctly, the creation of a SimLex-like test set?), but it’s a good first sanity check if you don’t have many experts on hand.
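
A quick sketch of both checks combined, using gensim (the toy corpus and probe words are placeholders, and the neighbour-overlap count is only a crude stand-in for the deviation measure Antoniak and Mimno compute):

```python
# Sketch of both checks: train several word2vec models with different seeds,
# then (a) see how stable the nearest neighbours of some probe words are
# across runs, and (b) eyeball whether words you are fairly sure about (e.g.
# cardinal points) get plausible neighbours. 'corpus' is a placeholder list
# of tokenised sentences; with data this small the numbers are meaningless.
from gensim.models import Word2Vec

corpus = [["the", "wind", "blew", "from", "the", "east"],
          ["they", "rode", "north", "towards", "the", "coast"]]
probes = ["east", "north"]

models = [Word2Vec(corpus, vector_size=100, min_count=1, epochs=5, seed=s)
          for s in range(5)]

for w in probes:
    neighbour_sets = [set(t for t, _ in m.wv.most_similar(w, topn=10))
                      for m in models]
    shared = set.intersection(*neighbour_sets)
    print(w, "neighbours shared by all runs:", sorted(shared))
```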

1 Like

Language models can be evaluated either intrinsically or extrinsically. Intrinsic evaluation would be through perplexity or word similarity (https://fh295.github.io/simlex.html). Extrinsic evaluation would be on downstream tasks.
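
In code, the word-similarity variant boils down to something like this (the model path is a placeholder, and I’m quoting the SimLex-999 column names from memory):

```python
# Sketch of the word-similarity style of intrinsic evaluation: correlate the
# model's cosine similarities with human ratings (here SimLex-999; for
# historical data one would need an expert-built rating set instead).
# The model path and the rating-file column names are assumptions.
import pandas as pd
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

wv = KeyedVectors.load("my_period_model.kv")          # placeholder model
ratings = pd.read_csv("SimLex-999.txt", sep="\t")     # word1, word2, SimLex999

model_sims, human_sims = [], []
for _, row in ratings.iterrows():
    if row["word1"] in wv and row["word2"] in wv:
        model_sims.append(wv.similarity(row["word1"], row["word2"]))
        human_sims.append(row["SimLex999"])

rho, p = spearmanr(model_sims, human_sims)
print(f"Spearman rho = {rho:.3f} over {len(model_sims)} pairs (p = {p:.3g})")
```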

For other types of models you can use something like the Akaike Information Criterion (AIC).
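
For completeness, the criterion itself is easy to write down; the open question is what to plug in for a neural LM (this is just the textbook definition, nothing LM-specific):

```python
def aic(n_params: int, log_likelihood: float) -> float:
    """Akaike Information Criterion: 2k - 2*ln(L_hat); lower is better."""
    return 2 * n_params - 2 * log_likelihood

# For a language model, the corpus log-likelihood is the total (not mean)
# negative log-likelihood with the sign flipped, i.e. roughly
# log_likelihood = -n_tokens * ln(perplexity). Whether k (millions or billions
# of weights) makes the comparison meaningful is exactly the open question.
```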



AIC was used here:

Not really sure how to use that with LMs though …

1 Like

Self-plug, but the SemEval paper is finally out: https://arxiv.org/abs/2007.11464

tl;dr: contextual embeddings don’t seem to work as well as type embeddings on this task, with human-annotated data

3 Likes