This post is an edited version of a workshop, which was scheduled to take place in June 2020 at Leiden University. If you are interested in the discussion, feel free to contact me (Lauren) to be updated on the rescheduled workshop date.
The recent development and release of neural language models that create so-called ‘contextualized token-embeddings’ has been considered revolutionary in Computer Science/NLP and the industry. Yet, far out in the uncharted backwaters of the unfashionable end of the galaxy, there is a small but steadily growing group of people who have started to wonder whether the application of these models could also support and advance text-based humanities research, and, in particular, linguistic research.
First, and most obviously perhaps, these models can make researchers' lives much easier when used as access/retrieval tools. At present, a large number of texts have been digitized formally, enabling retrieval of words/phrases/sentences, which can subsequently be interpreted/annotated manually. Yet, with the latest neural language models (e.g. ULMFiT, ELMo, BERT), it is also possible to ‘digitize’ the subtle and complex denotations and connotations of word tokens and even abstract sentence structures by capturing them in contextualized word/phrase/sentence/paragraph-token embeddings (thus resolving the issues that type-based models such as Word2Vec and GloVe face with homonymy and polysemy, cf. Desagulier 2019).
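To make the contrast between type-based and token-based representations a bit more concrete, here is a minimal Python sketch. It does not use ULMFiT, ELMo or BERT: as a deliberately simple stand-in, it builds bag-of-context count vectors. The example sentences, the window size and all function names are assumptions made purely for illustration.

```python
from collections import Counter
import math

# Toy sentences (invented): two financial uses and one riverside use of "bank".
SENTENCES = [
    "she deposited money at the bank",
    "he withdrew money from the bank",
    "they fished along the river bank",
]

def token_vector(tokens, index, window=3):
    """Count the words within `window` positions of ONE occurrence."""
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    return Counter(tokens[j] for j in range(lo, hi) if j != index)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Token-based: one vector per occurrence of "bank".
token_vectors = []
for sentence in SENTENCES:
    tokens = sentence.split()
    token_vectors.append(token_vector(tokens, tokens.index("bank")))

# Type-based: a single vector for "bank", summed over all occurrences,
# so the financial and riverside contexts end up mixed together.
type_vector = sum(token_vectors, Counter())

sim_same_sense = cosine(token_vectors[0], token_vectors[1])
sim_diff_sense = cosine(token_vectors[0], token_vectors[2])
print(f"financial vs financial: {sim_same_sense:.2f}")
print(f"financial vs riverside: {sim_diff_sense:.2f}")
print("type vector:", dict(type_vector))
```

In this toy data the two financial tokens come out closer to each other than to the riverside token, whereas the single type vector contains both "money" and "river". Neural models replace the raw counts with learned, dense vectors, but the type/token distinction is the same.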
Importantly, these contextualised embeddings enable semantic (‘onomasiological’) searches within textual material: after setting out a semantic domain, researchers can retrieve and explore all words and other linguistic forms in that domain (e.g. ‘Which words/phrases express [facets of conservative nationalism]/[that an event takes place in the future]?’) without necessarily making an a priori selection of forms. Furthermore, the embeddings also facilitate semasiological tasks, as they perform excellently in sense disambiguation, semantic role labelling, and recognising abstract grammatical patterns that are notoriously difficult to retrieve with structural searches alone (e.g. cleft-structures).
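As a rough illustration of what such an onomasiological search could look like, the sketch below treats a semantic domain as the centroid of a few seed tokens and ranks all other tokens by cosine similarity to it. The three-dimensional vectors are invented by hand; in practice they would come from a contextualized model, and the names `centroid` and `semantic_search` are hypothetical helpers rather than part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def semantic_search(query, candidates, top_k=3):
    """Rank (form, context, vector) candidates by similarity to the query."""
    scored = [(form, context, cosine(query, vec)) for form, context, vec in candidates]
    return sorted(scored, key=lambda item: item[2], reverse=True)[:top_k]

# Seed tokens that set out the domain "event takes place in the future".
# All vectors below are made up for the example.
SEEDS = [
    ("will", "she will leave tomorrow", [0.9, 0.1, 0.0]),
    ("going to", "it is going to rain", [0.8, 0.2, 0.1]),
]

# Tokens to search through; no a priori selection of forms is made.
CANDIDATES = [
    ("shall", "we shall see", [0.85, 0.05, 0.1]),
    ("will", "he read his father's will", [0.0, 0.9, 0.3]),
    ("yesterday", "they left yesterday", [0.1, 0.2, 0.9]),
]

query = centroid([vec for _, _, vec in SEEDS])
ranked = semantic_search(query, CANDIDATES)
for form, context, score in ranked:
    print(f"{score:.2f}  {form!r} in: {context}")
```

With these invented vectors, "shall" ranks first and the noun "will" ranks low even though it shares its form with one of the seeds, which is the behaviour a form-based search cannot deliver.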
At the same time, some recent overview articles have also suggested that computational distributional semantics could be particularly relevant for linguistic analysis and for furthering linguistic theory (e.g. Boleda 2020). Yet, while there is no question that distributional semantic models of all types can and do serve as invaluable retrieval tools, their value as analytic tools may not be universally accepted.
I would like to open a discussion between researchers and research teams (within text-based Humanities disciplines as well as Computer Science) who have recently worked with contextualized word embeddings or ‘token-vectors’. In particular, this forum invites thoughts on:
- the amount of data necessary to pre-train different neural language models;
- specificities of balancing (i.e. removing bias) and pre-processing the training material;
- the computational resources needed to successfully pre-train these models;
- if experienced with diachronic/historical data: issues of periodization and spelling normalization;
- issues of transparency, reliability, interpretability, and theoretical soundness of neural language models (in particular as compared to ‘count models’, as described in Baroni et al. 2014; for an interesting, recent dissertation about the use of type- and token-based count-vectors, see De Pascale 2019, and the output of the Nephological Semantics research team).
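For readers less familiar with the ‘count’ side of that comparison, the sketch below shows one common variant of a count model: raw co-occurrence counts within a window, reweighted with positive pointwise mutual information (PPMI). The corpus, the window size and the function names are assumptions for illustration, and this is not meant to reproduce the exact setup of any of the studies cited above.

```python
from collections import Counter, defaultdict
import math

# Tiny invented corpus.
CORPUS = [
    "the king rules the land",
    "the queen rules the land",
    "the dog chases the cat",
]

def cooccurrence_counts(sentences, window=2):
    """Count how often each word occurs within `window` positions of another."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def ppmi(counts):
    """Reweight counts as max(0, log2(P(w, c) / (P(w) * P(c))))."""
    total = sum(sum(ctx.values()) for ctx in counts.values())
    word_totals = {w: sum(ctx.values()) for w, ctx in counts.items()}
    context_totals = Counter()
    for ctx in counts.values():
        context_totals.update(ctx)
    weighted = {}
    for word, ctx in counts.items():
        weighted[word] = {}
        for context, n in ctx.items():
            pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
            if pmi > 0:  # keep only positive associations
                weighted[word][context] = pmi
    return weighted

counts = cooccurrence_counts(CORPUS)
vectors = ppmi(counts)
print(vectors["king"])
```

Every dimension of such a vector is a named context word with an inspectable weight, which is one reason count models are often regarded as more transparent than the dense vectors produced by neural models.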
In sum, I hope this forum will grow into a reference point – or perhaps even a kind of annotated bibliography – for (Computational) Humanities researchers interested in reading up on the possibilities and problems of employing predictive or count distributional semantic models in their research. Please feel free to add your thoughts, summarise your achievements, and refer to your (favourite) publications.
Note:
Operationalizing research involving interpretation and analysis of textual material can be considered particularly challenging in historical research because, if the interpretation of linguistic/textual material is exclusively done introspectively, it will inevitably be done from a modern/present-day point-of-view. That such anachronistic interpretations are a non-trivial issue becomes evident when considering that the past is a different country, and they do things differently there:
- there are substantial differences between the way in which concepts such as class, gender, norms and prestige function in different cultures and time periods;
- there is a growing amount of criticism regarding the extent to which the de- and connotations of words and phrases in historical texts can successfully be derived through introspective judgements by present-day language users.
Given these concerns, it has become essential to explore and develop new methodologies and tools that aid in the semantic analysis of corpus data, so that historical semantics can be studied in a less biased, more data-driven way. It is precisely in this field that the use of distributional semantic models becomes particularly interesting (e.g. Sagi et al. 2011; Jenset 2013; Hilpert & Correia Saavedra 2018; and the many studies conducted by Budts as part of her PhD research). Yet, it is also precisely in this field that issues of data availability and model transparency are particularly prominent, which may make it problematic to rely on context-predicting, neural models. I particularly encourage discussions on the use of distributional semantic models in historical text-based research, and research into language change.
And, of course, before I forget:
References
Baroni, M., Dinu, G. & G. Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 238-247.
Boleda, G. 2020. Distributional Semantics and Linguistic Theory. Annu. Rev. Linguist. 6, 1-22.
De Pascale, S. 2019. Token-based vector space models as semantic control in lexical lectometry. PhD Dissertation, KU Leuven.
Desagulier, G. 2019. Can word vectors help corpus linguists? Studia Neuphilologica.
Hilpert, M. & D. Correia Saavedra. 2018. Using token-based semantic vector spaces for corpus-linguistic analyses: From practical applications to tests of theoretical claims. Corpus Linguistics and Linguistic Theory 22(3), 357-380.
Jenset, G. B. 2013. Mapping meaning with distributional methods. A diachronic corpus-based study of existential there. Journal of Historical Linguistics 3(2), 272-306.
Sagi, E., Kaufmann, S. & B. Clark. 2011. Tracing semantic change with Latent Semantic Analysis. In K. Allan & J. Robinson (eds.), Current methods in historical linguistics, 161-183. Berlin: Mouton De Gruyter.
Model papers
Howard, J. & S. Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. https://arxiv.org/abs/1801.06146
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. & L. Zettlemoyer. 2018. ELMo: Deep Contextualized Word Representations. https://allennlp.org/elmo
Devlin, J., Chang, M.-W., Lee, K. & K. Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/pdf/1810.04805.pdf