Speaker: Thora Hagen
Affiliation: University of Würzburg
Title: Semantic enrichment of type-based word embeddings for small text corpora
Abstract: One of the most important steps in language modeling is preparing a sufficiently large text corpus. As language modeling has become more popular in the digital humanities as well, it has become apparent that exactly this hurdle can be very high for researchers who want to use traditional language modeling methods. This contribution demonstrates how to improve the semantic quality of pre-trained word embeddings that have been built from a smaller English text corpus by incorporating structured knowledge (synonyms, antonyms, and hypernyms). Using the state-of-the-art method GLEN (Generalized Lexical ENtailment model, Glavaš and Vulić 2019), the experiment shows mixed results: while semantic similarity and lexical entailment improve, performance on semantic relatedness falls behind in this setting. However, limiting the training data to synonyms only improves semantic relatedness as well, which implies that using more, but more diverse, training data is not always advisable when trying to improve the semantic quality of a smaller word embedding model, as this goal merges a variety of semantic aspects.