Semantic enrichment of type-based word embeddings for small text corpora

thora.hagen · November 16, 2022, 3:37pm

Speaker: Thora Hagen

Affiliation: University of Würzburg

Title: Semantic enrichment of type-based word embeddings for small text corpora

Abstract: One of the most important steps in language modeling is preparing a sufficiently large text corpus. As language modeling has become more popular in the digital humanities, it has become apparent that exactly this hurdle can be very high for researchers who want to use traditional language modeling methods. This contribution demonstrates how to improve the semantic quality of pre-trained word embeddings built from a smaller English text corpus by incorporating structured knowledge (synonyms, antonyms, and hypernyms). Using the state-of-the-art method GLEN (Generalized Lexical ENtailment model, Glavaš and Vulić 2019), the experiment shows mixed results: while semantic similarity and lexical entailment improve, performance on semantic relatedness falls behind in this setting. However, limiting the training data to synonyms alone improves semantic relatedness as well. This implies that using more, but more diverse, training data is not always advisable when trying to improve the semantic quality of a smaller word embedding model, since this goal combines a variety of semantic aspects.
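
For readers unfamiliar with the general idea of enriching embeddings with lexical constraints, here is a minimal toy sketch. It is not GLEN and not the setup used in the talk: it only illustrates the generic "pull synonyms together, push antonyms apart" intuition on hand-made vectors. The function name `nudge`, the `step` and `epochs` parameters, and the toy vectors are all hypothetical, and hypernyms are left out because they would need an asymmetric objective.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nudge(vectors, synonyms, antonyms, step=0.1, epochs=5):
    """Toy attract/repel update (hypothetical, NOT the GLEN objective).

    Synonym pairs are moved slightly towards each other, antonym pairs
    slightly away from each other. The input dict is not modified.
    """
    vecs = {word: normalize(vec) for word, vec in vectors.items()}
    for _ in range(epochs):
        for a, b in synonyms:
            va, vb = vecs[a], vecs[b]
            # Add a small fraction of the partner vector to each word.
            vecs[a] = normalize([x + step * y for x, y in zip(va, vb)])
            vecs[b] = normalize([y + step * x for x, y in zip(va, vb)])
        for a, b in antonyms:
            va, vb = vecs[a], vecs[b]
            # Subtract a small fraction of the partner vector from each word.
            vecs[a] = normalize([x - step * y for x, y in zip(va, vb)])
            vecs[b] = normalize([y - step * x for x, y in zip(va, vb)])
    return vecs


# Hand-made toy vectors, purely illustrative (not trained embeddings).
toy = {
    "happy": [1.0, 0.2, 0.0],
    "glad": [0.6, 0.8, 0.1],
    "hot": [0.1, 1.0, 0.3],
    "cold": [0.2, 0.9, 0.4],
}

tuned = nudge(toy, synonyms=[("happy", "glad")], antonyms=[("hot", "cold")])

for pair in [("happy", "glad"), ("hot", "cold")]:
    before = cosine(toy[pair[0]], toy[pair[1]])
    after = cosine(tuned[pair[0]], tuned[pair[1]])
    print(pair, round(before, 3), "->", round(after, 3))
```

In this toy the synonym pair ends up with a higher cosine similarity and the antonym pair with a lower one. Real specialization methods additionally have to preserve the rest of the vector space and generalize to words that never appear in the constraints, which this sketch does not attempt.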

Link to poster