Spelling variation in historical language

Hi all,

I am curious what suggestions you have on the normalization of historical variants of language. I am working with a digitized corpus (roughly 50,000 pages) of manuscripts in the Dutch language, written between 1500 and 1850. There is a large spelling variation within the corpus, due to not only its diachronicity, but also to differences in literacy of the authors, and geographical variation (from Ghent to Groningen). The result is a corpus that is quite messy regarding its spelling, which complicates the process of computational text analysis and the adaptation of NLP tools to the corpus.

Your suggestions might be reading recommendations, tools, dictionaries, corpora, other research projects, etc., or maybe even methods to work around the problem of spelling variation instead of trying to solve it.

Thanks!
Alie

3 Likes

I remember this paper talking about using VARD2.

I've used the following method for historical English, and for my purposes it worked OK. It might be useful for Dutch as well.

Also, you might get quite far with regular expressions.
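A minimal sketch of what the regular-expression route could look like, assuming a small hand-written list of rewrite rules. The rules below are illustrative guesses for early modern Dutch, not a vetted list; applied blindly they will also rewrite words that should be left alone, so you would want to check them against your own corpus.

```python
import re

# Hypothetical rewrite rules (pattern, replacement), applied in order.
# They are examples only and would need tuning on real data.
RULES = [
    (re.compile(r"ae"), "aa"),    # maer -> maar
    (re.compile(r"uy"), "ui"),    # huys -> huis
    (re.compile(r"ck\b"), "k"),   # ick  -> ik (word-final only)
    (re.compile(r"gh"), "g"),     # ghent -> gent
]


def normalize(token):
    """Lower-case a token and apply every rewrite rule in sequence."""
    result = token.lower()
    for pattern, replacement in RULES:
        result = pattern.sub(replacement, result)
    return result


examples = [normalize(t) for t in ["maer", "ick", "huys", "Ghent"]]
```

Keeping the rules in an ordered list makes it easy to add, remove, or reorder them while you inspect what each one changes.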

3 Likes

Interesting suggestion to use vector space models. I remember from applying FastText that the top neighbors of words were quite often spelling variations. Using that might be a relatively cheap and straightforward way to solve (part of) your problem.

3 Likes

Specifically for Dutch there is this CLIN task: Wayback Machine

There is a large spelling variation within the corpus, due to not only its diachronicity, but also to differences in literacy of the authors, and geographical variation (from Ghent to Groningen).

This looks like a very difficult problem to solve, though.

We did something very similar to generate pairs for this paper; those pairs were then used to train an NMT system. It's also implemented in the natas library linked by Melvin above. We're currently experimenting with it on Finnish, and it seems to be """solving""" OCR post-correction and spelling normalisation at the same time. Note that this is an eyeballed quality estimate, not a proper evaluation.
I guess that, as a first step, something like what is described in our paper ("must be in most_similar()" + "must not exist in a dictionary" + "must have a Levenshtein distance < 3") could work quite OK.
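A rough sketch of those three criteria in plain Python. It is not the code from the paper or from natas. The `neighbours` dict is a hard-coded stand-in for what `most_similar()` would return from an embedding model, the `dictionary` set is a toy word list, and reading the rule as "the variant must be absent from the dictionary while the target is present" is my assumption.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def candidate_pairs(neighbours, dictionary, max_distance=3):
    """Return (variant, modern_form) pairs that pass all three filters."""
    pairs = []
    for word, similar in neighbours.items():
        if word not in dictionary:
            continue  # assumption: only dictionary words serve as targets
        for candidate in similar:
            if candidate in dictionary:
                continue  # a known word is probably not a spelling variant
            if levenshtein(word, candidate) < max_distance:
                pairs.append((candidate, word))
    return pairs


# Toy data standing in for embedding neighbours and a modern word list.
neighbours = {
    "maar": ["maer", "doch", "mer", "echter"],
    "huis": ["huys", "woning", "huijs"],
}
dictionary = {"maar", "doch", "echter", "huis", "woning"}
pairs = candidate_pairs(neighbours, dictionary)
```

With this toy input, the synonyms ("doch", "echter", "woning") are dropped by the dictionary check, and only the close non-dictionary spellings survive as candidate pairs.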

4 Likes

Thank y'all for thinking along! I had already worked with the VARD2 tool and knew about the CLIN task; your other suggestions are new to me, so I'm gonna check them out. In the meantime, other tips and tricks are still welcome :)

1 Like

I don't think this paper has been mentioned above; this rather text-heavy poster by the same guy might also be helpful. They deal with similar problems: small corpora, mad variation.

4 Likes

For spelling normalization, check the work done by Bollmann: https://scholar.google.be/citations?hl=en&user=l3pm9QkAAAAJ&view_op=list_works&sortby=pubdate

If you have a particular task in mind, you might get better results by applying a variation-aware approach to that task rather than applying a standard approach to the normalized corpus. I have worked on lemmatization of historical languages (including Dutch), and our approach might help you get a lemmatized corpus (on which you can already apply a bunch of processing):

4 Likes

Hi Alie, thanks for your question! I've recently started a PhD project on language standardization in 17th-century Dutch (in Leiden/Nijmegen), so I'll definitely be following this thread, as I'll likely run into similar issues. I expect our aim with the corpus is rather different though: for me, the spelling variation could be part of what I'd be studying, whereas your research, I believe, focusses more on historical content, so in your case normalizing the corpus data could be less problematic. What I think is a rather elegant approach is the one taken in the 'Letters as Loot' corpus by LUCL/INT (http://brievenalsbuit.inl.nl/zeebrieven/page/search): they've kept the original spellings (so for instance both 'maar' and 'maer' for the Dutch word for 'but') and have added a 'lemma' in present-day Dutch spelling for the word they both refer to: 'maar'. However, adding lemmas might be too time-consuming for you and might not actually solve your issues. I'm curious to read more suggestions!
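To make the "keep the original, add a lemma" idea concrete, here is a small sketch of such an annotation layer. The `Token` class and `find_by_lemma` helper are hypothetical names for illustration and do not reflect how the Letters as Loot corpus is actually stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One corpus token: the spelling as written plus a modern lemma."""
    original: str
    lemma: str


def find_by_lemma(tokens, lemma):
    """Return the original spellings of all tokens sharing a lemma."""
    return [token.original for token in tokens if token.lemma == lemma]


# Toy annotated sentence fragment; the lemmas are assigned by hand here.
tokens = [
    Token("maer", "maar"),
    Token("ick", "ik"),
    Token("maar", "maar"),
]
hits = find_by_lemma(tokens, "maar")
```

A search on the lemma finds both spellings, while the source text itself stays untouched, so the variation remains available for anyone who wants to study it.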

4 Likes

(This is maybe an unpopular opinion, but for many (historic) languages orthographic normalisation actually isn't possible or desirable, because the language simply lacked an official standard at that time and it feels weird to impose one. Working with postag-lemma pairs to represent tokens might be a viable alternative that is also easier to implement.)

Apart from that, I can confirm that FastText works really well for clustering historic spelling variants. If you find a nice clustering threshold you could replace the tokens in a cluster by the most frequent member of the cluster?
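A minimal sketch of that replacement step, assuming the clusters already exist (for example from thresholding FastText similarities, which is not shown here). The function names are made up for illustration; ties in frequency are broken alphabetically so the result is deterministic.

```python
from collections import Counter


def build_mapping(clusters, counts):
    """Map every cluster member to the cluster's most frequent member."""
    mapping = {}
    for cluster in clusters:
        # sorted() first, so that ties resolve to the alphabetically first form
        canonical = max(sorted(cluster), key=lambda token: counts[token])
        for token in cluster:
            mapping[token] = canonical
    return mapping


def normalize_tokens(tokens, clusters):
    """Replace clustered tokens by their canonical form; keep the rest."""
    counts = Counter(tokens)
    mapping = build_mapping(clusters, counts)
    return [mapping.get(token, token) for token in tokens]


# Toy corpus and hand-made clusters standing in for embedding-based ones.
tokens = ["maer", "maar", "maar", "huys", "huis", "huis", "huis", "ende"]
clusters = [{"maar", "maer"}, {"huis", "huys"}]
normalized = normalize_tokens(tokens, clusters)
```

Note that the most frequent member is not necessarily the modern spelling; in an older or regional subcorpus a historical form may well win.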

And of course: +1 for PIE! :)

6 Likes