I am curious what suggestions you have on the normalization of historical language variants. I am working with a digitized corpus (roughly 50,000 pages) of manuscripts in the Dutch language, written between 1500 and 1850. There is large spelling variation within the corpus, due not only to its diachronicity but also to differences in the authors' literacy and to geographical variation (from Ghent to Groningen). The result is a corpus that is quite messy in its spelling, which complicates computational text analysis and the adaptation of NLP tools to the corpus.
Your suggestions might be reading recommendations, tools, dictionaries, corpora, other research projects, etc., or maybe even methods to work around the problem of spelling variation instead of trying to solve it.
Interesting suggestion to use vector space models. I remember from applying FastText that the top neighbors of words were quite often spelling variations. Using that might be a relatively cheap and straightforward way to solve (part of) your problem.
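For example, a minimal gensim FastText sketch (the tokenization and the example words here are invented, just to illustrate the idea):

```python
from gensim.models import FastText

# sentences: the manuscript corpus as lists of (lower-cased) tokens
sentences = [
    ["dese", "brief", "is", "gheschreven", "te", "ghendt"],
    ["deze", "brief", "is", "geschreven", "te", "gent"],
    # ... the rest of the ~50,000 pages
]

# the subword n-grams are what make FastText useful here: variant spellings
# share most of their character n-grams and end up close in vector space
model = FastText(sentences, vector_size=100, window=5, min_count=5, epochs=10)

# the nearest neighbours of a word are often its spelling variants
print(model.wv.most_similar("geschreven", topn=10))
```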
Specifically for Dutch there is this CLIN shared task (the link goes to an archived copy on the Wayback Machine).
> There is large spelling variation within the corpus, due not only to its diachronicity but also to differences in the authors' literacy and to geographical variation (from Ghent to Groningen).
This looks like a very difficult problem to solve, though.
We did something very similar to generate pairs for this paper; those pairs were then used to train an NMT system. It's also implemented in the natas library linked by Melvin above. We're currently experimenting with it on Finnish and it seems to be "solving" OCR post-correction and spelling normalisation at the same time. Note this is an eye-balled quality estimate and not a proper evaluation.
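For reference, the natas calls look roughly like this (function names as in its README at the time of writing, so do check the current docs; the bundled models target English and OCR data, so for Dutch you would train your own normalization model on variant–modern pairs):

```python
# pip install natas  (it also needs a spaCy model, see the natas README)
import natas

# spelling normalisation: historical forms -> modern candidates
print(natas.normalize_words(["seacreat", "wiþe"]))

# OCR post-correction: noisy forms -> corrected candidates
print(natas.ocr_correct_words(["paradice", "loue"]))
```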
I guess as a first step, something like what is described in our paper ("must be in most_similar()" + "must not exist in a dictionary" + "must have a Levenshtein distance < 3") could work quite well.
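A rough sketch of that filter, sticking with a gensim FastText model and using a plain set of modern word forms as the "dictionary" (the names and thresholds are illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def variant_candidates(word, model, dictionary, topn=25):
    """Neighbours of `word` that pass the three conditions:
    returned by most_similar(), absent from the modern dictionary,
    and within Levenshtein distance < 3 of `word`."""
    hits = []
    for neighbour, sim in model.wv.most_similar(word, topn=topn):
        if neighbour not in dictionary and levenshtein(word, neighbour) < 3:
            hits.append((neighbour, sim))
    return hits
```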
Thank y'all for thinking along! I already worked with the VARD2 tool and knew about the CLIN task; your other suggestions are new to me, so I'm gonna check them out. In the meantime, other tips and tricks are still welcome.
I don't think this paper has been mentioned above; this rather text-heavy poster by the same author might also be helpful. They deal with similar problems: small corpora, lots of variation.
If you have a particular task in mind, it could be that you get better results by applying a variation-aware approach to that task rather than applying a standard approach to the normalized corpus. I have worked on the lemmatization of historical languages (including Dutch), and our approach might help you get a lemmatized corpus (on which you can already run quite a bit of processing):
Hi Alie, thanks for your question! I've recently started a PhD project on language standardization in 17th-century Dutch (in Leiden/Nijmegen), so I'll definitely be following this thread, as I'll likely run into similar issues. I expect our aims with the corpus are rather different, though: for me, the spelling variation could be part of what I'd be studying, whereas your research, I believe, focuses more on historical content, so in your case normalization of your corpus data could be less problematic.

What I think is a rather elegant approach is the one taken in the "Letters as Loot" corpus by LUCL/INT (http://brievenalsbuit.inl.nl/zeebrieven/page/search): they've kept the original spellings (so for instance both "maar" and "maer" for the Dutch word for "but") and have added a "lemma" in present-day Dutch spelling of the word they both refer to: "maar". However, adding lemmas might be too time-consuming for you and might actually not solve your issues. I'm curious to read more suggestions!
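Just to make the idea concrete, a toy sketch of that annotation scheme (original spelling kept, modern-spelling lemma added), so that searching by lemma finds every attested variant:

```python
# toy example of the "keep the original token, add a modern lemma" scheme
corpus = [
    {"token": "maer", "lemma": "maar"},
    {"token": "maar", "lemma": "maar"},
]

# a search on the lemma returns all attested spellings
print([t["token"] for t in corpus if t["lemma"] == "maar"])  # ['maer', 'maar']
```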
(This is maybe an unpopular opinion, but for many (historic) languages orthographic normalisation actually isn't possible or desirable, because the language simply lacked an official standard at the time, and it feels weird to impose one. Working with postag-lemma pairs to represent tokens might be a viable alternative that is also easier to implement.)
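As a rough sketch of what that token representation could look like, here is spaCy's modern Dutch model purely for illustration; it is trained on present-day Dutch, so its tags and lemmas on 16th–19th century manuscript spelling are not guaranteed to be usable, and a tagger/lemmatizer trained on historical Dutch would be preferable:

```python
# python -m spacy download nl_core_news_sm
import spacy

nlp = spacy.load("nl_core_news_sm")

# represent each token by a (POS tag, lemma) pair instead of its surface spelling
doc = nlp("maer dat is niet waar")
print([(tok.pos_, tok.lemma_) for tok in doc])
```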
Apart from that, I can confirm that FastText works really well for clustering historic spelling variants. If you find a nice clustering threshold, you could replace the tokens in a cluster with the most frequent member of the cluster?
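A greedy sketch of that replacement step, reusing the `sentences` and `model` from the FastText sketch earlier in the thread (the threshold and the walk over most_similar() are just one way to do it; in practice you would probably also add a Levenshtein check, as suggested above, to keep semantically similar but distinct words apart):

```python
from collections import Counter

def build_replacements(model, counts, sim_threshold=0.85, topn=20):
    """Map every word type to the most frequent member of its cluster.
    Word types are processed from most to least frequent; a word joins the
    cluster of its first neighbour above the similarity threshold that has
    already been processed (i.e. is at least as frequent), otherwise it
    becomes its own cluster head."""
    replace = {}
    for word, _ in counts.most_common():
        target = word
        if word in model.wv.key_to_index:
            for neigh, sim in model.wv.most_similar(word, topn=topn):
                if sim < sim_threshold:
                    break               # neighbours are sorted by similarity
                if neigh in replace:    # seen earlier -> at least as frequent
                    target = replace[neigh]
                    break
        replace[word] = target
    return replace

counts = Counter(tok for sent in sentences for tok in sent)
repl = build_replacements(model, counts)
normalized = [[repl.get(tok, tok) for tok in sent] for sent in sentences]
```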