I am curious what suggestions you have on the normalization of historical language variants. I am working with a digitized corpus (roughly 50,000 pages) of manuscripts in the Dutch language, written between 1500 and 1850. There is large spelling variation within the corpus, due not only to its diachronicity but also to differences in the authors' literacy and to geographical variation (from Ghent to Groningen). The result is a corpus that is quite messy in its spelling, which complicates computational text analysis and the adaptation of NLP tools to the corpus.
Your suggestions might be reading recommendations, tools, dictionaries, corpora, other research projects, etc., or maybe even methods to work around the problem of spelling variation instead of trying to solve it.
Interesting suggestion to use vector space models. I remember from applying FastText that the top neighbors of words were quite often spelling variations. Using that might be a relatively cheap and straightforward way to solve (part of) your problem.
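For example, a minimal gensim FastText sketch (the tokenization and the example words here are invented, just to illustrate the idea):

```python
from gensim.models import FastText

# sentences: the manuscript corpus as lists of (lower-cased) tokens
sentences = [
    ["dese", "brief", "is", "gheschreven", "te", "ghendt"],
    ["deze", "brief", "is", "geschreven", "te", "gent"],
    # ... the rest of the ~50,000 pages
]

# the subword n-grams are what make FastText useful here: variant spellings
# share most of their character n-grams and end up close in vector space
model = FastText(sentences, vector_size=100, window=5, min_count=5, epochs=10)

# the nearest neighbours of a word are often its spelling variants
print(model.wv.most_similar("geschreven", topn=10))
```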
Specifically for Dutch there is this CLIN shared task (the link goes to an archived copy on the Wayback Machine).
> There is large spelling variation within the corpus, due not only to its diachronicity but also to differences in the authors' literacy and to geographical variation (from Ghent to Groningen).
This looks like a very difficult problem to solve, though.
We did something very similar to generate pairs for this paper; those pairs were then used to train an NMT system. It's also implemented in the natas library linked by Melvin above. We're currently experimenting with it on Finnish and it seems to be "solving" OCR post-correction and spelling normalisation at the same time. Note this is an eye-balled quality estimate and not a proper evaluation.
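For reference, the natas calls look roughly like this (function names as in its README at the time of writing, so do check the current docs; the bundled models target English and OCR data, so for Dutch you would train your own normalization model on variant–modern pairs):

```python
# pip install natas  (it also needs a spaCy model, see the natas README)
import natas

# spelling normalisation: historical forms -> modern candidates
print(natas.normalize_words(["seacreat", "wiþe"]))

# OCR post-correction: noisy forms -> corrected candidates
print(natas.ocr_correct_words(["paradice", "loue"]))
```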
I guess as a first step, something like what is described in our paper ("must be in most_similar()" + "must not exist in a dictionary" + "must have a Levenshtein distance < 3") could work quite well.
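A rough sketch of that filter, sticking with a gensim FastText model and using a plain set of modern word forms as the "dictionary" (the names and thresholds are illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def variant_candidates(word, model, dictionary, topn=25):
    """Neighbours of `word` that pass the three conditions:
    returned by most_similar(), absent from the modern dictionary,
    and within Levenshtein distance < 3 of `word`."""
    hits = []
    for neighbour, sim in model.wv.most_similar(word, topn=topn):
        if neighbour not in dictionary and levenshtein(word, neighbour) < 3:
            hits.append((neighbour, sim))
    return hits
```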
Thank y'all for thinking along! I already worked with the VARD2 tool and knew about the CLIN task; your other suggestions are new to me, so I'm gonna check them out. In the meantime, other tips and tricks are still welcome.
I don't think this paper has been mentioned above; this rather text-heavy poster by the same author might also be helpful. They deal with similar problems: small corpora, lots of variation.
If you have a particular task in mind, it could be that you get better results by applying a variation-aware approach to that task rather than applying a standard approach to the normalized corpus. I have worked on the lemmatization of historical languages (including Dutch), and our approach might help you get a lemmatized corpus (on which you can already run quite a bit of processing):
Hi Alie, thanks for your question! I've recently started a PhD project on language standardization in 17th-century Dutch (in Leiden/Nijmegen), so I'll definitely be following this thread, as I'll likely run into similar issues. I expect our aims with the corpus are rather different, though: for me, the spelling variation could be part of what I'd be studying, whereas your research, I believe, focuses more on historical content, so in your case normalization of your corpus data could be less problematic.

What I think is a rather elegant approach is the one taken in the "Letters as Loot" corpus by LUCL/INT (http://brievenalsbuit.inl.nl/zeebrieven/page/search): they've kept the original spellings (so for instance both "maar" and "maer" for the Dutch word for "but") and have added a "lemma" in present-day Dutch spelling of the word they both refer to: "maar". However, adding lemmas might be too time-consuming for you and might actually not solve your issues. I'm curious to read more suggestions!
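Just to make the idea concrete, a toy sketch of that annotation scheme (original spelling kept, modern-spelling lemma added), so that searching by lemma finds every attested variant:

```python
# toy example of the "keep the original token, add a modern lemma" scheme
corpus = [
    {"token": "maer", "lemma": "maar"},
    {"token": "maar", "lemma": "maar"},
]

# a search on the lemma returns all attested spellings
print([t["token"] for t in corpus if t["lemma"] == "maar"])  # ['maer', 'maar']
```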
(This is maybe an unpopular opinion, but for many (historic) languages orthographic normalisation actually isn't possible or desirable, because the language simply lacked an official standard at the time, and it feels weird to impose one. Working with postag-lemma pairs to represent tokens might be a viable alternative that is also easier to implement.)
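As a rough sketch of what that token representation could look like, here is spaCy's modern Dutch model purely for illustration; it is trained on present-day Dutch, so its tags and lemmas on 16th–19th century manuscript spelling are not guaranteed to be usable, and a tagger/lemmatizer trained on historical Dutch would be preferable:

```python
# python -m spacy download nl_core_news_sm
import spacy

nlp = spacy.load("nl_core_news_sm")

# represent each token by a (POS tag, lemma) pair instead of its surface spelling
doc = nlp("maer dat is niet waar")
print([(tok.pos_, tok.lemma_) for tok in doc])
```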
Apart from that, I can confirm that FastText works really well for clustering historic spelling variants. If you find a nice clustering threshold, you could replace the tokens in a cluster with the most frequent member of the cluster?
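A greedy sketch of that replacement step, reusing the `sentences` and `model` from the FastText sketch earlier in the thread (the threshold and the walk over most_similar() are just one way to do it; in practice you would probably also add a Levenshtein check, as suggested above, to keep semantically similar but distinct words apart):

```python
from collections import Counter

def build_replacements(model, counts, sim_threshold=0.85, topn=20):
    """Map every word type to the most frequent member of its cluster.
    Word types are processed from most to least frequent; a word joins the
    cluster of its first neighbour above the similarity threshold that has
    already been processed (i.e. is at least as frequent), otherwise it
    becomes its own cluster head."""
    replace = {}
    for word, _ in counts.most_common():
        target = word
        if word in model.wv.key_to_index:
            for neigh, sim in model.wv.most_similar(word, topn=topn):
                if sim < sim_threshold:
                    break               # neighbours are sorted by similarity
                if neigh in replace:    # seen earlier -> at least as frequent
                    target = replace[neigh]
                    break
        replace[word] = target
    return replace

counts = Counter(tok for sent in sentences for tok in sent)
repl = build_replacements(model, counts)
normalized = [[repl.get(tok, tok) for tok in sent] for sent in sentences]
```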