Spelling variation in historical language

Hi Alie, thanks for your question! I’ve recently started a PhD project on language standardization in 17th-century Dutch (in Leiden/Nijmegen), so I’ll definitely be following this thread, as I’ll likely run into similar issues. I expect our aims with the corpus are rather different, though: for me, the spelling variation could be part of what I study, whereas your research, I believe, focuses more on historical content, so normalizing your corpus data could be less problematic in your case. One approach I find rather elegant is the one taken in the ‘Letters as Loot’ corpus by LUCL/INT (http://brievenalsbuit.inl.nl/zeebrieven/page/search): they’ve kept the original spellings (so, for instance, both ‘maar’ and ‘maer’ for the Dutch word for ‘but’) and added a ‘lemma’ in present-day Dutch spelling for the word both forms refer to: ‘maar’. However, adding lemmas might be too time-consuming for you and might not actually solve your issues. I’m curious to read more suggestions!
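
To make the lemma idea a bit more concrete, here is a minimal Python sketch of how I picture it: every token keeps its original spelling and gets a present-day lemma next to it, so you can search by lemma without losing the variation. This is only my own illustration, not how the ‘Letters as Loot’ corpus is actually built; the names `LEMMA_LOOKUP`, `Token`, `annotate` and `variants_by_lemma` are made up for the example, and the tiny lookup table stands in for what would really be manual or semi-automatic annotation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical hand-made lookup: attested spelling -> present-day lemma.
# In a real corpus this mapping would come from annotation work.
LEMMA_LOOKUP = {
    "maer": "maar",
    "maar": "maar",
}


@dataclass(frozen=True)
class Token:
    original: str  # spelling exactly as it appears in the source text
    lemma: str     # present-day spelling shared by all variants


def annotate(words):
    """Attach a lemma to each word while keeping the original spelling.

    Words missing from the lookup fall back to their lowercased form.
    """
    return [Token(w, LEMMA_LOOKUP.get(w.lower(), w.lower())) for w in words]


def variants_by_lemma(tokens):
    """Group the attested spellings under each lemma."""
    groups = defaultdict(set)
    for token in tokens:
        groups[token.lemma].add(token.original)
    return groups


tokens = annotate(["maer", "maar", "Maer"])
groups = variants_by_lemma(tokens)
print(sorted(groups["maar"]))
```

Searching on the lemma ‘maar’ then returns every attested spelling, while the originals stay available for anyone who wants to study the variation itself.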
