I am curious what suggestions you have for normalizing historical language variants. I am working with a digitized corpus (roughly 50,000 pages) of Dutch-language manuscripts written between 1500 and 1850. Spelling varies widely within the corpus, due not only to its diachronicity but also to differences in the authors' literacy and to geographical variation (from Ghent to Groningen). The resulting spelling is quite messy, which complicates computational text analysis and the adaptation of NLP tools to the corpus.
Your suggestions might be reading recommendations, tools, dictionaries, corpora, or other research projects, or perhaps even methods to work around the problem of spelling variation instead of trying to solve it.