I have a complicated corpus and would like to measure how similar its texts are to one another, for example with a pairwise similarity metric. Fuzzy text reuse detection or intertextuality detection would also be fine.
However, the corpus is complicated in the following way:
- It’s a group of texts that all transmit one alchemical recipe, yet the recipe changes with experimentation. At first we thought it might be a special form of a stemma, but we have since realized that it is not. It’s also not helpful to detect topics or use topic-related measurements, because the texts are extremely similar overall. They also have different lengths (from half a page to 20 pages), so some thematic variation can easily arise from differences in text length alone, even when the chemical message of the text isn’t very different.
- The texts are mostly in early modern German (Frühneuhochdeutsch), so many out-of-the-box methods don’t apply.
We have already tried a few things, but the results weren’t very helpful. Semantic tagging did help, but we then ran into a visualization problem: we need to compare 30 texts with each other, and it’s getting really hard to create a useful output. If it were one stemma that we could relate all other texts to, it would be easy, but sadly it isn’t.
We have tried exact string matching (using longest common subsequences), which gave reasonably good results because many texts quote directly from the other texts they relate to.
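To make the matching step concrete, here is a minimal sketch of that kind of pairwise comparison. It is not our actual script: it assumes Python's standard `difflib`, whitespace tokenization, and a hypothetical `min_words` threshold for what counts as a quotation. The function names and the toy sentences are made up for illustration. Dividing by the length of the shorter text is one possible way to keep a half-page text comparable to a 20-page one.

```python
from difflib import SequenceMatcher
from itertools import combinations


def shared_passages(text_a, text_b, min_words=5):
    """Return word sequences that occur verbatim in both texts.

    min_words is an assumed threshold; shorter matches are ignored.
    """
    tokens_a, tokens_b = text_a.split(), text_b.split()
    # autojunk=False: do not silently discard very frequent words.
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return [
        " ".join(tokens_a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]


def pairwise_scores(texts, min_words=5):
    """Score every pair of texts by the share of the shorter text that is reused."""
    scores = {}
    for name_a, name_b in combinations(sorted(texts), 2):
        passages = shared_passages(texts[name_a], texts[name_b], min_words)
        shared = sum(len(p.split()) for p in passages)
        shorter = min(len(texts[name_a].split()), len(texts[name_b].split()))
        scores[(name_a, name_b)] = shared / shorter if shorter else 0.0
    return scores


# Toy witnesses, invented for this example.
texts = {
    "A": "nimm das quecksilber vnd thue es in ein glas",
    "B": "darnach nimm das quecksilber vnd thue es in ein kolben",
}
scores = pairwise_scores(texts, min_words=3)
```

With 30 texts this yields 435 pair scores, which can be arranged as a 30 × 30 matrix. Note that matching of this kind is exact at the word level, so `vnd` and `und` count as different words.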
We chose this approach, without normalization or the usual NLP pipelines, because we initially thought that spelling variants might be indicative of relationships between the texts (only very few can be dated properly), so we didn’t want to “normalize them away”. On further consideration, I feel that we may have overestimated the value of these orthographic differences, so I’m thinking about normalizing the texts after all, which would make more existing tools and libraries usable for us.
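To illustrate what I mean by normalizing, here is a minimal rule-based sketch. The rules are assumptions picked for illustration, not a validated rule set for Frühneuhochdeutsch, and the function name `normalize` is made up. Keeping the original text next to the normalized one would let us go back to the spelling variants later if they turn out to matter.

```python
import re
import unicodedata

# Illustrative rewrite rules (assumptions, not a tested standard):
# each pair is (regex pattern, replacement), applied in order.
RULES = [
    (r"\bv(?=[^\Waeiouäöü])", "u"),  # vnd -> und, vff -> uff
    (r"\bj(?=[^\Waeiouäöü])", "i"),  # jn -> in
    (r"th", "t"),                    # thue -> tue
    (r"ey", "ei"),                   # seyn -> sein
]


def normalize(text):
    """Return a lowercased, lightly normalized copy of an early modern German text."""
    # Compose combining characters so the same letter always has the same code points.
    text = unicodedata.normalize("NFC", text).lower()
    # Replace the long s with a round s.
    text = text.replace("ſ", "s")
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


original = "Vnd thue es jn ein Glas"
pair = (original, normalize(original))  # keep both versions side by side
```

Rules like these will over-apply in some words, so they would need checking against the corpus before the output is trusted.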
(Also, possibly relevant: it’s a non-funded side project, so I don’t have much time or money. I’m not sure I can do the normalization at all, even though the corpus isn’t huge.)
What would you say my task is? I’m unsure whether what I want to do is really intertextuality detection, a similarity metric or text reuse detection…
I’d be happy to hear your suggestions, and thanks for reading this long post.
PS: I have already seen the post “Spelling variation in historical language”, where people suggested FastText, which I haven’t tried yet. I certainly will, but I’m doubtful that it would give useful results on a corpus like mine, even if a model for the language in question were available. What do you think?