Similarity/Text reuse/Intertextuality detection on complicated corpus

sarah.lang · July 20, 2020, 3:00pm

Dear all,

I have a complicated corpus on which I would like to perform some measurement like a similarity metric of texts amongst each other. Fuzzy reuse detection or intertextuality detection would also be ok.
However, the corpus is complicated in the following way:

It’s a group of texts that all transmit one alchemical recipe, yet the recipe changes with experimentation. At first we thought it might be a special form of a stemma, but we have since realized that it is not. It’s also not helpful to detect topics or use topic-related measurements, because the texts are extremely similar overall. They also have different lengths (from half a page to 20 pages), so some thematic variation can easily arise from differences in text length - even when the chemical message of the text isn’t very different.
The texts are mostly in early modern German (Frühneuhochdeutsch), so many out-of-the-box methods don’t apply.

We have already tried a few things, but the results weren’t very helpful. We have also tried semantic tagging, which was helpful, yet we ran into a visualization problem: we need to compare 30 texts to each other and it’s getting really hard to create a useful output. If there were one stemma we could relate all other texts to, it would be easy, but sadly there isn’t.
We have tried exact string matching (using longest common subsequences), which yielded fairly good results because many texts use direct quotations from the other texts they relate to.
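For readers who want to try this kind of exact matching, here is a minimal sketch using Python’s standard-library difflib. It is not the code used in the project; the function name, the `min_words` threshold and the example sentences are invented for illustration. Note that it reports contiguous shared word runs (verbatim quotations), not gapped subsequences.

```python
from difflib import SequenceMatcher


def shared_passages(text_a, text_b, min_words=4):
    """Return verbatim word runs shared by both texts, longest first."""
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False stops difflib from ignoring very frequent tokens.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        # Keep only runs long enough to count as a quotation (assumed threshold).
        if block.size >= min_words:
            passages.append(" ".join(words_a[block.a:block.a + block.size]))
    return sorted(passages, key=len, reverse=True)


# Invented example sentences, not taken from the corpus.
a = "nimm das saltz vnd solvire es in wasser darnach coagulire es widerumb"
b = "so nimm das saltz vnd solvire es in gutem wein vnd coagulire es"
print(shared_passages(a, b))
```

Because no normalization is applied, a spelling variant such as "und" for "vnd" would break a match, which is the trade-off discussed in the next paragraph.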

We chose this approach, free of normalization and common NLP pipelines, because we initially thought that spelling variants might be indicative of relationships between the texts (only very few can be dated properly), so we didn’t want to “normalize them away”. But after further consideration, I feel that we might have overestimated the value of these orthographic differences, so I’m thinking about normalizing them after all, so that more pre-existing tools/libraries would be accessible to us.
(Also, maybe relevant to know, it’s a non-funded side project, so I don’t have lots of resources, either time or money, so I’m not sure if I can do the normalization at all, even though the corpus isn’t huge).

What would you say my task is? I’m unsure whether what I want to do is really intertextuality detection, a similarity metric or text reuse detection…

I’d be happy to hear some of your suggestions and thanks for reading this long post

Best,
Sarah

PS: I already saw the post “Spelling variation in historical language”, where people suggested FastText, which I haven’t tried yet - I sure will try, but I’m somewhat doubtful it would give useful results on a corpus like mine, even if they offered the language in question… what do you think?

folgert · July 21, 2020, 6:25am

Hi @sarah.lang!

Specific techniques for discovering text re-use are not my specialty, but perhaps people like @enrique.manjavacas or @mike.kestemont can help here.

The task as you describe it reminds me of phylogenetic approaches to textual change. If you frame it like that, you may want to consider using multiple text re-use measures rather than a single one. This would mean you create a feature vector of text-reuse metrics for each pair of texts and perform your analysis on that. You could then also consider assigning different weights to the different features, to make sure the value of certain aspects (like orthography) is not overestimated.
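A minimal sketch of the feature-vector idea, assuming three stand-in metrics (word-set Jaccard, character-trigram Jaccard as a rough proxy for orthography, and a word-level alignment ratio from difflib). The metric choice, the function names and the weights are all invented for illustration, not a recommendation.

```python
from difflib import SequenceMatcher


def char_ngrams(text, n=3):
    """Set of character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def reuse_features(text_a, text_b):
    """One feature vector (as a dict) for a pair of texts."""
    words_a, words_b = text_a.lower().split(), text_b.lower().split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return {
        "word_jaccard": jaccard(set(words_a), set(words_b)),
        "char_trigram_jaccard": jaccard(char_ngrams(text_a), char_ngrams(text_b)),
        "alignment_ratio": matcher.ratio(),
    }


def weighted_score(features, weights):
    """Weighted mean of the features; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(features[name] * w for name, w in weights.items()) / total


# Hypothetical weights that down-weight the orthography-sensitive feature.
weights = {"word_jaccard": 1.0, "char_trigram_jaccard": 0.5, "alignment_ratio": 2.0}
features = reuse_features("nimm das saltz vnd solvire es",
                          "nimb das salz und solvire es")
print(features)
print(weighted_score(features, weights))
```

The per-pair feature dicts could also be kept unweighted and fed into a clustering or tree-building method, which is closer to the phylogenetic framing.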

I’d be interested in what other people have to say! Thanks for the question.

enrique.manjavacas · July 21, 2020, 7:33am

Hi Sarah,

You might want to consider lemmatizing your text. Since you’ll have to deal with spelling variation but would probably also like to match terms while abstracting over morphological differences, lemmatization might solve both issues at the same time. Also, even if a lemmatizer gets the actual lemmas wrong for particular word types, that’s not so bad as long as the errors are consistent. Now the self-plug: we did some work on lemmatization of historical languages (https://www.aclweb.org/anthology/N19-1153/) and ran some experiments on historical German using the REN corpus. I think you could find enough training material for a decent Early Modern High German lemmatizer. If you are interested, I could train a specific model for your corpus or help you tag it.

I am also working on a Python library for text reuse detection; there are 3 types of algorithms (using set-based metrics, alignments and vector space models) and we are looking into improving coverage through the integration of distributional semantics. That’s unfortunately still ongoing work, but if you are interested, the repository is here (not very well documented at the moment): https://www.github.com/emanjavacas/retrieve
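To give a rough idea of what the vector space family means, here is a minimal standard-library sketch that treats each text as a bag-of-words count vector and compares two texts by cosine similarity. It does not use or reproduce the API of the library linked above; the function name and example strings are invented.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two raw word-count vectors."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity("nimm das saltz vnd das wasser",
                        "nimm das saltz vnd den wein"))
```

Unlike alignment, this ignores word order entirely, so it can pick up passages that share vocabulary but have been rearranged; in practice the counts would usually be reweighted (e.g. with TF-IDF) rather than used raw.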

andrea_nini · July 27, 2020, 12:03pm

there are 3 types of algorithms (using set-based metrics, alignments and vector space models)

@enrique.manjavacas, do you have any papers (or webpage or any other resources) to recommend to learn more about these?

mike.kestemont · August 17, 2020, 12:11pm

Hi @sarah.lang!

If the texts are not overly long, it might be useful to run the parallel versions through https://collatex.net/ which is good at aligning parallel witnesses. (If you run the algorithm at the character-level it might even solve some of the orthographic variation.) Running FastText would be cool too, but is unlikely to be successful if there is no sizeable reference corpus available of Frühneuhochdeutsch. (Is there?)

(A quick alternative that I have sometimes used would be to take something like Turnitin and apply it to all text pairs: I’ve seen discussions where they use the overlap scores returned by that system to make scholarly arguments: https://dearauthor.com/features/industry-news/master-of-the-universe-versus-fifty-shades-by-e-l-james-comparison/) I do think you can present/cast your task as one in the domain of intertextuality/text reuse detection.

I would also try to avoid a full NLP pipeline if you don’t need it. If there’s no lemmatizer available for this kind of material, it would take you ages to create one (which isn’t feasible for a side project).

Hope that helps. If not: let us know!

Mike

sarah.lang · August 18, 2020, 8:04am

Hi everybody,

thanks so much for answering my question / contributing and sorry for taking ages to write back.

In the meantime, I have decided to write a proposal for a small grant to get this work done (and also to be able to try all the methods and evaluate their effectiveness etc.), because a grant scheme came up which might make this possible.

I will take all your ideas into account - they were very valuable. I’ll also have to write up my first paper on this at the end of the month - so I might be back with some questions then
It would be really fun if we got the money and I had the time to test out all the great suggestions you had…

(also, given that we are applying for more money, in case you had thus far held back on some additional ideas because you thought they’d be too time-consuming - I might actually have time for this next year if the project goes through…)

Best,
Sarah

sarah.lang · August 18, 2020, 8:18am

Hi

I have already tried CollateX and comparable tools (Juxta Commons or Uni Halle’s LERA tool); however, these tools are meant for very similar texts which actually form a stemma. Mine are very similar, but they don’t really form a stemma because people have been adding their own experimentation and experiences, thus changing the text too much. Also, these tools should in principle catch small orthographical differences, but in practice they don’t because the overall text is too dissimilar.

Or did I maybe not get the usage of Collatex right? Maybe there would have been ways to optimize it for my use case but I haven’t found it so far.
I was also thinking about trying the TRACER tool (https://www.etrap.eu/tag/intertextuality/ & https://www.etrap.eu/historical-text-re-use/) - do you have any experiences with that?

This realization that tools like Collatex don’t work for me actually was my starting point.
What I did, essentially, was exact string matching (using Python, writing the matches back into TEI-XML) - which worked surprisingly well - and visualizing them in a custom LaTeX transformation (the owners of the data are a bit old-fashioned and got scared by HTML, so I decided I’d better make PDFs for them).

I will definitely consider multiple text reuse metrics, like @folgert said and I will also give @enrique.manjavacas’s lemmatization a shot which sounds fascinating! Maybe will also try fasttext…
@folgert Do you have any literature tips on phylogenetic approaches to textual change?

Oh and @mike.kestemont: I had also thought about using some plagiarism checker. I’ve been fascinated by that topic for a while (using the web to find intertextuality in my early modern texts, in a way that doesn’t crash the internet or get me blocked :D), but I haven’t really made any progress yet. Also, of course, this will only capture texts which have been digitized and transcribed/OCR’d, which is pretty biased after all… I’ll also look into this again should I get the money.

However, since I will be writing a grant proposal for this now, there won’t be any actual progress any time soon as my time will go into writing the proposal

mike.kestemont · August 18, 2020, 9:14am

Thanks for clarifying this, @sarah.lang! I understand the background of the problem better now. I’ve worked with Tracer before and always found it a bit daunting because it has so many settings and can be hard to parametrize, especially if you’re not deep into the (Java) code. In my own research or tutorials (much simpler problems!) I often resort to this more straightforward (Python) library for spotting text reuse: https://github.com/JonathanReeve/text-matcher.

sarah.lang · August 18, 2020, 9:57am

thanks for the link - and also for admitting that you find Tracer daunting too. I have felt the same way - it’s kind of scary and I didn’t get the hang of it. And since I wasn’t sure it was going to be a great match/solution for my problem anyway, I decided to not pursue it any longer.
Maybe, should I get the project, I’ll try again…

enrique.manjavacas · August 18, 2020, 1:47pm

Hey, sorry for the late reply. The literature on all of these is way too broad to specify any particular paper. I am currently working on a survey that I hope to put out soon that is quite related. I will post it here once it’s out.

andrea_nini · August 18, 2020, 3:19pm

That would be fantastic. Thanks!

folgert · August 22, 2020, 6:40am

The work of Sara Graça da Silva and Jamie Tehrani comes to mind. They are not using automatically extracted features, but the setup could be quite similar. Here are two papers:

https://royalsocietypublishing.org/doi/full/10.1098/rsos.150645

and

fpianz · August 29, 2020, 12:47pm

Hi Sarah,
I just finished a project where I had to detect text reuse on a very messy Twitter corpus. I tried Tracer without success and then moved to Passim and BLAST.
Out of 14k tweets, Passim identified only 500 text reuses (4%), BLAST 7k (52%) with very few false positives.
BLAST is a bioinformatics tool, but you can feed it letters in place of amino acids or nucleotides and find matches in text instead of protein or DNA sequences. The only downside is that it cannot match paraphrases like Tracer is supposed to do.
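To illustrate the seed-and-extend principle behind BLAST on plain text, here is a heavily simplified standard-library sketch. It is not BLAST itself: real BLAST scores mismatches and gaps and stops extending on a score drop-off, whereas this toy version only extends while characters match exactly. The function name and the parameters `k` and `min_length` are invented for illustration.

```python
from collections import defaultdict


def seed_and_extend(query, subject, k=5, min_length=12):
    """Find exact shared character runs by seeding on k-mers and extending.

    Returns sorted (query_offset, subject_offset, length) tuples.
    """
    # Index every k-character window ("seed") of the subject text.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)

    hits = set()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            # Extend the seed to the left while characters agree ...
            start_q, start_s = i, j
            while (start_q > 0 and start_s > 0
                   and query[start_q - 1] == subject[start_s - 1]):
                start_q -= 1
                start_s -= 1
            # ... and to the right.
            end_q, end_s = i + k, j + k
            while (end_q < len(query) and end_s < len(subject)
                   and query[end_q] == subject[end_s]):
                end_q += 1
                end_s += 1
            if end_q - start_q >= min_length:
                hits.add((start_q, start_s, end_q - start_q))
    return sorted(hits)


# Invented example strings, not real data.
q = "xx nimm das saltz vnd solvire yy"
s = "zz nimm das saltz vnd solvire ww"
for q_off, s_off, length in seed_and_extend(q, s):
    print(q_off, s_off, repr(q[q_off:q_off + length]))
```

Because matching works on raw characters, nothing here depends on tokenization or on the language, which is what makes the approach attractive for messy or historical text.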

Check also this application to Chinese literary texts: https://culturalanalytics.org/article/11054-a-blast-based-language-agnostic-text-reuse-algorithm-with-a-markus-implementation-and-sequence-alignment-optimized-for-large-chinese-corpora

sarah.lang · November 3, 2020, 8:17am

Hi again,

in the meantime - it was already ages ago that I last wrote! - we have completed the first part of the grant application. Let’s hope that they invite us to write the full proposal and then take the project because it would be a really cool one…

Anyways, I’ve tried the text-matcher and it was really good - especially as it basically did what I was doing anyway (but of course, my short code didn’t do it as well). Thanks a lot for that.

The problem now is that I’d need to write those results back into my data, and we’re currently talking about 40-50 texts which all need to be compared against each other, so it’s really not something I can manage as a side project.
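To put a number on the all-against-all comparison: 40 texts give 780 unordered pairs and 50 texts give 1225. A minimal sketch, assuming the texts are held in a dict of plain strings and using a difflib ratio as a stand-in score (not text-matcher’s output), shows how the pairs could be enumerated and ranked so that only the strongest ones need manual write-back. All names and example texts are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations
from math import comb


def pairwise_similarity(texts):
    """Score every unordered pair of texts once; returns {(name_a, name_b): score}."""
    scores = {}
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        matcher = SequenceMatcher(None, text_a.split(), text_b.split(), autojunk=False)
        scores[(name_a, name_b)] = matcher.ratio()
    return scores


# Number of unordered pairs for 40 and for 50 texts.
print(comb(40, 2), comb(50, 2))

# Invented toy texts standing in for the corpus.
texts = {
    "T1": "nimm das saltz vnd solvire es in wasser",
    "T2": "nimm das saltz vnd solvire es in wein",
    "T3": "setze es in die sonne vier wochen lang",
}
scores = pairwise_similarity(texts)
for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 2))
```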

Anyways, I wanted to share some screenshots of text-matcher (CLI output) compared to my old analyses (by now 1.5 years old), which were visualized in LaTeX and which had missed some very obvious textual overlaps that text-matcher caught.

We also did some semantic tagging to be able to visualize which parts of the recipe are present in which texts and what terms they used:

In this image, you can see one of our nice results: The paragraph at the right hand side doesn’t show any matches (of course, it’s easily possible they were not found by accident or we didn’t include the corresponding intertext in the corpus). But if it were actually free of intertext, it could mean that this text came from the author’s own experimentation.

Can I ask - what is the effect of having lots of hapaxes in the text (from not normalizing, thus not lemmatizing etc) on algorithms like stylometry? Does it behave much worse? Because my results seem ok - but just wondering.
It was 70% hapaxes, is that normal according to your experience? My values are:

Total number of types in corpus: 4907
Total number of hapax legomena: 3578
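For reference, 3578 out of 4907 types is about 73%. A minimal sketch of how such figures can be computed from a token list; the function name and the whitespace tokenization are assumptions, not the procedure actually used here.

```python
from collections import Counter


def hapax_stats(tokens):
    """Return (number of types, number of hapax legomena, hapax share of types)."""
    counts = Counter(tokens)
    n_types = len(counts)
    n_hapaxes = sum(1 for c in counts.values() if c == 1)
    share = n_hapaxes / n_types if n_types else 0.0
    return n_types, n_hapaxes, share


# The figures reported above.
print(f"{3578 / 4907:.1%}")

# Invented toy example: "vnd" occurs twice, the other two words once each.
print(hapax_stats("vnd das vnd saltz".split()))
```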

As you can see - the human-chosen groupings are denoted with “G1”, “G2” and “G3” and they are recreated pretty well by the (pydelta) stylometric analysis.

Thanks again for your help!
Hoping you’re all well,
Sarah

andrea_nini · November 18, 2020, 5:41pm

If I understand everything correctly, the hapaxes have no influence on the analysis because you only used the 500 most frequent words to calculate the Delta distance.
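To make this concrete, here is a minimal sketch of a Delta computation, assuming the classic Burrows variant (mean absolute difference of z-scored relative frequencies over the most frequent words). The thread does not say which Delta variant or settings were used in pydelta, so the function below, its name and its use of the population standard deviation are illustrative assumptions, not pydelta’s implementation.

```python
from collections import Counter
from statistics import mean, pstdev


def burrows_delta(docs, n_mfw=500):
    """docs maps a name to a non-empty token list; returns {(a, b): delta}."""
    rel_freqs = {}
    corpus_counts = Counter()
    for name, tokens in docs.items():
        counts = Counter(tokens)
        rel_freqs[name] = {w: c / len(tokens) for w, c in counts.items()}
        corpus_counts.update(counts)

    # Only the n_mfw most frequent words of the whole corpus become features.
    mfw = [word for word, _ in corpus_counts.most_common(n_mfw)]

    # z-score each feature across documents.
    z_scores = {name: [] for name in docs}
    for word in mfw:
        freqs = [rel_freqs[name].get(word, 0.0) for name in docs]
        mu, sigma = mean(freqs), pstdev(freqs)
        for name, freq in zip(docs, freqs):
            z_scores[name].append((freq - mu) / sigma if sigma else 0.0)

    names = list(docs)
    return {
        (a, b): mean([abs(x - y) for x, y in zip(z_scores[a], z_scores[b])])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Invented toy documents; each contains exactly one hapax (saltz, wasser, fewr).
docs = {
    "A": "vnd das vnd saltz vnd das".split(),
    "B": "das vnd das wasser das vnd".split(),
    "C": "vnd vnd das vnd fewr vnd".split(),
}
print(burrows_delta(docs, n_mfw=2))
```

With 4907 types of which 3578 are hapaxes, 1329 types occur at least twice, so a list of the 500 most frequent words cannot contain a hapax. In this sketch, hapaxes enter only indirectly, through the token totals used for the relative frequencies.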

sarah.lang · November 19, 2020, 12:42pm

Right, that makes sense - thanks!
Could have gotten to that conclusion myself by thinking a bit harder, maybe