Looking for literature on comparing topics between models with vocabulary differences

When comparing topics between LDA models trained on different collections of text, it’s common for people to use the Jensen-Shannon divergence as a measure of the (dis)similarity between topics. However, this gets complicated if the two collections being compared differ in their vocabularies (i.e., some word types may be present in one collection but not the other).

For example, if collection 1 has vocabulary {a, b, c} and collection 2 has vocabulary {b, c, d}, then comparisons between topics trained on the different collections require a to be inserted into the topics from collection 2, and d to be inserted into the topics learned from collection 1. My guess is that most people just insert such words with a probability of 0, but I haven’t been able to find a case where this is actually addressed in a paper.
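That zero-fill approach can be sketched directly: take the union of the two vocabularies, insert missing words with probability 0, and compute the Jensen-Shannon divergence over the aligned distributions. A minimal sketch (the helper name `aligned_jsd` and the toy topics are made up for illustration):

```python
import math

def aligned_jsd(topic1, topic2):
    """Jensen-Shannon divergence (base-2 log, so bounded by 1) between two
    topic-word distributions given as {word: probability} dicts.
    Words missing from either vocabulary are zero-filled."""
    vocab = sorted(set(topic1) | set(topic2))
    p = [topic1.get(w, 0.0) for w in vocab]
    q = [topic2.get(w, 0.0) for w in vocab]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy topics matching the example vocabularies {a, b, c} and {b, c, d}
t1 = {"a": 0.5, "b": 0.3, "c": 0.2}
t2 = {"b": 0.3, "c": 0.2, "d": 0.5}
print(aligned_jsd(t1, t2))  # 0.5 -- half of each topic's mass is on non-shared words
```

Note how heavily zero-filling penalizes vocabulary differences: the two topics agree exactly on the shared words b and c, yet the divergence is already 0.5 out of a maximum of 1.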

I’m trying to think through the implications of how this issue might be handled and am looking for literature that addresses this in any way. Suggestions for stuff to read on this are highly appreciated—thanks!

Is there a specific reason why you wouldn’t want to train a topic model on all collections at the same time, and then do the comparison between post hoc re-created collections that would live in the same space?


Good question! In my case, the two collections are basically the same, except that one has undergone some modifications that the other collection hasn’t, and I’d like to assess how different the resulting topics are as a result of those modifications.

Interesting problem. I have no recommendations for literature about this topic, but your approach of aligning the vocabularies of the two data sets seems like a good option. Have you tried it?


Sorry it has taken me so long to reply, Folgert. I have tried this approach, and I think it does make sense in cases where you expect the non-overlapping words to truly never occur in the other collection. However, in other cases where you don’t want to penalize as harshly for non-overlapping words between topics, it could make sense to simply ignore those words completely when calculating the dissimilarity. I’ve been trying to find examples of how others have handled this in different contexts, but I am starting to suspect it isn’t really a common dilemma.

It’s maybe not a perfect solution, and a bit brute-force, but could you:

  • generate a topic model for each corpus, which as you say are nearly identical besides some modifications
  • take the resulting topics from both models and find which topics in the one are most similar to the topics in the other, using…
    • simple word overlap among the top N words for each topic? (Jaccard coefficient?)
    • Spearman rank correlation among the ranked lists of top N words for each topic?
  • once you have pairwise connections between topics, you could examine which topics are most/least similar, and which words have the highest rank differences between the two paired topics?

If the process works, you could then re-run it for different numbers of topics, different modeling parameters, etc., and then average/aggregate the results.
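The pairing step above can be sketched with simple set overlap and rank differences (`jaccard_top_n` and `rank_differences` are illustrative helpers written for this sketch, not standard library functions):

```python
def jaccard_top_n(words1, words2, n=10):
    """Jaccard coefficient between the top-n word sets of two topics."""
    s1, s2 = set(words1[:n]), set(words2[:n])
    return len(s1 & s2) / len(s1 | s2)

def rank_differences(words1, words2, n=10):
    """For words in both top-n lists, the absolute difference in rank;
    large values flag words whose prominence shifted most between topics."""
    r1 = {w: i for i, w in enumerate(words1[:n])}
    r2 = {w: i for i, w in enumerate(words2[:n])}
    return {w: abs(r1[w] - r2[w]) for w in r1 if w in r2}

# Toy ranked word lists for two topics being paired across models
topic_a = ["model", "topic", "word", "corpus"]
topic_b = ["topic", "model", "corpus", "text"]
print(jaccard_top_n(topic_a, topic_b, n=4))     # 0.6  (3 shared of 5 total)
print(rank_differences(topic_a, topic_b, n=4))  # {'model': 1, 'topic': 1, 'corpus': 1}
```

Running `jaccard_top_n` over all topic pairs gives a similarity matrix you can use to match each topic in one model to its nearest counterpart in the other, and `rank_differences` then highlights which words moved within a matched pair.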

My guess is that most people just insert such words with probability of 0,

I am not sure this makes sense… I think your direction of thinking is right, and it may work for the sake of comparison, but I guess it would be better to do some smoothing so that the probability is > 0. Why not simply include the additional words in the Dirichlet prior on the topic-word distributions when setting the model up?

I think for most LDA implementations this would actually happen if you’d simply insert the extra words into the vocabulary. So it may be just one line of code for you.
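Under a symmetric Dirichlet prior with parameter beta, a word with zero count in a topic still gets probability proportional to beta. That effect can be approximated after the fact, without retraining (a sketch: `smooth_to_vocab` is a made-up helper, and it treats the learned probabilities as stand-ins for expected word counts):

```python
def smooth_to_vocab(topic, full_vocab, beta=0.01):
    """Re-estimate a topic-word distribution over a larger vocabulary,
    giving every word a pseudo-count of beta, as a symmetric Dirichlet
    prior would. Words absent from the topic get small but nonzero
    probability instead of 0."""
    total = 1.0 + beta * len(full_vocab)  # topic probabilities sum to 1
    return {w: (topic.get(w, 0.0) + beta) / total for w in full_vocab}

t1 = {"a": 0.5, "b": 0.3, "c": 0.2}  # learned over vocabulary {a, b, c}
smoothed = smooth_to_vocab(t1, ["a", "b", "c", "d"])
print(smoothed["d"])              # ~0.0096 -- nonzero, so log ratios stay finite
print(sum(smoothed.values()))     # ~1.0
```

With every probability strictly positive, divergences that involve log ratios (such as KL, which JSD is built from) are well defined across the full union vocabulary.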
