Error while computing coherence values of topic models

alie · December 4, 2020, 3:47pm

Dear all,

I am building a topic model (using gensim) from a large corpus of texts, and I use the coherence value to decide on the number of topics I want the model to build.

I use the following function to compute the coherence value for various numbers of topics:

import gensim
from gensim.models import CoherenceModel

def compute_coherence_values(dictionary, corpus, texts, limit, start, step):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of tokenized input texts
    start : Min num of topics
    limit : Max num of topics (exclusive)
    step : Increment between successive numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # ITERATIONS, N_WORKERS and OPTIMIZE_INTERVAL are constants defined elsewhere in the script
        model = gensim.models.wrappers.LdaMallet('PATH/mallet-2.0-4.8/bin/mallet', corpus=corpus, num_topics=num_topics, id2word=dictionary, iterations=ITERATIONS,
                workers=N_WORKERS, optimize_interval=OPTIMIZE_INTERVAL)
        model_list.append(model)
        # Use the texts argument rather than the global tokenized_texts
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

When I run the following line of code,

model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=tokenized_texts, start=10, limit=100, step=10)

I get this error at some point:

type: 13520 new topic: 2
0:9 5:8 
java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
	at cc.mallet.topics.WorkerRunnable.sampleTopicsForOneDoc(WorkerRunnable.java:552)
	at cc.mallet.topics.WorkerRunnable.run(WorkerRunnable.java:275)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

I tried to rerun the code several times, and the error always occurs at a different point.

I have used this method before and it worked fine at that time. So far, the existing threads on GitHub and Stack Overflow have not clarified it for me, so I hope there are one or two experts in this room who are better at understanding errors than I am.

Best,

Alie
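
The selection step mentioned in the question (using coherence to decide on the number of topics) can be sketched as follows. This is a minimal sketch, not part of the original code: `pick_best_num_topics` is a hypothetical helper, the scores are made up, and it assumes the coherence values were produced by looping over `range(start, limit, step)` as `compute_coherence_values` does.

```python
def pick_best_num_topics(coherence_values, start, limit, step):
    """Return (num_topics, coherence) for the highest coherence score.

    Hypothetical helper: it assumes coherence_values holds one score per
    value of range(start, limit, step), in that order.
    """
    topic_counts = list(range(start, limit, step))
    if not coherence_values:
        raise ValueError("no coherence values to compare")
    if len(topic_counts) != len(coherence_values):
        raise ValueError("expected one coherence value per topic count")
    # Index of the largest score; ties resolve to the smaller topic count.
    best_index = max(range(len(coherence_values)),
                     key=coherence_values.__getitem__)
    return topic_counts[best_index], coherence_values[best_index]


# Made-up scores for illustration only; real values come from CoherenceModel.
example_scores = [0.41, 0.47, 0.52, 0.49]
best_k, best_score = pick_best_num_topics(example_scores, start=10, limit=50, step=10)
```

Taking the plain maximum is only one possible rule; some people prefer the smallest topic count after which the curve flattens.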

simon · December 5, 2020, 1:47pm

When in your loop is “at some point”?

Could you try again with N_WORKERS = 1 (i.e. making sure it’s single core)?

alie · December 7, 2020, 1:52pm

Whether it’s the number of workers (changed it from 3 to 1) or the fact that I’m using a better WiFi network today, there have been no errors in the past 6 hours. Thanks!

melvin.wevers · December 7, 2020, 2:52pm

I stumbled upon this fix: https://github.com/hkarbasi/Mallet
This might allow you to use multiple workers and save you some precious time.

simon · December 7, 2020, 4:44pm

Cool!

As a personal rule of thumb I always prefer training my topic models in single-core (this way, with a seed, the results are replicable). I usually run several models at the same time (so it’s multicore anyway), but the RAM requirements might be limiting if you use MALLET (in this case through the gensim wrapper).
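
A rough sketch of that workflow (several single-core models trained side by side) is shown below. It is only an illustration under stated assumptions: `train_many` and `fake_train` are hypothetical names, and threads are used on the assumption that the gensim wrapper launches MALLET as an external process, so the heavy work happens outside Python. In real use, `train_one` would call `LdaMallet` with `workers=1` and a fixed seed.

```python
from concurrent.futures import ThreadPoolExecutor


def train_many(train_one, topic_counts, max_parallel=3):
    """Train one model per topic count, at most max_parallel at a time.

    train_one is a caller-supplied function (hypothetical here) that takes
    num_topics and returns a trained model.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so models[i] matches topic_counts[i].
        models = list(pool.map(train_one, topic_counts))
    return dict(zip(topic_counts, models))


def fake_train(num_topics):
    # Stand-in for a real training call; it only records its settings.
    return {"num_topics": num_topics, "workers": 1, "random_seed": 42}


models_by_k = train_many(fake_train, [10, 20, 30])
```

Each concurrently running model needs its own memory, so `max_parallel` is the knob to turn down when RAM is the limiting factor.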

Note that you can also use gensim's own implementation of LDA (LdaModel, or its multicore variant LdaMulticore), which uses Variational Bayes inference instead of MALLET’s Gibbs sampler. VB will be faster but less precise. Another difference is that gensim streams your data, so the corpus does not have to fit in RAM.

alie · December 7, 2020, 5:00pm

Thanks @melvin.wevers - I tried this because it sounded like my problem, but am getting the same error unfortunately.

Thanks for this @simon - will try to figure this out. I have a workaround for now, so that’s something.

dkltimon · December 8, 2020, 9:23am

Not 100% related to your question, but it might be important: there are known issues with the C_V coherence measure, so it is recommended to use C_P, NPMI or UCI for evaluating topics instead. See: https://github.com/dice-group/Palmetto/issues/13 and https://palmetto.demos.dice-research.org/
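
In gensim, `CoherenceModel` accepts `coherence='c_npmi'` and `coherence='c_uci'` as alternatives to `'c_v'`. To make the NPMI idea concrete, here is a simplified sketch of the score for a single word pair. It is an assumption-laden illustration, not how gensim or Palmetto compute it: probabilities are estimated from whole-document co-occurrence rather than sliding windows, and the function name and toy documents are made up.

```python
import math


def npmi(word_a, word_b, documents):
    """Normalized pointwise mutual information of two words.

    Simplification: probabilities are document frequencies, i.e. the share
    of documents containing the word (or both words).
    """
    if not documents:
        raise ValueError("need at least one document")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    p_a = sum(word_a in d for d in doc_sets) / n_docs
    p_b = sum(word_b in d for d in doc_sets) / n_docs
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n_docs
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0   # both words appear in every document
    # PMI divided by -log P(a, b) maps the score into [-1, 1].
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


toy_docs = [["cat", "dog"], ["cat", "dog"], ["fish"], ["bird"]]
always_together = npmi("cat", "dog", toy_docs)
never_together = npmi("cat", "fish", toy_docs)
```

A topic-level score is then typically an average of such pairwise values over the topic's top words.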

elibooklover · December 10, 2020, 7:32pm

If you deployed MALLET, make sure that Java is installed properly on your local or cloud server.

Here is the sample code to install Java:
First, set the install function.

def install_java():
    !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

After that, install Java.
install_java()

Don’t forget to import os if you haven’t already:
import os

In addition, I would change the path 'PATH/mallet-2.0-4.8/bin/mallet' so that it points to your actual MALLET installation.

For example,

os.environ['MALLET_HOME'] = '/content/mallet-2.0.8'
mallet_path = '/content/mallet-2.0.8/bin/mallet'

from gensim.models.wrappers import LdaMallet
ldamallet = LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)
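
Before training, a quick sanity check of the setup described above can save a long wait. The sketch below is a hypothetical helper (`check_mallet_setup` is not part of gensim or MALLET) that uses only the standard library. It treats a missing `MALLET_HOME` as worth flagging because the example above sets it, which may not be strictly required on every system.

```python
import os
import shutil


def check_mallet_setup(mallet_path):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper, not part of gensim or MALLET.
    """
    problems = []
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    if not os.path.isfile(mallet_path):
        problems.append(f"MALLET launcher not found: {mallet_path}")
    elif not os.access(mallet_path, os.X_OK):
        problems.append(f"MALLET launcher is not executable: {mallet_path}")
    if not os.environ.get("MALLET_HOME"):
        # Flagged only because the example above sets this variable.
        problems.append("MALLET_HOME is not set")
    return problems


# The path below is deliberately fake so the check reports it as missing.
issues = check_mallet_setup("/no/such/dir/bin/mallet")
```

Calling `check_mallet_setup(mallet_path)` with the real path and printing the returned list before constructing `LdaMallet` gives a quick view of what, if anything, is misconfigured.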