Dear all,
I am building a topic model (using gensim) from a large corpus of texts, and I use the coherence value to decide on the number of topics I want the model to build.
I use the following function to compute the coherence value for various numbers of topics:
    import gensim
    from gensim.models import CoherenceModel

    # ITERATIONS, N_WORKERS and OPTIMIZE_INTERVAL are constants defined elsewhere in my script.
    def compute_coherence_values(dictionary, corpus, texts, limit, start, step):
        """
        Compute c_v coherence for various numbers of topics

        Parameters:
        ----------
        dictionary : Gensim dictionary
        corpus : Gensim corpus
        texts : List of input texts
        limit : Max num of topics (exclusive)
        start : Min num of topics
        step : Step between successive numbers of topics

        Returns:
        -------
        model_list : List of LDA topic models
        coherence_values : Coherence values corresponding to the LDA model with respective number of topics
        """
        coherence_values = []
        model_list = []
        for num_topics in range(start, limit, step):
            model = gensim.models.wrappers.LdaMallet('PATH/mallet-2.0-4.8/bin/mallet', corpus=corpus, num_topics=num_topics, id2word=dictionary, iterations=ITERATIONS,
                                                     workers=N_WORKERS, optimize_interval=OPTIMIZE_INTERVAL)
            model_list.append(model)
            # Use the texts argument rather than the global tokenized_texts
            coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
            coherence_values.append(coherencemodel.get_coherence())
        return model_list, coherence_values
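For context, this is roughly how the coherence values are then used to decide on the number of topics. It is only a minimal sketch: pick_best_num_topics is a hypothetical helper name, not part of gensim, and it assumes the highest c_v score is simply taken as the best one.

```python
def pick_best_num_topics(coherence_values, start, limit, step):
    """Return (num_topics, coherence) for the highest coherence score.

    Hypothetical helper, not part of gensim. It assumes coherence_values
    holds one score per value of range(start, limit, step), in that order.
    """
    topic_counts = list(range(start, limit, step))
    if not coherence_values:
        raise ValueError("coherence_values is empty")
    if len(topic_counts) != len(coherence_values):
        raise ValueError("expected one coherence value per topic count")
    # Index of the largest coherence score (first one wins on ties).
    best_index = max(range(len(coherence_values)),
                     key=coherence_values.__getitem__)
    return topic_counts[best_index], coherence_values[best_index]
```

With the call shown further down, this would be used as pick_best_num_topics(coherence_values, start=10, limit=100, step=10).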
When I run the following line of code,
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=tokenized_texts, start=10, limit=100, step=10)
I get this error at some point:
type: 13520 new topic: 2
0:9 5:8
java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
    at cc.mallet.topics.WorkerRunnable.sampleTopicsForOneDoc(WorkerRunnable.java:552)
    at cc.mallet.topics.WorkerRunnable.run(WorkerRunnable.java:275)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
I have rerun the code several times, and the error occurs at a different point each time.
I have used this method before, and it worked fine then. The existing threads on GitHub and Stack Overflow have not clarified the problem for me, so I hope one or two experts here are better at understanding this kind of error than I am.
Best,
Alie