Scaling word frequencies when using PCA

Hi all,

I’m reviewing the use of PCA on word frequencies. It may seem like a minor detail, but I’m wondering whether scaling the word frequencies to have a mean of zero (and a standard deviation of 1) before applying PCA can really be considered best practice, and I’d like to hear your thoughts on that.

From the articles I’ve read, almost all authors seem to agree that before performing PCA on word frequencies, these frequencies need to be normalized, whether by dividing the frequencies of the words in a document by the total number of words in that document or by using TF-IDF (or some other variation). Some authors, however, also scale these normalized frequencies using z-scores, mainly to reduce the influence of high-frequency words.
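
For concreteness, a minimal sketch of those two steps with scikit-learn (toy documents and variable names of my own, not the notebook’s):

```python
# Minimal sketch: (1) length-normalize each document, (2) optionally z-score
# each word column across documents.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

docs = ["the fog was thick", "the cat was fat", "the fog and the cat"]  # toy stand-ins

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Relative frequencies: divide each row by the document's total word count.
# Multiplying by a constant (e.g. 10,000 for "per 10,000 words") rescales all
# variables equally and does not change the PCA directions.
rel_freq = counts / counts.sum(axis=1, keepdims=True)

# z-scores: zero mean and unit standard deviation per word, across documents.
z_scores = StandardScaler().fit_transform(rel_freq)
```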

Introductory textbooks generally seem to recommend scaling the variables, but the examples given are mostly concerned with tabular data in which the variables are radically different (height, weight, income). Furthermore, James et al. (p. 382) state: ‘In certain settings, however, the variables may be measured in the same units. In this case, we might not wish to scale the variables to have standard deviation one before performing PCA’. Isn’t this also the case when using normalized frequencies? (If you multiply these frequencies by, say, 10,000, each variable would be measured as ‘frequency of word X per 10,000 words’.)

I ran some experiments (notebook+data here) trying to reproduce Andrew Piper’s use of PCA on the chapters of Augustinus’ Confessions (or more specifically: trying to reproduce Nan Z. Da’s reproduction of Piper, who argued that Piper didn’t ‘properly’ scale his variables). The effect of scaling the frequencies is quite drastic, and would arguably result in a fundamentally different interpretation of the relationships between chapters.

Some other relevant observations from the notebook:

  • When you don’t scale the frequencies, high-frequency words indeed tend to have a very large influence on the outcome of the PCA; when you scale them, there is no clear relationship between component weight and frequency (see the sketch after this list). This is a strong argument for scaling the variables, but it seems to come at a price: you would, I believe, have to assume that the actual percentual contribution of a word doesn’t matter at all (also see the dummy example, where two words (fog/fat) that each occur only once are responsible for most of the differences).
  • The above seems to result in a model in which words that do not differ significantly between texts (according to an admittedly not-so-robust log-likelihood test) also receive relatively high weights.
  • If you remove those non-significant words, the PCA output, again, changes quite drastically.
  • If you use pooled fastText embeddings instead of BoW features, the resulting PCA more or less looks like a middle ground between the scaled and the unscaled PCA plots.
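
About the first point above: a rough sketch (on a toy corpus rather than the Confessions data) of how the words with the largest absolute weights on the first component shift once you z-score:

```python
# Compare which words dominate PC1 with and without z-scoring.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

docs = ["the fog was thick and grey",
        "the cat was fat and lazy",
        "the fog and the cat met at dawn"]

vec = CountVectorizer()
counts = vec.fit_transform(docs).toarray().astype(float)
vocab = vec.get_feature_names_out()
rel_freq = counts / counts.sum(axis=1, keepdims=True)

def pc1_ranking(matrix, k=5):
    # words sorted by the absolute size of their weight on the first component
    loadings = PCA(n_components=2).fit(matrix).components_[0]
    order = np.argsort(np.abs(loadings))[::-1][:k]
    return [(vocab[i], round(float(loadings[i]), 3)) for i in order]

print("unscaled:", pc1_ranking(rel_freq))
print("z-scored:", pc1_ranking(StandardScaler().fit_transform(rel_freq)))
```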

It might also be relevant here that the goal (or Piper’s goal) seems more thematically than stylometrically oriented (which explains why he removed stop words). Anyway, I’d like to hear your thoughts!

Joris

5 Likes

Joris, there’s obviously a lot to your question. I’m working through it for myself, so I’m going to take it one question at a time. I’ve read Piper’s book and Da’s critique of Piper et al., but it was a while ago for both.

These are my initial impressions. Apologies if I’ve misunderstood anything you’re asking. Happy to be corrected.

  1. Should PCA on a document’s term frequencies use scaled frequencies instead of raw term frequency counts?

I think the answer to this is yes, because the underlying question is whether you want to normalize for the length of each document. Unless we believe that documents/chapters of similar length really are more similar to each other, scaling should account for that aspect of the data. Raw counts would skew the results by document length, especially with very short or very long documents. Sometimes folks think removing stop words will solve this issue, but I don’t think it does.
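
A quick toy sketch of that length effect (made-up documents, nothing to do with the Confessions):

```python
# Raw counts let document length dominate; relative frequencies do not.
# Toy example: the same "text" once vs. repeated ten times.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

docs = ["night falls over the city",
        "night falls over the city " * 10,   # same wording, ten times longer
        "morning light over the harbour"]

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
rel_freq = counts / counts.sum(axis=1, keepdims=True)

# On raw counts, docs 0 and 1 look far apart despite identical wording;
# on relative frequencies their distance collapses to zero.
print(euclidean_distances(counts)[0, 1], euclidean_distances(rel_freq)[0, 1])
```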

  2. Should PCA on a document’s term frequencies use a transformation/weighting such as TF-IDF? If so, what is the impact?

TF-IDF is implicitly an interpretation of what’s important in a document. It elevates words that appear in some but not all documents. By using TF-IDF and then running PCA, we are asking a different research question … instead of asking which documents/chapters are the most generally similar, we’re asking which documents/chapters are the most similar in terms of their use of distinctive words. I don’t think there’s an absolutely right or wrong answer to this question unless your study design is predictive; then the best weighting is whatever predicts the label best. The most important thing is probably to adjust one’s interpretation to that which is being measured.
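
For concreteness, a small sketch of the difference in setup (toy documents; TfidfVectorizer’s default settings are themselves a choice):

```python
# Plain counts vs. TF-IDF weighting as input to PCA. TF-IDF down-weights words
# that occur in every document and up-weights words concentrated in a few, so
# the resulting components answer a different question.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["grace and sin and memory",
        "memory of the garden and of grace",
        "rhetoric, ambition and the theatre"]

bow = CountVectorizer().fit_transform(docs).toarray().astype(float)
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

coords_bow = PCA(n_components=2).fit_transform(bow)
coords_tfidf = PCA(n_components=2).fit_transform(tfidf)
```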

  3. Should normalization be used on a document’s term frequencies? If so, which normalization should be preferred?

There seems to be a lot of debate out there about whether PCA assumes normally distributed data, but I don’t think it does, as it’s generally considered a non-parametric method (https://www.researchgate.net/post/Hello_everyone_Is_it_possible_to_do_PCA_analysis_if_data_is_not_normally_distributed). That said, term frequencies are not normally distributed, so recalculating them as z-scores seems like a mistake. If you are talking about using the z-score of a feature’s frequency relative to all other documents, that seems sound, but I haven’t seen it done this way. Instead, I would think one would use a log transformation of the length-normalized frequencies per document, since the word-frequency distribution of a sufficiently large document is close to log-normal.
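
Concretely, I mean something along these lines (the per-10,000 rescaling and the log1p offset for zeros are just one way to do it):

```python
# One possible reading: length-normalize per document, then log-transform.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["pears and honey", "honey and bread and pears", "bread alone"]

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
rel_freq = counts / counts.sum(axis=1, keepdims=True)

# log1p keeps zero frequencies at zero and compresses the heavy right tail.
log_freq = np.log1p(rel_freq * 10_000)   # "per 10,000 words" before the log
```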

  4. Bird’s-eye view … did Piper fail to “properly scale” his variables?

I don’t think it’s that simple. The issue here is the force with which Da claims that this was an error, as opposed to arguing that Piper perhaps should have pursued more avenues of triangulation. A good result should be resilient, right? If I change the assumptions slightly, some aspect of the generalization should remain. Alternatively, if I try numerous methods, all potentially valid, the best finding should show up in many or most of the results. Da calls her version the “corrected version,” which I think is misleading.
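
One cheap way to operationalize that kind of triangulation: run a few defensible preprocessing variants and check how well the resulting pairwise document distances agree, e.g.:

```python
# Compare pairwise document distances across several preprocessing variants.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler

docs = ["grace and sin and memory",
        "memory of the garden and of grace",
        "rhetoric, ambition and the theatre",
        "the theatre of memory and sin"]   # toy stand-ins for the chapters

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
rel = counts / counts.sum(axis=1, keepdims=True)

variants = {
    "relative": rel,
    "z-scored": StandardScaler().fit_transform(rel),
    "tf-idf":   TfidfVectorizer().fit_transform(docs).toarray(),
}

# Rank-correlate the pairwise distance vectors of each pair of variants:
# a robust finding should not hinge on a single preprocessing choice.
names = list(variants)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        rho, _ = spearmanr(pdist(variants[names[i]]), pdist(variants[names[j]]))
        print(names[i], "vs", names[j], round(float(rho), 2))
```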

  5. Is PCA the best method for this question?

You didn’t ask this, but it’s my overriding question. If the goal is to see which chapters of Augustine’s “Confessions” are the most similar to one another, aren’t there more direct ways to do this without decomposing/reducing the number of dimensions you consider? Perhaps cosine similarities as a many-to-many graph, and then use network analysis methods? Or use an unsupervised clustering algorithm on the document vectors? If the goal is to discover unknown root factors that inform similarity, PCA would make sense to me, but I don’t think that was the goal here.
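
Something like this sketch, for instance (toy chapter stand-ins; in older scikit-learn versions the metric argument of AgglomerativeClustering is called affinity):

```python
# A more direct route: pairwise cosine similarities between chapter vectors,
# then clustering on the full vectors, without reducing dimensions first.
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chapters = ["grace and sin and memory",
            "memory of the garden and of grace",
            "rhetoric, ambition and the theatre",
            "the theatre of memory and sin"]   # stand-ins for the real chapters

vectors = TfidfVectorizer().fit_transform(chapters)
similarities = cosine_similarity(vectors)       # many-to-many similarity "graph"

# Cluster directly on the document vectors (cosine distance, average linkage).
labels = AgglomerativeClustering(n_clusters=2, metric="cosine",
                                 linkage="average").fit_predict(vectors.toarray())
print(similarities.round(2))
print(labels)
```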

5 Likes

Hi!

I’m not sure about PCA in particular, but since word frequencies naturally follow something like a power law, it’s fine, I guess, to disregard their relative “share” of a text and focus instead on variation in their usage in the given context. That said, it is common to apply some culling (based on frequency rank, etc.) rather than using full feature vectors filled with zeros.

And yes, I believe there was originally a mismatch between intent and methodology.

1 Like

Good question! I’m not a PCA expert, but I’m familiar with using z-scores in the context of regression analyses. One important reason to use z-scores in that context is the interpretability of the coefficients: it’s much easier to interpret a positive or negative coefficient in relation to a mean of zero. I think the same reasoning applies to stylometric applications of PCA, especially when plotting loadings. Here, too, interpreting values as deviations from a zero mean makes life easier.

2 Likes

(For the petite histoire: one of my first papers submitted to the annual DH conference actually got rejected because I didn’t scale my variables before doing a PCA, and the resulting graph contained loadings that clearly revealed that. This just goes to show that scaling really matters to textual scholars! :slight_smile: )

From my perspective, the concept of regularization should be part of the discussion. Without scaling, only a couple of features will be doing all the explanatory work (as you say), whereas scaling will smooth out the evidence over many different words.

“you would, I believe, have to assume that the actual percentual contribution of a word doesn’t matter at all” -> I agree with what @folgert says below. It’s the z-scores that matter for the individual words, and how they relate to the zero mean. Check out this paper (ping @fotis ), for instance:
https://academic.oup.com/dsh/article/32/suppl_2/ii4/3865676. Here, the authors show that the direction alone (not even the magnitude) of the z-score often suffices for an attribution.
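
A simplified illustration of that “direction only” idea (not the exact setup of the paper): keep only the sign of each z-scored frequency and compare documents with cosine similarity, as in Cosine Delta:

```python
# Compare documents on full z-scores vs. on the sign of the z-scores only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

docs = ["she said that it was late",
        "he said it was far too late",
        "the ship sailed at dawn for the islands"]   # toy texts

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
rel = counts / counts.sum(axis=1, keepdims=True)
z = StandardScaler().fit_transform(rel)

print(cosine_similarity(z).round(2))           # full z-scores
print(cosine_similarity(np.sign(z)).round(2))  # direction only: +1 / 0 / -1
```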

5 Likes

I agree with pretty much everything above. Just chiming in to say that scaling word frequencies by converting them to z-scores is normal practice for scholars using, e.g., Burrows’ Delta and its variants.
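
For anyone unfamiliar, a bare-bones sketch of the Delta recipe (toy texts; real applications use the few hundred or thousand most frequent words of a much larger corpus):

```python
# Burrows' Delta: z-score the relative frequencies of the most frequent words,
# then take the mean absolute difference of z-scores as the distance between texts.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

docs = ["she said that it was late and the rain kept on",
        "he said it was far too late for the road",
        "the ship sailed at dawn and the wind was high"]

vec = CountVectorizer(max_features=100)          # most-frequent-word culling
counts = vec.fit_transform(docs).toarray().astype(float)
rel = counts / counts.sum(axis=1, keepdims=True)
z = StandardScaler().fit_transform(rel)

def burrows_delta(a, b):
    return np.mean(np.abs(a - b))

print(burrows_delta(z[0], z[1]), burrows_delta(z[0], z[2]))
```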

I do find that distances between word vectors scaled this way are often more revealing than distances between vectors scaled using tf-idf. But as Matt Lavin says, it ultimately depends on your question and goal. I can imagine there might be questions where you’d want to give more weight to very common words, and then you might not want to “scale” everything as a z-score.

But that’s hypothetical. Note that the article I linked above (Evert et al.) is about authorship attribution, where common words are often thought to be important, and yet they find it best to convert to z-scores. It often seems to be a good idea.

3 Likes