I’m reviewing the use of PCA on word frequencies. It may seem like a minor detail, but I’m wondering whether scaling the word frequencies to have a mean of zero (and a standard deviation of one) before applying PCA can really be considered best practice, and I’d like to hear your thoughts on that.
From the articles I’ve read, almost all authors seem to agree that before performing PCA on word frequencies, the frequencies need to be normalized, whether by dividing the frequency of each word in a document by the total number of words in that document, or by using TF-IDF (or some other variation). Some authors, however, additionally scale these normalized frequencies to z-scores, mainly to reduce the influence of high-frequency words.
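To make the two variants concrete, here is a minimal sketch of what I mean (the variable names and the toy count matrix are mine, not from any of the articles); note that scikit-learn’s PCA centres the data but does not scale it, so the z-scoring has to be done explicitly:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a documents-by-words count matrix
rng = np.random.default_rng(0)
X_counts = rng.poisson(lam=2.0, size=(20, 100)).astype(float)

# Normalization: relative frequency of each word within its document
X_rel = X_counts / X_counts.sum(axis=1, keepdims=True)

# Variant A: PCA on the normalized frequencies (centred, but not scaled)
scores_unscaled = PCA(n_components=2).fit_transform(X_rel)

# Variant B: additionally z-score each word column (mean 0, std 1) before PCA
X_z = StandardScaler().fit_transform(X_rel)
scores_scaled = PCA(n_components=2).fit_transform(X_z)
```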
Introductory textbooks generally seem to recommend scaling the variables, but the examples given are mostly concerned with tabular data in which the variables are measured in radically different units (height, weight, income). Furthermore, James et al. (p. 382) state: ‘In certain settings, however, the variables may be measured in the same units. In this case, we might not wish to scale the variables to have standard deviation one before performing PCA’. Isn’t this also the case when using normalized frequencies? (If you multiply these frequencies by, say, 10,000, each variable would be measured as ‘frequency of word X per 10,000 words’.)
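If that reading is right, then expressing all frequencies in the same unit (e.g. per 10,000 words) shouldn’t change the unscaled PCA solution at all, apart from a uniform stretch of the scores. A quick sanity check along those lines (again with made-up data and hypothetical names):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_rel = rng.dirichlet(np.ones(50), size=30)  # toy relative frequencies
X_per_10k = X_rel * 10_000                   # same data as 'per 10,000 words'

pca_a = PCA(n_components=2).fit(X_rel)
pca_b = PCA(n_components=2).fit(X_per_10k)

# Multiplying every variable by the same constant scales the covariance matrix
# by that constant squared, so the principal directions stay the same (up to
# sign); only the scores/eigenvalues are rescaled. This should print True.
print(np.allclose(np.abs(pca_a.components_), np.abs(pca_b.components_), atol=1e-6))
```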
I ran some experiments (notebook+data here) trying to reproduce Andrew Piper’s use of PCA on the chapters of Augustinus’ Confessions (or more specifically: trying to reproduce Nan Z. Da’s reproduction of Piper, in which she argued that Piper didn’t ‘properly’ scale his variables). The effect of scaling the frequencies is quite drastic, and would arguably result in a fundamentally different interpretation of the relationships between the chapters.
Some other relevant observations from the notebook:
- When you don’t scale the frequencies, high-frequency words indeed tend to have a very large influence on the outcome of the PCA; when you do scale them, there is no clear relationship between component weight and frequency (see the sketch after this list). That is a strong argument for scaling the variables, but it seems to come at a price: you would, I believe, have to accept that a word’s actual proportional contribution doesn’t matter at all (see also the dummy example, where two words (fog/fat) that each occur only once are responsible for most of the differences).
- The above also seems to result in a model in which words that do not differ significantly between texts (according to a not-so-robust log-likelihood test) carry a relatively high weight.
- If you remove those non-significant words, the PCA output, again, changes quite drastically.
- If you use pooled fastText embeddings instead of BoW features, the resulting PCA looks more or less like a middle ground between the scaled and the unscaled PCA plots.
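For the first observation, the loading-versus-frequency relationship can be checked with something along these lines. This is only a toy sketch on synthetic, Zipf-ish data (not Piper’s chapters), and all names are placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
alpha = 100.0 / np.arange(1, 201)        # Zipf-like expected word frequencies
X_rel = rng.dirichlet(alpha, size=40)    # toy documents-by-words relative frequencies

corpus_freq = X_rel.mean(axis=0)         # average relative frequency per word

def pc1_loadings(X):
    """Absolute loadings of each word on the first principal component."""
    return np.abs(PCA(n_components=1).fit(X).components_[0])

loadings_raw = pc1_loadings(X_rel)
loadings_scaled = pc1_loadings(StandardScaler().fit_transform(X_rel))

# Correlation between a word's corpus frequency and its PC1 loading: in a
# setup like this the first value tends to be clearly positive (frequent
# words dominate), while the second is typically much weaker after z-scoring.
print(np.corrcoef(corpus_freq, loadings_raw)[0, 1])
print(np.corrcoef(corpus_freq, loadings_scaled)[0, 1])
```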
It might also be relevant here that the goal (or Piper’s goal) seems more thematically than stylometrically oriented (which explains why he removed stop words). Anyway, I’d like to hear your thoughts!