Discussion on how to choose the number of principal components to use


Here is the original question and several responses to it. I have expanded on some points that I should have included in the question originally; the expansions appear after the responses.


One of my rules of thumb in using PLS for calibration is that the number of PCs (or factors) to use is at least as great as the number n at which structure disappears from the scores plots. A scores plot of PC(n) vs PC(n+1) has structure when n is small, but at some point of increasing n, structure essentially disappears and the distribution of points in scores space becomes random. It is at about this value of n that I obtain better calibrations. I was told this generally holds, but don't remember who said it or where I saw it.
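As a concrete illustration of the rule of thumb above (a sketch on synthetic data, not the poster's actual procedure), the fragment below builds a data matrix with two real factors plus noise, computes PCA scores via the SVD, and reports the variance captured per component. The sharp drop after the second component is the numerical counterpart of structure vanishing from the PC(2) vs PC(3) scores plot; all variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars = 100, 20

# Synthetic data: two real (systematic) factors plus random noise, so
# structure should vanish from the scores plots after two components
latent = rng.normal(size=(n_samples, 2))
loadings = rng.normal(size=(2, n_vars))
X = latent @ loadings + 0.1 * rng.normal(size=(n_samples, n_vars))

# PCA via SVD of the mean-centred data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S   # plotting scores[:, n] vs scores[:, n+1] gives the scores plot

# Fraction of variance captured per component: the drop-off marks the
# point where what remains to be modelled is essentially noise
explained = S**2 / np.sum(S**2)
print(np.round(explained[:4], 3))
```

In a real calibration one would eyeball the successive score plots themselves; the `explained` vector is just a crude numerical stand-in for that visual judgement.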

Is this the observation of others? Is there a theoretical basis that I have forgotten about (or never knew)? Replies would be appreciated, of course.

From: Howard Mark

Bruce - well, basically: when the structure disappears, you're at the point where all that's left to model is the noise, which, of course, is random. While you're seeing structure, there are real, systematic effects being modeled.

My concern would be that at least some of the time, you're going too far and including factors that, while they do contain information, do not contain enough information, or relevant information, and therefore bring as much "noise" into the model as they account for real effects - i.e., what we loosely call "overfitting".


From: [email protected]

I think this is true for the T versus U scores plot (scores versus residuals, really). The structure in the n versus n+1 plot becomes difficult to interpret after the first few PCs.

From: Richard Kramer

Generally, this method would seem to make sense, but it raises the question "How do you determine whether or not a particular scores plot has structure?"
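Kramer's question can be made quantitative in many ways; the following is one illustrative permutation heuristic, not a method proposed in this thread. Shuffling one score axis destroys any joint structure while preserving the marginal distributions, so if the observed scatter has markedly smaller nearest-neighbour distances than its shuffled versions, the points are clustered rather than random. All function and variable names are hypothetical.

```python
import numpy as np

def mean_nn_distance(pts):
    """Average distance from each point to its nearest neighbour."""
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)   # exclude each point's distance to itself
    return d.min(axis=1).mean()

def structure_score(x, y, n_shuffles=200, seed=0):
    """Fraction of shuffled layouts whose mean NN distance is smaller
    than the observed one. Values near 0 suggest clustering/structure;
    values spread over (0, 1) are consistent with randomness."""
    rng = np.random.default_rng(seed)
    obs = mean_nn_distance(np.column_stack([x, y]))
    smaller = sum(
        mean_nn_distance(np.column_stack([x, rng.permutation(y)])) < obs
        for _ in range(n_shuffles)
    )
    return smaller / n_shuffles

rng = np.random.default_rng(1)
# A structured scores plot: two tight clusters
x_c = np.concatenate([rng.normal(-3, 0.3, 50), rng.normal(3, 0.3, 50)])
y_c = np.concatenate([rng.normal(-3, 0.3, 50), rng.normal(3, 0.3, 50)])
# A random scores plot
x_r, y_r = rng.normal(size=100), rng.normal(size=100)

p_struct = structure_score(x_c, y_c)
p_random = structure_score(x_r, y_r)
print(p_struct, p_random)
```

This is only a sketch: real scores plots can carry structure (curvature, stripes, outliers) that a nearest-neighbour statistic misses, which is presumably why visual judgement remains the practice described in this thread.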


Clarifications: When I use the "disappearance" of structure in a scores plot, I choose the number of factors/components in one of two ways: (1) the number that just gives what I visually judge to be probable randomness, compared with the scores plot using one less factor, or (2) by comparing the scores plot with a plot of variance against number of factors. Often, and maybe always(?), the minimum in a variance-per-factor plot is at or close to the same number of factors at which I perceive widespread randomness in the scores plot. This correlation makes sense to me: the randomness in a scores plot, as Howard correctly points out, is due to noise - about which we have to be careful. In a variance plot, the increase in variance is likewise due to noise, so the number of factors should correlate between the two plots.
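One common realization of the variance-per-factor plot described above is cross-validated prediction error (PRESS) as a function of the number of factors; its minimum is the kind of reference point being compared against the scores plots. The sketch below uses leave-one-out PRESS from principal component regression on synthetic data with three real factors - PCR is a simpler stand-in here for the PLS calibration discussed in the thread, and the implementation details are assumptions, not the poster's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 15

# Three latent factors with decreasing variance, each loading on its own
# block of five variables, plus a little measurement noise
f = rng.normal(size=(n, 3)) * np.array([3.0, 2.0, 1.0])
X = np.zeros((n, p))
X[:, 0:5] = f[:, [0]]
X[:, 5:10] = f[:, [1]]
X[:, 10:15] = f[:, [2]]
X += 0.05 * rng.normal(size=(n, p))
y = f.sum(axis=1) + 0.05 * rng.normal(size=n)

def press(X, y, n_factors):
    """Leave-one-out PRESS for principal component regression."""
    err = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xt, yt = X[mask], y[mask]
        xm, ym = Xt.mean(axis=0), yt.mean()
        U, S, Vt = np.linalg.svd(Xt - xm, full_matrices=False)
        V = Vt[:n_factors].T                 # loadings of the first factors
        T = (Xt - xm) @ V                    # training scores
        b = np.linalg.lstsq(T, yt - ym, rcond=None)[0]
        pred = ym + ((X[i] - xm) @ V) @ b    # predict the held-out sample
        err += (y[i] - pred) ** 2
    return err

curve = [press(X, y, k) for k in range(1, 8)]
best = 1 + int(np.argmin(curve))   # factors at the minimum of the curve
```

The poster's practice of using one or two factors fewer than the minimum is a conservative hedge against overfitting; in this sketch that would mean using `best - 1` rather than `best`.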

I don't use more factors than exist for a minimum in the variance plot, and often one or two less. And when consulting a scores plot, I often use one less than the number of factors giving what visually appears to be complete randomness. Even then, I look at other measures before deciding upon how many factors to use.