Choosing the "correct" number of factors



Here is the original question and several responses to it. I have expanded on some details of my usage that I should have included in the question originally; the expansions appear after the responses.


One of my rules of thumb in using PLS for calibration is that the number of PCs (or factors) to use is at least as great as the number n read from a scores plot. A scores plot of PC(n) vs PC(n+1) has structure when n is small, but at some point of increasing n the structure essentially disappears and the distribution of points in scores space becomes random. It is at this point, as measured by the value of n, that I obtain better calibrations. I was told this generally holds, but I don't remember who said it or where I saw it.

Is this the observation of others? Is there a theoretical basis that I have forgotten about (or never knew)? Replies would be appreciated, of course.

From: Howard Mark [email protected]

Bruce - well, basically: when the structure disappears then you're at the point where all that's left to model is the noise, which of course, is random. While you're seeing structure then there are real, systematic effects being modeled.

My concern would be that at least some of the time, you're going too far and including factors that, while they do contain information, do not contain enough information, or relevant information, and therefore bring as much "noise" into the model as they account for real effects - i.e., what we loosely call "overfitting".


From: [email protected]

I think this is true for the T versus U scores (scores versus residuals, really) plot. The structure in the n versus n+1 plot becomes difficult to interpret after the first few PCs.

From: Richard Kramer [email protected]

Generally, this method would seem to make sense, but it raises the question: how do you determine whether or not a particular scores plot has structure?
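One way to make that judgment objective is a numerical clustering-tendency measure. The sketch below implements a Hopkins-type statistic; this is my own choice of measure, not one proposed in the thread, and it only screens for one kind of structure (clustering). Values near 0.5 suggest spatial randomness; values approaching 1 suggest structure.

```python
import numpy as np

def hopkins(points, n_probe=25, rng=None):
    """Hopkins-type clustering-tendency statistic for a 2-D scores plot.

    ~0.5 for spatially random points, approaching 1.0 as clustering
    (structure) increases. A rough screening number, not a formal test.
    """
    rng = np.random.default_rng(rng)
    n = len(points)
    lo, hi = points.min(axis=0), points.max(axis=0)

    # u: nearest-data distances from uniform probe points in the bounding box.
    probes = rng.uniform(lo, hi, size=(n_probe, points.shape[1]))
    u = np.array([np.min(np.linalg.norm(points - p, axis=1)) for p in probes])

    # w: nearest-neighbour distances from a random subset of the real points.
    idx = rng.choice(n, size=n_probe, replace=False)
    w = np.array([
        np.min(np.linalg.norm(np.delete(points, i, axis=0) - points[i], axis=1))
        for i in idx
    ])
    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(1)
random_scores = rng.normal(size=(100, 2))                # "no structure"
clustered = np.vstack([rng.normal(-4, 0.3, (50, 2)),     # "structure":
                       rng.normal(+4, 0.3, (50, 2))])    # two tight groups

h_rand = hopkins(random_scores, rng=2)
h_clus = hopkins(clustered, rng=2)
```

Applied to a pair of score columns, this gives a number to set beside the eyeball judgment; curvature or other non-clustered structure would need a different measure.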


Clarifications: When I use the "disappearance" of structure in a scores plot, I choose the number of factors/components that either (1) just gives me what I visually judge to be probable randomness, compared with the scores plot using one less factor, or (2) agrees with a plot of variance against number of factors. Often, and maybe always(?), the minimum in a variance-per-factor plot is at or close to the number of factors at which I perceive widespread randomness in the scores plot. This correlation makes sense to me: the randomness in a scores plot is, as Howard correctly points out, due to noise - about which we have to be careful - and in a variance plot the increase in variance is likewise due to noise, so the number of factors indicated by the two plots should be correlated.

I don't use more factors than exist for a minimum in the variance plot, and often one or two less. And when consulting a scores plot, I often use one less than the number of factors giving what visually appears to be complete randomness. Even then, I look at other measures before deciding upon how many factors to use.

Bruce - second round comments, mainly in response to Rich Kramer's. I'd sent this to him privately also, but I think that perhaps it should also go to the group:

For a long time now I've been trying to promote the idea that the chemometrics community can and should learn from the statisticians, instead of looking down on and ignoring them just because, in our arrogance, we think we know more than they do since we can sling around matrix equations. Indeed, that's the reason the first 10 years or so of the column was called "STATISTICS in Spectroscopy".

Richard's question makes a perfect foil for this - one of the key underlying functions of Statistics as a science is just to distinguish those cases where real systematic effects are operating versus those cases where apparent effects are due only to noise and random factors, and to do so in an objective, systematic, scientific manner.

Mark Twain's comment about Statistics is cute, but if you've ever seen a competent Statistician work, you would quickly appreciate how the proper application of Statistics can lead you through the thicket of seemingly impenetrable problems in data analysis, and realize that for all his good work, in this case he is the one talking through his hat.

When I spent a good part of one of the recent columns trying to promote this point of view, one of the responses I received was on the order of: "why did you waste so much space before getting to the important part?". What I find very saddening is that that WAS the important part, and even the readers of the column, who have given us kudos for the statistics as well as the chemometric discussion, didn't recognize it.

To comment on the expansion of your question: stopping at a point corresponding to one or two fewer factors than the computer indicates to be the limit is a good idea in general. The problem with that procedure (from the purist scientific point of view) is that it may not be optimum in any given case. One way to improve on it is to devise a measure of the degree of improvement; the statisticians have shown us that the objective way to evaluate this is to look at the way the sums of squares are distributed - some of Malinowski's work takes this approach.

The main difficulty with this approach is that, in trying to eke out the maximum sums of squares in the model, it is also subject to overfitting. "Overfitting" is also something we don't have a good definition of - it's on the order of "I know it when I see it", but if we could generate a proper definition, then we could use that to create objective criteria for determining whether it exists.
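The sums-of-squares approach can be illustrated with the empirical indicator (IND) function usually attributed to Malinowski, which divides the pooled residual standard deviation by the square of the number of factors still unused and takes its minimum near the true rank. The sketch below is written from memory on synthetic data of my own devising, so the exact form should be checked against Malinowski's published work before relying on it.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic r x c data matrix of true rank 3, plus noise (illustrative only).
r, c, n_real = 60, 20, 3
X = rng.normal(size=(r, n_real)) @ rng.normal(size=(n_real, c))
X = X + 0.1 * rng.normal(size=(r, c))

# Eigenvalues of X'X, largest first, from the singular values.
ev = np.linalg.svd(X, compute_uv=False) ** 2

# IND(n) = RE(n) / (c - n)^2, where the "real error" RE(n) pools the
# sums of squares of the discarded eigenvalues. The minimum of IND is
# taken as the estimated number of real factors.
ind = []
for n in range(1, c):
    re_n = np.sqrt(ev[n:].sum() / (r * (c - n)))
    ind.append(re_n / (c - n) ** 2)
n_ind = int(np.argmin(ind)) + 1
print(n_ind)
```

Because IND keeps rewarding factors only while they remove more than a noise-sized share of the residual sum of squares, it is one concrete attempt at the objective overfitting criterion the paragraph above asks for.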


From: Richard Kramer [email protected]

The approach you describe is a reasonable way to find the general approximation of the optimum rank to use. But...

One of the problems I have with this whole discussion is that "optimum basis space" hasn't been defined. All too often, and apparently for the purposes of this discussion, "optimum basis space" is defined as "the basis space for a calibration comprising the first n factors, where n is chosen to produce the best calibration." I have three problems with this definition:

1. It does not allow for the exclusion of any of the first n factors;

2. It does not consider that the optimum basis space at time = 0 may not be the optimum basis space at some later time. Often the most robust calibrations are NOT the best calibrations at time = 0;

3. It otherwise does not define what is meant by "best calibration."

And ...

It also fails to consider that it is possible to have factors which have captured important systematic variance in the data, present at a magnitude lower than that of the noise captured by the same factor. This is often the case, as is apparent when a cross-validation plot shows a nice minimum in PRESS at, for example, 5 factors, but an even deeper minimum at, say, 11 factors. We had a discussion here some months ago about whether, in such cases, it is better to go with the 5-factor or the 11-factor model. The approach presently under discussion will always choose the 5-factor model without considering the relative merits of the 11-factor model.