Mean centering

19. Mean centering

This question has to do with mean-centering. Mean-centering means calculating the average spectrum of the data set and subtracting that average from each spectrum; the regression is then carried out on the centered spectra. I can see that in PCR (one of the regressions in PCRA), the first, and probably the other, eigenvectors will also be centered around zero because the spectra were mean-centered. But I am not so sure about the first eigenvector in PLS. Will it always be centered around zero? My thinking is that a material may have a large but non-contributing band that changes the mean-centered spectra, while the band(s) due to the constituent of interest vary much less, so the eigenvector could be displaced from zero by the non-contributing constituent. Am I wrong?

From: Howard Mark [email protected]

Subject: Re: Mean centering

Bruce - mean-centering is not an arbitrary operation, and its effects are rooted in some of the fundamental properties of the way data behave.

In the first place, when you do least-squares fitting of any kind, you have to be aware of the fact that the mean is itself a least-squares estimator. The proof of this is on pages 33-34 of Statistics in Spectroscopy. Thus, if you didn't subtract the mean from your spectral data initially, then unless the data were pathological the first factor you calculated would be the mean anyway, or very close to it (if other sources of variation were correlated with the mean value, they might distort it somewhat). So that is not accidental.
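As a rough illustration of that point, here is a minimal sketch that factors a small set of uncentered spectra and compares the first factor with the mean spectrum. The data are entirely synthetic and hypothetical (two Gaussian bands on a constant offset, with made-up concentrations and noise level), and NumPy's SVD stands in for whatever PCA routine you actually use.

```python
import numpy as np

# Hypothetical synthetic "spectra": two Gaussian bands on a constant offset.
# Band positions, widths, concentrations and noise are invented for illustration.
rng = np.random.default_rng(0)
wavelength = np.linspace(0.0, 1.0, 200)
band_a = np.exp(-((wavelength - 0.3) / 0.05) ** 2)
band_b = np.exp(-((wavelength - 0.7) / 0.08) ** 2)
conc_a = 1.0 + 0.10 * rng.standard_normal(30)
conc_b = 0.5 + 0.05 * rng.standard_normal(30)
X = np.outer(conc_a, band_a) + np.outer(conc_b, band_b) + 0.2
X += 0.001 * rng.standard_normal(X.shape)

# Factor the raw (NOT mean-centered) data; the rows of Vt are unit-length factors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
first_factor = Vt[0]

# Cosine of the angle between the first factor and the mean spectrum.
# abs() removes the arbitrary sign that SVD assigns to each factor.
mean_spectrum = X.mean(axis=0)
cosine = abs(first_factor @ mean_spectrum) / np.linalg.norm(mean_spectrum)
print(f"cosine between first factor and mean spectrum: {cosine:.5f}")
```

For data like these, where the variation between spectra is small compared with the mean, the cosine should come out very close to 1, i.e. the first factor has essentially the shape of the mean spectrum.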

Furthermore, in PCA all other factors have to be orthogonal to the first factor (they also have to be orthogonal to each other, but that's immaterial to the current question). Ordinarily, spectra are positive, so the mean will also be positive at all wavelengths. Thus, after each of the other factors is multiplied element by element by the first, its positive and negative contributions have to cancel exactly. This is equivalent to a weighted average, and that weighted average must be zero (the average can be zero if and only if the sum is zero, and vice versa). However, that doesn't mean that the UNweighted sum must be zero, and in fact it is not, in the general case. You can check this the same way I did, by reading a set of PCA factors into EXCEL and using the sum() function to sum over the values in each factor. Contrary to your intuition, the factors themselves need not (and in fact don't) sum to zero, therefore they are not themselves mean-centered.

When you use them to do regression, on the other hand, mean-centering is often applied as a first step, but this is also true for regression in general, not just when using PC factors. The reason is to reduce the number of significant bits in the data, so as to minimize the possibility of intermediate results overflowing the word size of the computer. I once calculated that if the number of bits is equivalent to 12 or 13 decimal digits then it is always enough to do regressions on NIR data, so a program written using double precision will not have a problem, although one using single precision may sometimes be affected by this if the data are too highly intercorrelated.
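The spreadsheet check described above can also be sketched in a few lines of Python. This is only an illustration on hypothetical synthetic spectra (invented bands, concentrations and noise), with NumPy's SVD of the uncentered data assumed as the factoring step. It compares the sum of each later factor weighted by the first factor with its plain, unweighted sum.

```python
import numpy as np

# Hypothetical all-positive synthetic spectra (invented for illustration).
rng = np.random.default_rng(0)
wavelength = np.linspace(0.0, 1.0, 200)
band_a = np.exp(-((wavelength - 0.3) / 0.05) ** 2)
band_b = np.exp(-((wavelength - 0.7) / 0.08) ** 2)
conc_a = 1.0 + 0.10 * rng.standard_normal(30)
conc_b = 0.5 + 0.05 * rng.standard_normal(30)
X = np.outer(conc_a, band_a) + np.outer(conc_b, band_b) + 0.2
X += 0.001 * rng.standard_normal(X.shape)

# Factors of the uncentered data; the first one plays the role of the mean.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
first = Vt[0] if Vt[0].sum() > 0 else -Vt[0]  # fix the arbitrary SVD sign

# Weighted sum: each later factor multiplied by the first factor, then summed.
weighted_sums = Vt[1:3] @ first
# Unweighted sum: what sum() over a spreadsheet column of the factor gives.
plain_sums = Vt[1:3].sum(axis=1)

for k, (ws, ps) in enumerate(zip(weighted_sums, plain_sums), start=2):
    print(f"factor {k}: weighted sum = {ws:.2e}, unweighted sum = {ps:.4f}")
```

The weighted sums should be zero to within rounding error, because that is just the orthogonality condition, while the unweighted sums are in general not zero.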

Now in PLS, both the least-square and orthogonality requirements are relaxed, allowing the factors even more freedom to not be mean-centered.
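A similar check can be sketched for PLS. The snippet below assumes the usual NIPALS formulation of PLS1 for the first factor (weight vector proportional to X'y, then score and loading), applied to mean-centered hypothetical data in which the first band's concentration is taken as the constituent of interest. The data, names and parameters are invented for illustration and are not taken from any particular software package.

```python
import numpy as np

# Hypothetical synthetic spectra; conc_a is treated as the constituent of interest.
rng = np.random.default_rng(0)
wavelength = np.linspace(0.0, 1.0, 200)
band_a = np.exp(-((wavelength - 0.3) / 0.05) ** 2)
band_b = np.exp(-((wavelength - 0.7) / 0.08) ** 2)
conc_a = 1.0 + 0.10 * rng.standard_normal(30)
conc_b = 0.5 + 0.05 * rng.standard_normal(30)
X = np.outer(conc_a, band_a) + np.outer(conc_b, band_b) + 0.2
X += 0.001 * rng.standard_normal(X.shape)
y = conc_a

# Mean-center both the spectra and the constituent values.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PLS1 factor (NIPALS form, assumed here):
w = Xc.T @ yc              # weight vector, proportional to X'y
w /= np.linalg.norm(w)     # normalized to unit length
t = Xc @ w                 # scores
p = Xc.T @ t / (t @ t)     # loading vector

print(f"largest column mean of centered X: {np.abs(Xc.mean(axis=0)).max():.2e}")
print(f"sum of first PLS weight vector:    {w.sum():.4f}")
print(f"sum of first PLS loading vector:   {p.sum():.4f}")
```

Even though every column of the centered data averages to zero, neither the weight vector nor the loading vector of the first PLS factor is expected to sum to zero here.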


From: Ola Berntsson [email protected]

I'm not sure if I understood the question correctly but here is my answer:

In a PCA (where X=T*P'), the loading vectors (the columns of P) are orthogonal [p1'*p2=0 etc] and normalized [p1'*p1=1 etc] but NOT centered. The mean of any loading vector, in PLS or PCA, is never set to zero, regardless of what the mean of X is.
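These properties are easy to verify numerically. The following minimal sketch uses hypothetical synthetic spectra and obtains P from NumPy's SVD of the mean-centered data (one common way to compute a PCA, assumed here), then checks orthonormality and the mean of each loading vector.

```python
import numpy as np

# Hypothetical synthetic spectra (invented bands, concentrations and noise).
rng = np.random.default_rng(0)
wavelength = np.linspace(0.0, 1.0, 200)
band_a = np.exp(-((wavelength - 0.3) / 0.05) ** 2)
band_b = np.exp(-((wavelength - 0.7) / 0.08) ** 2)
conc_a = 1.0 + 0.10 * rng.standard_normal(30)
conc_b = 0.5 + 0.05 * rng.standard_normal(30)
X = np.outer(conc_a, band_a) + np.outer(conc_b, band_b) + 0.2
X += 0.001 * rng.standard_normal(X.shape)

# PCA of the mean-centered data via SVD: X_centered = T * P'
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T               # first two loading vectors as columns
T = U[:, :2] * s[:2]       # corresponding scores

# Orthogonal and normalized: P'P should be the identity matrix.
gram = P.T @ P
print("P'P =")
print(np.round(gram, 12))

# But not centered: the mean of each loading vector is not zero.
loading_means = P.mean(axis=0)
print("mean of each loading vector:", loading_means)
```

P'P should come out as the 2 x 2 identity matrix to within rounding error, while the loading means are not zero even though X was mean-centered first.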




Ola Berntsson

Analytical Chemistry

KTH - Royal Institute of Technology

SE-100 44 Stockholm, Sweden

Phone: +46 8 790 8216

Fax: +46 8 10 84 25