NIR Discussion Forum: Definition for RMSEC

Daniel Fraser (Dgfraser)

Posted on Tuesday, May 08, 2001 - 4:15 am:

RE: Definition for RMSEC.

I generally use RMSEP to evaluate my calibrations on an independent set, and RMSEC plays only a minor, instructive role. However, I have been confused about the correct definition of RMSEC. I have seen two different versions. The first is:

RMSEC = SQRT(SUM[(Ŷi-Yi)^2]/n)

Where:
SQRT is the square root function,
SUM[ ] is my summation symbol for i = 1 to n (capital sigma).
Ŷi is the predicted response value for a sample within the set of values used to make the model (calibration).
Yi is the true response value (or in reality the measured response value).
^2 Means to the power of 2.
n is the number of samples in the calibration.

The second is:

RMSEC(pls) = SQRT(SUM[(Ŷi-Yi)^2]/(n-f-1))

Where:
RMSEC(pls) indicates PLS was used to form the calibration model.
f is the number of factors in the model.
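
As a rough illustration of the two definitions above, here is a minimal Python sketch. The function name `rmsec`, its `n_factors` argument, and the sample data are made up for this example; they do not come from any of the packages or books mentioned in this thread.

```python
import math

def rmsec(y_pred, y_true, n_factors=None):
    """Root mean square error of calibration (hypothetical helper).

    n_factors=None -> divide by n          (first definition)
    n_factors=f    -> divide by n - f - 1  (second definition)
    """
    n = len(y_true)
    # Sum of squared calibration residuals, SUM[(Yhat_i - Y_i)^2]
    ss = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    dof = n if n_factors is None else n - n_factors - 1
    if dof <= 0:
        raise ValueError("no residual degrees of freedom left")
    return math.sqrt(ss / dof)

# Made-up calibration data, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]

print(rmsec(y_pred, y_true))               # denominator n = 6
print(rmsec(y_pred, y_true, n_factors=2))  # denominator n - f - 1 = 3
```

The two calls share the same sum of squares, so the second value is always the larger one whenever f is at least 0.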

The first version is the one described in The Unscrambler (ver. 6) user manual (p347) and the Matlab toolbox 2.0 (p88). It seems nice and tidy, as it has the same form as RMSEP = SQRT(SUM[(Ŷip-Yi)^2]/n), where Ŷip is the predicted response value for a sample from a set independent of the calibration set used to make the model.

The second form is referred to in two different books. Martens and Naes (p253) define MSEE = SUM[(Ŷi-Yi)^2]/(n-df), where df is the degrees of freedom used in fitting the regression. For PLS they suggest that df should be at least (A+1), where A is the number of factors in the PLS model. They say "at least" because y is used in finding both the factors and the loadings (and they refer to the paper by Martens and Jensen, 1983). They go on to define SEC to be SQRT(MSEE), hence my second definition above. The book by Richard Kramer (p170) also concurs with this definition of RMSEC for a PLS method. (Watch out for a couple of incorrect equations on page 170, though they are described correctly in the text.)

So which one should be used?!!

hlmark

Posted on Tuesday, May 08, 2001 - 7:41 am:

Daniel - in theory or practice?

In a practical sense, it's not going to make too
much difference. The calculated values from the
two expressions will differ by a factor of:

sqrt ((n-1) / (n-f-1))

and except for very small values of n this ratio
will be very close to unity. For example, even
with as few as 20 samples and 4 factors, the
ratio is:

sqrt (19 / 15) = sqrt (1.267) = 1.13

and I hope you are using more samples than that!
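
To get a feel for how quickly that factor approaches unity, here is a small Python sketch of the ratio quoted above. The function name `sec_ratio` and the sample sizes in the loop are assumptions chosen for illustration.

```python
import math

def sec_ratio(n, f):
    """Ratio sqrt((n-1)/(n-f-1)) between the two standard-error
    expressions, for n calibration samples and f factors."""
    return math.sqrt((n - 1) / (n - f - 1))

# Same 4-factor model, increasing numbers of calibration samples
for n in (20, 50, 100, 500):
    print(n, round(sec_ratio(n, 4), 3))
```

The ratio shrinks toward 1 as n grows, which is the practical point being made: with a reasonably large calibration set the choice of denominator changes the reported value only slightly.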

In theory it does make a difference, and the
expression using the degrees of freedom is the
correct one. The use of degrees of freedom is
not an arbitrary decision on the part of the
statisticians. It ensures that the quantities
that you calculate have desirable fundamental
statistical properties, including unbiasedness,
and being maximum likelihood estimators.

To see what the difference is, let's look at a
thought experiment: doing a calibration with 2
samples and one variable (or 3 samples and two
variables, etc.). In these cases, your model
will fit the data exactly, as we all know, and
therefore have an SEC of exactly zero. But then
let's go to the next step and ask: what is the
estimated error of predicting unknowns (which is
the question that the SEP, RMSEP, PRESS, etc.
are intended to answer)?

Now, if we use n-1, then calculating
SUM[(Ŷi-Yi)^2]/(n-1) gives a value of 0 / 1 = 0, which
is clearly not the correct answer, since we do
not expect to be able to predict unknowns with
zero error, even though we were able to
"calibrate" that way.

If we use n-f-1 then the calculation yields
SUM[(Ŷi-Yi)^2]/(n-f-1) = 0/0 which is indeterminate and
is the correct answer, since we have no
information about the error and therefore no way
to tell what the error is going to be.
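
The thought experiment can be reproduced numerically. The sketch below fits a straight line (one variable, so f = 1) to two made-up samples and then forms the mean square both ways. The helper names `fit_line` and `mean_square`, and the choice to report 0/0 as NaN, are assumptions made for this illustration.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one variable."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def mean_square(ss, dof):
    """Sum of squares over degrees of freedom; NaN stands in for 0 dof."""
    return float("nan") if dof == 0 else ss / dof

# Two made-up samples, one variable: the line passes through both points
x = [1.0, 3.0]
y = [2.0, 5.0]
slope, intercept = fit_line(x, y)
ss = sum((slope * xi + intercept - yi) ** 2 for xi, yi in zip(x, y))

n, f = len(x), 1
naive = mean_square(ss, n - 1)          # 0 / 1: a misleading "zero error"
corrected = mean_square(ss, n - f - 1)  # 0 / 0: indeterminate, shown as NaN

print(ss, naive, corrected)
```

Returning NaN rather than raising an error is only a convenience here; the point is that the second denominator refuses to report an error estimate when the data contain no information about it.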

Howard

hlmark

Posted on Tuesday, May 08, 2001 - 8:00 am:

To all - whoops, did I goof!! In the discussion of the theory of calculating standard errors, I got my SEC's and SEP's and RMSEP's mixed up and didn't realize it until I got the message back. The argument about a computed standard error, distinguishing the case 0/1 from 0/0, is correct for the SEC, since the SEC is intended to give an estimate of the prediction error from the calibration data. Sorry 'bout that!

The correct denominator term to use for the calculation of SEP does not depend on the properties of the model. Since you are doing the calculation based on values already predicted (instrument - actual, let's call these I and A), the model does not enter into it. What does make a theoretical difference, however, is whether you make any corrections after predicting the concentration values.

Thus, if you calculate sum((I-A)^2) for a set of samples, then the correct denominator is n. If you perform a bias (offset) correction using the mean difference between instrument and reference values, however, then the correct denominator is (n-1). If you go further and in addition do a skew (slope) correction, then the correct denominator is (n-2).
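
The three cases can be sketched in Python as follows. The function name `sep`, the `correction` argument, and the data are hypothetical. The slope correction is implemented here as a least-squares regression of actual on instrument values, which is one plausible reading of "skew (slope) correction" and is an assumption of this sketch, not something stated in the thread.

```python
import math

def sep(instrument, actual, correction="none"):
    """Standard error of prediction with the denominator matched to the
    corrections applied after prediction (n, n-1 or n-2)."""
    n = len(actual)
    if correction == "none":
        # Raw differences I - A, denominator n
        ss = sum((i - a) ** 2 for i, a in zip(instrument, actual))
        return math.sqrt(ss / n)
    if correction == "bias":
        # Subtract the mean difference first, denominator n - 1
        d = [i - a for i, a in zip(instrument, actual)]
        bias = sum(d) / n
        return math.sqrt(sum((e - bias) ** 2 for e in d) / (n - 1))
    if correction == "bias+slope":
        # Assumed form: regress actual on instrument, denominator n - 2
        mi = sum(instrument) / n
        ma = sum(actual) / n
        sxx = sum((i - mi) ** 2 for i in instrument)
        sxy = sum((i - mi) * (a - ma) for i, a in zip(instrument, actual))
        b = sxy / sxx
        a0 = ma - b * mi
        ss = sum((a0 + b * i - a) ** 2 for i, a in zip(instrument, actual))
        return math.sqrt(ss / (n - 2))
    raise ValueError("correction must be 'none', 'bias' or 'bias+slope'")

# Made-up instrument (I) and reference (A) values
I = [1.2, 2.1, 3.3, 4.2, 5.2]
A = [1.0, 2.0, 3.0, 4.0, 5.0]
for mode in ("none", "bias", "bias+slope"):
    print(mode, sep(I, A, mode))
```

Each correction estimates one more quantity from the same validation samples, which is why one more degree of freedom is removed from the denominator each time.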

Howard