RMSEV with smaller values than RMSEC?

Kelly Anderson
Posted on Monday, March 04, 2002 - 8:49 am:   

It is my understanding that with a multivariate calibration model, the error for the validation set (RMSEV) should be nearly the same or greater than the error for the calibration set (RMSEC). If the RMSEV is less than the RMSEC, is this generally a problem? Why would the RMSEV be less than the RMSEC? Thanks for your assistance.
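
For reference, RMSEC and RMSEV are both root mean squared errors, computed on the calibration and validation reference values respectively. A minimal sketch in Python/NumPy; the model object and the arrays y_cal, X_cal, y_val, X_val are illustrative assumptions, not from the post:

```python
import numpy as np

def rmse(y_ref, y_pred):
    """Root mean squared error between reference and predicted values."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_ref - y_pred) ** 2))

# Hypothetical usage with an already-fitted multivariate model:
# rmsec = rmse(y_cal, model.predict(X_cal))   # error on the calibration set
# rmsev = rmse(y_val, model.predict(X_val))   # error on the independent validation set
```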

Bruce H. Campbell (Campclan)
Posted on Monday, March 04, 2002 - 9:40 am:   

Kelly - The RMSEV can be smaller than the RMSEC simply due to statistical variation. As long as the difference is a small fraction of either value, it should not be a problem.
Bruce Campbell

Tony Davies (Td)
Posted on Monday, March 04, 2002 - 1:06 pm:   

Hello Kelly,

A smaller RMSEV could suggest that your validation samples are much less variable than the calibration data. How did you select the calibration and validation sets, and how many samples were there? If you are dealing with small sample sets then it may be just "luck-of-the-draw", but with a reasonable number in both sets there must be an answer. Have you compared scatter plots of the calibration and validation sets? Are the distributions the same? Could there be an outlier in the calibration set which is causing the RMSEC to be large?
Best wishes, Tony Davies
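
A minimal sketch of the kind of distribution comparison Tony suggests, assuming the reference values for the two sets are held in hypothetical arrays y_cal and y_val (illustrative names, not from the post):

```python
import numpy as np

def summarize(name, y):
    """Print simple distribution statistics for a set of reference values."""
    y = np.asarray(y, dtype=float)
    print(f"{name}: n={y.size}  mean={y.mean():.3f}  sd={y.std(ddof=1):.3f}  "
          f"min={y.min():.3f}  max={y.max():.3f}")

# Hypothetical usage:
# summarize("calibration", y_cal)
# summarize("validation", y_val)
# A much narrower spread in the validation set, or an extreme sample inflating
# the calibration error, are the sorts of things these checks can reveal.
```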

Don Dahm
Posted on Tuesday, March 05, 2002 - 5:09 am:   

I think that one of the things that happens is that we tend to use the more extreme samples in our calibration set because they are a bit scarce, and then the validation set has more of its samples near the center of the range. The samples at the center of the range tend to have a smaller error. (Imagine doing a least squares fit of three points. On the average, the center point will be closest to the line.)

hlmark
Posted on Tuesday, March 05, 2002 - 3:31 pm:   

Don - I'm afraid that's not quite true. The middle point of three has a larger error than the other two. For example, here's a trial of that:

X    Act Y    Pred Y     Error
1    1        0.98333    0.01666
2    2        2.03333   -0.03333
3    3.1      3.08333    0.01666


In case it's not self-evident, the column "X" is the data I used for the independent variable, the column "Act Y" is the data I used for the calibration, "Pred Y" contains the predicted values from applying the model to the X data, and "Error" is the difference between "Act Y" and "Pred Y".

As you can see, the error for the middle point of the three is larger than the errors for the other two.
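
Howard's table can be reproduced with an ordinary least-squares straight-line fit; a minimal sketch (not part of the original post):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])            # "X"
y = np.array([1.0, 2.0, 3.1])            # "Act Y"

slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
y_pred = slope * x + intercept           # "Pred Y"
error = y - y_pred                       # "Error"

print(y_pred)   # approx. [0.98333, 2.03333, 3.08333]
print(error)    # approx. [ 0.01667, -0.03333, 0.01667]
```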

I think that what you were trying to get at is the statistical conclusion from regression theory that errors at the ends of the range of the data are subject to model errors as well as random errors, whereas data in the middle of the range are only subject to random errors. But this is a statistical conclusion that applies only when large numbers of regressions are carried out on similar data sets, so that model errors can be evaluated. When only a single set of regression results is available, that cannot be assessed, nor will it show up in the results of that single regression.

Howard

Don Dahm
Posted on Tuesday, March 05, 2002 - 4:59 pm:   

Howard - Well, once again, something that I thought I had observed empirically, and believed I understood the theoretical basis for, is in error. Perhaps I should stick to the theory of diffuse reflectance and leave statistics alone. But yes, I thought I had read that the errors are smaller at the center of the range. In my mind, I can even see the plot in a figure with the error limits bowing in around the line. - Don

hlmark
Posted on Wednesday, March 06, 2002 - 7:28 am:   

Don - you were probably thinking of diagrams such as the one on page 82 of Draper & Smith's "Applied Regression Analysis" (2nd ed), Wiley (1998), which shows the hyperbolic confidence bounds around the calculated regression line. If you read the discussion relating to that diagram (which starts on page 80), it derives an expression showing that V(Y) (the variance of the Y errors) increases with (X - X̄)², i.e., as you move further away from the mean of the data.
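
For a straight-line fit, the standard expression behind those hyperbolic bounds gives the variance of the fitted value at a point x0 as s² · (1/n + (x0 - x̄)² / Σ(xi - x̄)²); a minimal sketch of the band half-width (a textbook simple-regression result, not a quotation of Draper & Smith's notation):

```python
import numpy as np

def band_halfwidth(x, y, x0, t_value=2.0):
    """Approximate half-width of the confidence band for the fitted line at x0.

    The (x0 - mean(x))**2 term is what makes the band bow outward (hyperbolically)
    as x0 moves away from the center of the calibration data.  t_value=2.0 is a
    rough stand-in for the appropriate Student's t quantile.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    s2 = np.sum(resid ** 2) / (n - 2)                 # residual variance estimate
    sxx = np.sum((x - x.mean()) ** 2)
    return t_value * np.sqrt(s2 * (1.0 / n + (x0 - x.mean()) ** 2 / sxx))
```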

But going a step back, and asking WHY that error increases, you find that the answer is that, because of the random error in the Y values, the model you derive is never exactly the "true" model, and therefore the results you calculate will depart further and further from the true results the further you get from the center of the data. This is also one of the reasons why extrapolation is not recommended, although this is a side issue here.

More to the point is the fact that if you were to repeat a calibration exercise many times, with different samples each time, every model you generate will be different, again due to the fact that the Y variables have a different set of random errors. Hence an envelope encompassing 95% (or any other confidence level) of the predictions will reflect that increased error at the ends of the range.
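
A small simulation sketch of that repeated-calibration idea (synthetic data, purely illustrative): each replicate adds fresh random errors to the same underlying line, and the scatter of the fitted predictions comes out smallest near the middle of the range and largest at the ends.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 11)             # calibration x values
true_y = 2.0 + 0.5 * x                     # the (normally unknown) "true" model

fitted = []
for _ in range(1000):                      # many hypothetical calibration exercises
    y = true_y + rng.normal(scale=0.3, size=x.size)   # a fresh set of random errors
    slope, intercept = np.polyfit(x, y, 1)
    fitted.append(slope * x + intercept)

print(np.std(fitted, axis=0))              # spread of the fitted line at each x:
                                           # smallest near the center, largest at the ends
```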

But that is all based on the expected behavior of the effect of performing multiple calibrations, and is compared to "truth", which is unknown. In the real world, if you do a single calibration, what you will see will depend on the samples you have. If you have a sample set that follows the recommendations for good sampling, as well as meeting the fundamental assumptions for doing regressions, then what you will see will be a fairly uniform distribution of errors over the range. You won't see the increased error due to use of a non-true model because you don't know "truth" to compare against, neither do you have multiple other models to compare against.

More often, what happens in real sample sets is that the preponderance of samples are in the middle, with only a few samples at the extremes. But because of the behavior of the regression math, samples at the extremes have an inordinate amount of influence on the nature of the model obtained, and therefore often "pull" the model in a direction that minimizes their own errors. For the case of the single model, you have nothing else to compare your results against. In a sense, a single calibration is somewhat more nearly a measure of precision than accuracy (although it's not really precision, either), because you're comparing two quantities which both have error, and can't compare against the unknown "true" values.

Howard

Brian Penttila (Bjpenttila)
Posted on Thursday, March 07, 2002 - 4:28 pm:   

Concerning Howard's interpretation of Draper & Smith:

Howard is almost 100% right, except that the uncertainty of predictions of Y for new X's IS greater in absolute terms at the extremes (and usually greatest in relative terms at the smallest X).

I believe the stuff about "truth" and multiple calibrations in Howard's explanation is a little off-target. Every attempt at calibration with small data sets is an estimate of the "truth." One set of samples gives you one estimate of the truth. If you can take a new set of samples from the same population, you will get a different estimate of the "truth." This second estimate is more likely to be different at the extremes than at the center.

The hyperbolic confidence intervals reflect this uncertainty for any one experimental estimate of the "truth" (one batch of X and Y's sampled from a larger population). This does not stem from trying to get "a statistical conclusion" [see Howard's first response to Don], but from the leverage of data at the extremes (as described in Howard's message) and the uncertainty in estimating that leverage with limited samples from a population. The first estimate of the model is only one of many estimates that could have been constructed from the same sample population had you selected samples differently. Since you are (presumably) applying the model to the larger unknown population, the uncertainty will show up not as a "statistical conclusion" but as a real effect.
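
The leverage Brian refers to can be made concrete for the straight-line case: the leverage of sample i is hi = 1/n + (xi - x̄)² / Σ(xj - x̄)², which is largest for the samples at the extremes. A minimal sketch with illustrative x values (not taken from the posts):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = x.size
h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
print(h)   # [0.6, 0.3, 0.2, 0.3, 0.6] -- the end points carry the most leverage
```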

Howard's example with three data points is artificially constructed to illustrate his point. I can construct other artificial examples to illustrate anything I want:
X Y
1 1
2 2
3 3
4 4
5 5.1
Error is largest at X=4 and X=5

X Y
1 1.1
2 2
3 3
4 4
5 5
Error is largest at X=1 and X=2

In real life examples, the hyperbolic shape WILL SHOW UP experimentally, but can be affected by hidden relationships or constraints in the data.
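
Brian's two five-point examples can be checked directly with an ordinary least-squares fit; a quick sketch (not part of the original post):

```python
import numpy as np

def residuals(x, y):
    """Residuals (actual minus predicted) from a straight-line least-squares fit."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(residuals(x, np.array([1.0, 2.0, 3.0, 4.0, 5.1])))  # largest errors at x=4 and x=5
print(residuals(x, np.array([1.1, 2.0, 3.0, 4.0, 5.0])))  # largest errors at x=1 and x=2
```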

hlmark
Posted on Thursday, March 07, 2002 - 5:07 pm:   

In response to Brian's message, I agree with what he says; it is essentially what I was trying to say, but not as successfully.

I must point out, however, that the "artificially constructed set of three data points" was not my creation, but Don's, who attributed incorrect behavior to it, which I felt compelled to address.

With data sets of more than three samples, it is still true that the data at the ends will have more influence, but that influence can be overcome by the preponderance of the data in the middle, so that the net behavior becomes dependent on the details of the overall makeup of the set, as Brian is demonstrating.

I stand corrected in not properly describing the differences between what will be observed on the calibration data (which is what I had in mind) and prediction data, as Brian shows.

Howard
