RPD; how do you spell it?

Klaas Faber (faber)
Member
Username: faber

Post Number: 12
Registered: 9-2003
Posted on Thursday, September 27, 2007 - 7:35 am:   

Howard,

I overlooked your remark "It's also a way to detect an effect of X-error (but that's a different discussion that we never finished)."

In connection with PLS, this topic has been treated rigorously in:

B. Nadler and R.R. Coifman
The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration
Journal of Chemometrics, 19 (2005) 107-118

Note that a companion paper has just received the Kowalski prize for best theoretical paper, see:

http://www.spectroscopynow.com/coi/cda/detail.cda?id=16429&type=Feature&chId=9&page=1

As a matter of fact, Nadler and Coifman extend a lot of old theory for the straight-line fit to PLS. More about the effect of X-error is explained at:

http://www.chemometry.com/Expertise/UVC.html

There you will find, among other things, expressions for the prediction bias incurred when using a straight-line fit that is itself unbiased. The page is admittedly technical, but it hopefully gives a reasonable picture of the state of the art.
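
For readers who prefer a numerical illustration to the algebra, here is a minimal Monte Carlo sketch in Python (numpy/scipy; simulated data only, and not taken from the paper or the page above) of the best-known consequence of X-error in a straight-line fit: ordinary least squares attenuates the estimated slope, which in turn biases the predictions.

import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
true_intercept, true_slope = 1.0, 2.0
slopes = []
for _ in range(2000):
    x_true = rng.uniform(0.0, 10.0, 50)
    y = true_intercept + true_slope * x_true + rng.normal(0.0, 0.5, 50)  # error in y
    x_obs = x_true + rng.normal(0.0, 1.0, 50)                            # error in x as well
    slopes.append(linregress(x_obs, y).slope)                            # ordinary straight-line fit

print(f"true slope           : {true_slope:.3f}")
print(f"mean estimated slope : {np.mean(slopes):.3f}")  # attenuated toward zero by the x-error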

Best regards,

Klaas

Klaas Faber (faber)
Member
Username: faber

Post Number: 11
Registered: 9-2003
Posted on Thursday, September 27, 2007 - 7:22 am:   

Sometimes questions come up of a more or less philosophical nature - non-technical, I mean. Take, for example, the correlation coefficient that seems to be output by most packages. Why is that?

In a recent paper on detection limits I discussed an incorrect calculation method used by a large proportion of WADA-accredited doping labs as follows (download from http://www.chemometry.com/Expertise/LOD.html):

In a classical paper by Bland and Altman [12] (http://www-users.york.ac.uk/~mb55/meas/history.htm: 10,012 citations on the ISI Web of Science, 8 July 2005), a widespread misconception is aptly discussed as follows: "Why has a totally inappropriate method, the correlation coefficient, become almost universally used for this purpose? Two processes may be at work here, namely pattern recognition and imitation. (...) Journals could help to rectify this error by returning for reanalysis papers which use incorrect statistical techniques. This may be a slow process. Referees, inspecting papers in which two methods of measurement have been compared, sometimes complain if no correlation coefficients are provided, even when the reasons for not doing so are given."

Replace "two methods of measurement" with primary and secondary method, and the connection with, e.g., NIR calibration models is obvious.
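
To see the point in numbers, here is a minimal sketch in Python (numpy; simulated data, purely illustrative): two "methods" that correlate almost perfectly yet disagree systematically, which the correlation coefficient hides and a Bland-Altman type summary of the differences reveals at once.

import numpy as np

rng = np.random.default_rng(1)
reference = rng.uniform(10.0, 50.0, 100)                        # primary (reference) method
secondary = 1.2 * reference - 3.0 + rng.normal(0.0, 0.5, 100)   # secondary method, e.g. a calibration

diff = secondary - reference
loa = 1.96 * diff.std(ddof=1)

print(f"correlation r       : {np.corrcoef(reference, secondary)[0, 1]:.4f}")  # essentially 1
print(f"mean difference     : {diff.mean():+.2f}")                             # large and systematic
print(f"limits of agreement : {diff.mean() - loa:+.2f} to {diff.mean() + loa:+.2f}")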

Your suggestions!

Klaas Faber

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 155
Registered: 9-2001
Posted on Wednesday, September 26, 2007 - 9:13 am:   

Klaas - I just read both of your messages, and I had a hard time deciding which thread to send my comments to.

Changing terminology is not going to be a simple matter. The problem arises because chemists and statisticians have both been using the term "bias" to mean different things, and in both cases the usage (near as I can tell) goes back hundreds of years.

So the inclusion of "bias" in software packages as the term to describe the mean difference between predicted and reference values is completely in accord with the chemists' use of the term. Since chemists are probably the main users of instruments incorporating chemometric methods, for an instrument (or software) manufacturer to give them a term they're familiar with, to describe a concept they're familiar with, is not an unreasonable decision.

You can make similar arguments in favor of reserving "bias" for the statistical usage of the term. It never became a problem until "Chemometrics" was born, and brought with it a confluence of the two fields.

The fraction of chemists who deal with chemometrics, and are therefore affected by the dichotomy, is very small; I can't see chemists accepting a change of such an embedded term for the small fraction that are affected.

Similarly, the fraction of statisticians who deal with chemometrics, and are therefore affected by the dichotomy, is also very small; I can't see statisticians in general accepting a change of such an embedded term for the small fraction that are affected, either.

The label is so deeply embedded in each discipline that I just don't see a solution.

On a somewhat different theme, however, I have to argue with your example. True enough, if you use 0 PLS components you'll get the same results as if you use 0 PCR components, or 0 variables in an MLR calibration: the mean predicted value will equal the mean of the reference values, and the "bias" (in the sense of the mean difference between the two sets) will be zero, and certainly not statistically significant.

In that case, however, a simple regression of the residuals against the reference values would give a statistically significant non-zero slope: what I called "Skew Significance t" (SST!). It's also a way to detect an effect of X-error (but that's a different discussion that we never finished).
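
Here is a minimal sketch of the idea in Python (numpy/scipy). The simulated "calibration" is purely illustrative: its predictions are deliberately shrunk toward the mean, a milder version of the zero-component case, so the mean difference is negligible while the residuals slope significantly against the reference values.

import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
y = rng.normal(20.0, 4.0, 60)                                        # reference values
y_hat = y.mean() + 0.7 * (y - y.mean()) + rng.normal(0.0, 1.0, 60)   # predictions shrunk toward the mean

residuals = y_hat - y
fit = linregress(y, residuals)                                       # regress residuals on reference values

print(f"mean difference ('bias')    : {residuals.mean():+.3f}")  # close to zero
print(f"slope of residuals vs. ref  : {fit.slope:+.3f}")
print(f"two-sided p-value for slope : {fit.pvalue:.1e}")          # the skew is highly significant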

I think that simply changing the meaning of the term "bias" isn't going to address the problem, even if we get everyone to accept the new terminology. The issue of what happens with a zero factor calibration isn't going to change regardless of what we call it; the way to address that is to institute a (more or less specific) test for the effect.

\o/
/_\

Klaas Faber (faber)
Junior Member
Username: faber

Post Number: 7
Registered: 9-2003
Posted on Wednesday, September 26, 2007 - 3:18 am:   

Dear all,

This is closely related to a message I just posted in connection with the thread "Bias formula / calibration".

I quote Andrew: "Now the difference between RMSEP and SEP is pretty important! RMSEP is the total error and is equal to the quadrature addition of SEP and Bias. That is,

RMSEP^2 = Bias^2 + SEP^2.

Typically RMSEP and Bias are first calculated, the Bias being the mean difference between the predicted and actual values and the RMSEP being the root mean squared difference."

The formula is obviously correct (it's a defining equation), but equating bias to the mean difference between predicted and actual values is generally difficult to justify. Bias in that formula is TOTAL bias, not just one of many possible components of TOTAL bias. Moreover, that formula should hold for each individual prediction, i.e. on the sample level instead of the set level. Just assume that you model your data with 0 PLS components - only the average for the calibration set. Then you will underestimate the high values and overestimate the low ones, simply because you predict with the average all the time. That's bias - the errors are systematic. But the "mean difference between the predicted and actual values" can be quite small, even "not statistically significant".
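
A quick numerical sketch of that situation in Python (numpy; simulated data only):

import numpy as np

rng = np.random.default_rng(3)
y_cal = rng.normal(50.0, 10.0, 100)           # calibration reference values
y_test = rng.normal(50.0, 10.0, 100)          # test reference values
y_pred = np.full_like(y_test, y_cal.mean())   # 0 PLS components: always predict the calibration average

err = y_pred - y_test
low = y_test < np.median(y_test)

print(f"set-level 'Bias' (mean difference): {err.mean():+.2f}")       # small
print(f"mean error, low  half of the set  : {err[low].mean():+.2f}")   # positive: overestimated
print(f"mean error, high half of the set  : {err[~low].mean():+.2f}")  # negative: underestimated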

I would just:

1. report RMSEP
2. calculate the so-called "Bias" (for some applications it is a useful summary statistic)
3. test the so-called "Bias" for statistical significance (as people like Howard Mark have advocated for many years, but chemometrics packages simply lack what amounts to a standard t-test; see the sketch below)
4. correct RMSEP (and the predictions as well, otherwise you're not consistent) if this so-called "Bias" is significant, and, finally
5. do not call the result SEP unless you are sure that no other bias components contribute to RMSEP.
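
A minimal sketch of steps 1-4 in Python (numpy/scipy; simulated data). The t-test on the mean error and the bias-corrected spread follow the usual textbook definitions, not any particular chemometrics package, and the helper name is mine.

import numpy as np
from scipy.stats import ttest_1samp

def rmsep_bias_sep(y_ref, y_pred, alpha=0.05):
    """Steps 1-4: RMSEP, so-called 'Bias', a t-test on it, and correction if significant."""
    e = np.asarray(y_pred) - np.asarray(y_ref)
    rmsep = np.sqrt(np.mean(e**2))                        # step 1
    bias = e.mean()                                       # step 2
    p_value = ttest_1samp(e, popmean=0.0).pvalue          # step 3
    sep = np.sqrt(np.sum((e - bias)**2) / (e.size - 1))   # spread after removing the mean difference
    corrected = np.asarray(y_pred) - bias if p_value < alpha else np.asarray(y_pred)  # step 4
    return rmsep, bias, p_value, sep, corrected

# Illustration with simulated data: a constant offset of 0.8 plus random noise.
rng = np.random.default_rng(4)
y_ref = rng.normal(30.0, 5.0, 80)
y_pred = y_ref + 0.8 + rng.normal(0.0, 1.0, 80)
rmsep, bias, p, sep, _ = rmsep_bias_sep(y_ref, y_pred)
print(f"RMSEP = {rmsep:.2f}, Bias = {bias:.2f} (p = {p:.1e}), SEP = {sep:.2f}")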

Unfortunately, current terminology is inadequate.

Best regards,

Klaas Faber

Lidia Esteve (veiva)
Junior Member
Username: veiva

Post Number: 9
Registered: 9-2006
Posted on Thursday, October 05, 2006 - 9:02 am:   

Thanks a lot, Andrew. I did not know that; I just realized that with my data SEP and RMSEP were very close.
Thank you, I appreciate the help that all you guys provide in this forum!

Andrew McGlone (mcglone)
Junior Member
Username: mcglone

Post Number: 8
Registered: 2-2001
Posted on Wednesday, October 04, 2006 - 11:37 pm:   

It's not actually a 'trick' at all; I used the wrong word. It is simply what I choose to do to make results easier to read and less cumbersome, with the presumption that readers will know what I'm doing.

Now the difference between RMSEP and SEP is pretty important! RMSEP is the total error and is equal to the quadrature addition of SEP and Bias. That is,

RMSEP^2 = Bias^2 + SEP^2.

Typically RMSEP and Bias are first calculated, the Bias being the mean difference between the predicted and actual values and the RMSEP being the root mean squared difference. The SEP is then calculated as

SEP = sqrt(RMSEP^2-Bias^2).

If indeed your Bias values are low then no worries, use either term. I like to use RMSEP, as it is the more correct measure of what future performance will be like with new data. But if there is significant Bias then I think it is better to assume it will have to be independently measured somehow, with the model predictions then adjusted. In that case the SEP value is useful, as it gives the best picture of the ultimate accuracy.
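
A quick numerical check of that relation in Python (numpy; simulated errors only). Note that the identity holds exactly when SEP is computed with n in the denominator; with the more common n-1 convention the two sides differ slightly for small sets.

import numpy as np

rng = np.random.default_rng(5)
e = rng.normal(0.5, 1.0, 25)         # simulated prediction errors (predicted - actual)

rmsep = np.sqrt(np.mean(e**2))
bias = e.mean()
sep_n = np.sqrt(np.sum((e - bias)**2) / e.size)         # SEP with n in the denominator
sep_n1 = np.sqrt(np.sum((e - bias)**2) / (e.size - 1))  # SEP with the more common n-1

print(f"RMSEP^2              : {rmsep**2:.4f}")
print(f"Bias^2 + SEP^2 (n)   : {bias**2 + sep_n**2:.4f}")   # identical to RMSEP^2
print(f"Bias^2 + SEP^2 (n-1) : {bias**2 + sep_n1**2:.4f}")  # slightly larger for small n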

Lidia Esteve (veiva)
Junior Member
Username: veiva

Post Number: 8
Registered: 9-2006
Posted on Wednesday, October 04, 2006 - 10:18 pm:   

No, no, thanks Andrew!
I started my research with NIR one year ago and, well, everything is still pretty new to me, so all comments are more than welcome! I am a kind of self-learner in this topic, and my mistake sometimes is to just assume things.
I work with PVC pipes, and my biases are usually small. But I do not understand the difference in the information that SEP and RMSEP report. How did you come up with that trick? I thought it was a matter of taste or personal preference. I just chose SEP because it was easier for me to understand.

Andrew McGlone (mcglone)
Junior Member
Username: mcglone

Post Number: 7
Registered: 2-2001
Posted on Wednesday, October 04, 2006 - 8:52 pm:   

Yes, I would use RPD = SDy/SEP unless I wanted a quick indicative answer.

On a tangent to all that, why do you not report Bias? Perhaps it is zero, and then that is fine, but I wouldn't hide it away without very good reason. It is a real measurement error that your predictive model will face in practice. Whether or not Bias is easy to deal with is neither here nor there; it must be dealt with in some way or other if you are to achieve accuracy.
A trick I use is to report RMSEP only if Bias is close to zero, or else SEP and Bias if Bias is large.
Of course I don't know what you are doing, or what your data and purpose are, so I might be completely out of line; please don't take offence if it seems I'm telling you how to "add 1+1 to get 2".

Lidia Esteve (veiva)
Junior Member
Username: veiva

Post Number: 7
Registered: 9-2006
Posted on Wednesday, October 04, 2006 - 7:13 pm:   

Andrew,

My initial concern was how to calculate the RPD. I was calculating it with the R2 equation. So, after what you explained, I came to the conclusion that the equation involving R2 is not so reliable, especially when R2 is not "so good". Am I right? When I report the model accuracy I usually give the R2 of calibration, the RPD from validation, and the SEP. Until now I was calculating the RPD with the R2 equation, since I use The Unscrambler and it is faster for me that way. But I guess I will have to consider RPD = SDy/SEP....

Andrew McGlone (mcglone)
Junior Member
Username: mcglone

Post Number: 6
Registered: 2-2001
Posted on Wednesday, October 04, 2006 - 5:54 pm:   

Lidia, I didn't actually answer your question!

I think you have to make the call on whether RPDc or RPDp; I don't know what your purpose is and I don't know what controls you have in place around your calibration/modelling exercise.

Generally I like to pay close attention to the validation results (i.e., RPDp), as the calibration results can be terribly tied up in the modelling/training exercise and so are not necessarily robust indicators of future performance.

If your data sets are big enough, independent validation is always the way to go in my opinion.

Andrew McGlone (mcglone)
New member
Username: mcglone

Post Number: 5
Registered: 2-2001
Posted on Wednesday, October 04, 2006 - 5:45 pm:   

I've only ever simply used it, not examined it per se, in publications.

The original publication, I believe, was by Phil Williams:

Williams, P.C. (1987) Variables affecting near-infrared reflectance spectroscopic analysis. Pages 143-167 in: Near-Infrared Technology in the Agricultural and Food Industries, 1st Ed. P. Williams and K. Norris, Eds. Am. Assoc. Cereal Chem., St. Paul, MN.

A second edition of that handbook was published in 2004. I have a copy and can see that the RPD material is now discussed in a different chapter: Chapter 8, 'Implementation of Near-Infrared Technology' (pages 145 to 169), by P. C. Williams. It has an interesting table too (on page 165), which gives the following interpretations for various RPD values:

0 to 2.3 very poor
2.4 to 3.0 poor
3.1 to 4.9 fair
5.0 to 6.4 good
6.5 to 8.0 very good
8.1+ excellent

Of course, those qualifiers are necessarily subjective and will depend on the application.


The first science journal report using RPD seems to be in the first volume of JNIRS:

Williams, P.C. and Sobering, D.C. (1993) Comparison of commercial near infrared transmittance and reflectance instruments for analysis of whole grains and seeds. J. Near Infrared Spectrosc. 1, 25-32.

Lidia Esteve (veiva)
Junior Member
Username: veiva

Post Number: 6
Registered: 9-2006
Posted on Wednesday, October 04, 2006 - 5:09 pm:   

Yes, thanks for your comments Andrew.
So, Andrew, the conclusion would be that it is better to use RPDp = SDyCal/SEP or RPDc = SDyCal/SEC.
This is an interesting topic; have you ever published any paper about it?

Andrew McGlone (mcglone)
New member
Username: mcglone

Post Number: 4
Registered: 2-2001
Posted on Wednesday, October 04, 2006 - 5:06 pm:   

I would say for a science publication you would, at the least, write it:

'RPD (ratio of standard error of prediction to sample standard deviation)'.

That way makes sense in terms of the order of letters in the acronym, but it is back to front in terms of how the ratio is actually calculated, namely the standard deviation is divided by the error of prediction. We probably need to ask Phil why it is RPD and not RDP!

Cesar Guerrero (cesar)
Junior Member
Username: cesar

Post Number: 6
Registered: 3-2006
Posted on Wednesday, October 04, 2006 - 2:58 pm:   

Thank you, Andrew.
I hadn't seen your post (it was posted at practically the same time :-)

Cesar Guerrero (cesar)
New member
Username: cesar

Post Number: 5
Registered: 3-2006
Posted on Wednesday, October 04, 2006 - 2:54 pm:   

Many thanks Pierre.
But, which of these 3 options is better for a scientific paper?
a) residual predictive deviation
b) ratio of prediction to deviation
c) ratio performance deviation

Many thanks again!
Best wishes
Cesar

Andrew McGlone (mcglone)
New member
Username: mcglone

Post Number: 3
Registered: 2-2001
Posted on Wednesday, October 04, 2006 - 2:50 pm:   

I'll wade in here, although it has been a few years since I worried about this point. Yes, the RPD numbers are different. My recollection is that this is because, with NIR calibration, the SEC value that you calculate is not the SE you would get from doing a simple regression between predicted and actual values. Pierre's formula would work exactly, giving exact agreement, for a simple regression between actual and predicted, but when the SEC is calculated through a PLS analysis using the hidden dimensions (the spectral dimensions), it doesn't apply. Well, perhaps it would if the slope of the validation data were exactly 1, but I'm not 100% sure even about that. Generally, however, if the agreement is moderate to high, i.e. the R2 values are good, then I've found the formula and the direct calculation are usually close.
Anyway, that is my quick thought on the topic. The disagreement was something that concerned me some years ago when I first started using the parameter. Actually, back then I didn't know the RPD term, invented by Phil Williams I believe, and instead called it SDR (standard deviation ratio). When I finally discovered Phil's term I emailed him, apologised profusely for the apparent duplication, and promised never to use my term again! Unfortunately I've seen it pop up now and again in other publications, but hopefully it is merely dying a slow death.
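
To put numbers on the slope point, here is a minimal sketch in Python (numpy; the "predictions" are simulated and deliberately shrunk toward the mean, so their slope against the reference values is below 1, and the two RPD calculations no longer agree):

import numpy as np

rng = np.random.default_rng(6)
y = rng.normal(40.0, 8.0, 150)                                        # reference values
y_hat = y.mean() + 0.7 * (y - y.mean()) + rng.normal(0.0, 2.0, 150)   # slope vs. reference below 1

e = y_hat - y
sep = np.sqrt(np.sum((e - e.mean())**2) / (e.size - 1))
sd_y = y.std(ddof=1)
r2 = np.corrcoef(y, y_hat)[0, 1]**2

print(f"RPD as SDy/SEP        : {sd_y / sep:.2f}")
print(f"RPD as 1/sqrt(1 - R2) : {1.0 / np.sqrt(1.0 - r2):.2f}")  # noticeably larger here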

Lidia Esteve (veiva)
New member
Username: veiva

Post Number: 5
Registered: 9-2006
Posted on Wednesday, October 04, 2006 - 1:31 pm:   

The formula makes sense, Pierre. It is strange, though, that the results do not always match when using one or the other equation. Do you think this is due to rounding in the statistics packages?

Pierre Dardenne (dardenne)
Intermediate Member
Username: dardenne

Post Number: 17
Registered: 3-2002
Posted on Wednesday, October 04, 2006 - 1:13 am:   

Lidia,

Perhaps I am wrong, it is always possible, but

R2 = coefficient of determination = ratio of the explained variance to the total variance

R2 = (SDy^2 - SEC^2)/SDy^2

R2 = 1 - (SEC^2/SDy^2)

1 - R2 = SEC^2/SDy^2

RPD = SDy/SEC = 1/sqrt(1 - R2)

Pierre

Lidia Esteve (veiva)
New member
Username: veiva

Post Number: 4
Registered: 9-2006
Posted on Tuesday, October 03, 2006 - 2:22 pm:   

Dear Pierre,

My colleague Benoit was telling me that RPD = 1/SQRT(1-R2) does not correlate well with RPDp = SDyCal/SEP or RPDc = SDyCal/SEC, according to Igor K.'s studies. It seems that the results are different.
I have been using RPD = 1/SQRT(1-R2) in my research all the time, and now I am not sure how reliable the statistic can be if I calculate it this way. Where does RPD = 1/SQRT(1-R2) come from?

Thanks,

Lidia

Pierre Dardenne (dardenne)
Member
Username: dardenne

Post Number: 16
Registered: 3-2002
Posted on Tuesday, October 03, 2006 - 2:29 am:   

Cesar,

I would say
ratio of standard error of prediction (or calibration) to standard deviation (of the parameter to be predicted)

RPDp = SDyCal/SEP or RPDc = SDyCal/SEC

Notice that I use SDyCal for both, because most of the time the range of the cal set is wider than the range of the test set. You can have an excellent prediction performance even when the test set has hardly any range at all (SDyPred very small).

Remember also that RPD=1/SQRT(1-R2)
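
As a minimal sketch in Python (numpy; the reference values, SEC and SEP below are made up purely for illustration):

import numpy as np

def rpd(sd_y_cal, standard_error):
    # RPD = SD of the calibration reference values divided by a standard error
    return sd_y_cal / standard_error

y_cal = np.array([12.1, 14.3, 9.8, 16.0, 11.5, 13.7, 15.2, 10.4])  # made-up calibration reference values
sd_y_cal = y_cal.std(ddof=1)

print(f"RPDc = {rpd(sd_y_cal, 0.9):.2f}")  # with a hypothetical SEC of 0.9
print(f"RPDp = {rpd(sd_y_cal, 1.1):.2f}")  # with a hypothetical SEP of 1.1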

Pierre

Cesar Guerrero (cesar)
New member
Username: cesar

Post Number: 4
Registered: 3-2006
Posted on Monday, October 02, 2006 - 2:15 pm:   

Dear all,
I've been reading different definitions of the RPD (ratio of standard deviation to standard error of validation):
- residual predictive deviation
- ratio of prediction to deviation
- ratio performance deviation

Which of them is more appropriate?
Can you advise me?
Thank you very much in advance!!
Best wishes

Cesar
