RPD revisited

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 382
Registered: 9-2001
Posted on Thursday, December 23, 2010 - 4:15 pm:   

Bob - since it's already here, we might as well keep it here, unless you'd rather just do it via e-mail.

But since there'll be a hiatus between now and when you're ready to pick the topic up again, may I make a suggestion: if you're not already familiar with it, read up on the statistical technique of ANOVA (Analysis of Variance), to find out why statisticians are so enamored of working with variances rather than SDs, despite the fact that variances are not in the same units as the data.
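A minimal sketch of the key point, using hypothetical group data (the group means and noise level are illustrative only): in ANOVA, the total sum of squares partitions additively into between-group and within-group pieces, which is exactly what SDs cannot do.

import numpy as np

rng = np.random.default_rng(0)
# Three hypothetical groups with different means and common noise.
groups = [rng.normal(mu, 1.0, size=20) for mu in (5.0, 6.0, 7.5)]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()

ss_total = ((all_data - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Between- and within-group sums of squares add up exactly to the total;
# no such additive decomposition exists for standard deviations.
print(ss_total, ss_between + ss_within)   # equal, to floating-point precision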

Also, a final comment: I'm not against RPD per se, only against the fact that nobody knows what its sampling distribution is, and without that information, throwing the term around is nearly meaningless as a scientific quantity.

\o/
/_\

Bob Jordan (jordan)
New member
Username: jordan

Post Number: 3
Registered: 3-2003
Posted on Thursday, December 23, 2010 - 4:01 pm:   

Howard and others: this has been an interesting discussion, unfortunately interrupted for me by a pre-Xmas work rush. I would like to continue it in the new year. However, it has now drifted significantly off-thread (or at least the things I raised have). I have written a couple of pages of thoughts on RPD, Ya, and the analysis of standard deviations, which are too long, and probably of too little interest to NIR readers, to post here. May I raise them with you early in 2011?

This is a topic I (and my colleague Andrew McGlone) have pondered for the last 6 or 7 years, and we believe it is important. But where to put it?

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 381
Registered: 9-2001
Posted on Tuesday, December 21, 2010 - 9:16 am:   

Jeff - the distribution wouldn't matter if the analyte values were linearly related to the spectroscopic values. But I recently published a paper showing that for most of the common types of reference lab measurements there is an intrinsic non-linearity in the data, even when Beer's Law holds (see Appl. Spect., 64(9), p. 995-1006 (2010)). This non-linearity is over and above the non-linearities caused by scattering effects in powders, and is not affected by any transforms applied to the spectral data.

In the face of non-linearity, the distribution of the data may very well affect the ranking system.

\o/
/_\

Jeff Kallestad (jkallestad)
New member
Username: jkallestad

Post Number: 2
Registered: 12-2010
Posted on Monday, December 20, 2010 - 4:52 pm:   

Daniel, please note that the three-case example in the original post is hypothetical. I wasn't testing different approaches to selecting a calibration sample set; I am merely questioning the relevance of the Y variance in the RPD statistic, which is often used to compare different models.

Howard and Bob:

It seems to me that the distribution of the Y values used in a PLS model training set should not influence a ranking system used to evaluate the utility of that model for future predictions, as it does in the RPD statistic. It seems irrelevant to me. Sure, a ratio of Y variance to bias-corrected prediction error variance produces a value that says something about that model. I just think that a value for model utility ranking should be based on the average estimated prediction error (SE of performance or RMSEP) with respect to a fixed point of reference, not relative to the variance of the Y values in the training set.

Perhaps a ranking system based on the model RMSEP expressed as a percentage of the mean analyte value (MAV) for the model could be developed (yes, developed empirically and with appropriate statistical tests, as Howard pointed out in the first response). The Y-value distribution is built into this percentage, but in an indirect way that cancels out. So, as an untested example: models with RMSEP between 0.5% and 1% of MAV may be good enough for quality control; RMSEP between 1% and 2.5% of MAV may be acceptable for screening to generate reference data for genetic analysis; RMSEP between 2.5% and 5% of MAV may be good enough for coarse screening or ranking; and models with RMSEP above 15% of MAV probably shouldn't be used at all, etc. I've probably overlooked something; any thoughts?
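A minimal sketch of that proposal in Python (the function names are illustrative, and the bands follow the untested example above; note the 5-15% of MAV region is left unassigned in the post):

import numpy as np

def rmsep_percent_of_mav(y_ref, y_pred):
    # RMSEP expressed as a percentage of the mean analyte value (MAV).
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmsep = np.sqrt(np.mean((y_ref - y_pred) ** 2))
    return 100.0 * rmsep / y_ref.mean()

def rank_model(pct):
    if pct <= 1.0:
        return "quality control"
    if pct <= 2.5:
        return "screening (reference data for genetic analysis)"
    if pct <= 5.0:
        return "coarse screening or ranking"
    return "probably should not be used"   # the post assigns >15% of MAV here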

Daniel Alomar (dalomar)
Junior Member
Username: dalomar

Post Number: 6
Registered: 2-2009
Posted on Monday, December 20, 2010 - 3:58 pm:   

Jeff,
You tested different approaches to selecting the calibration set in order to develop prediction (ranking) models. A question: how much did the samples (I assume they are genetic lines) change position in the ranking once you predicted their analyte values with the different models you developed?
Daniel

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 380
Registered: 9-2001
Posted on Sunday, December 19, 2010 - 6:35 pm:   

Jeff, now I think I see where you're coming from. For starters, we can answer your direct question "... who measures variability using variance?" The answer is "every statistician", and for a good reason that you alluded to yourself when you said "... Because it is effectively a standard deviation one cannot add it algebraically". The point is, of course, that variances CAN be added algebraically, and that makes all the difference. And when you say that "one must use vector addition", you're saying that

[Total variability] = sqrt (variability1 ^ 2 + variability2 ^ 2)

and since the variabilities are (again by your own statement) standard deviations, and variances are the squares of the standard deviations, this is the same as saying

[Total variability] = sqrt (variance1 + variance2)

The only difference between your formulation and the standard statistical formulation is that the statisticians would square both sides of the equation to get

[Total variance] = variance1 + variance2

I don't see that there's such a big difference between the two viewpoints. The MUCH bigger difference and more important point, in regard to the original question, is that the distributions of variances are known, while the distribution of RPDs is still unknown, so that you can't set confidence limits or do statistical hypothesis testing using RPDs.

But once you have the equation in the form

[Total variance] = variance1 + variance2

then you can divide both sides of the equation by the total variance on the LHS:

1 = (variance1 / [Total variance]) + (variance2 / [Total variance])

and it becomes clear that the fractions of the total variance represented by the two terms comprise the entirety.

And while I agree with you that to talk about "fraction of variability" makes no sense, from the above expression we find that talking about "fractions of variance" makes a lot of sense, because the sum of the two fractions makes up the whole.

The only remaining sticking point is that the values involved are not in the same units as the original data. This often makes the relationship to the original data less intuitively clear, which is unfortunate.

Also, I can't put much stock in the Wikipedia definition of "variability", not only because it's circular, but more importantly because it doesn't lend itself to a mathematical definition. "Variance", on the other hand, is defined mathematically very precisely:

Var = sum((Xi - Xbar)^2) / (n - 1)

This definition not only allows you to compute the variance itself, but you can use that formula in mathematical derivations and proofs to determine other useful quantities.
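A minimal numerical check of both points above, using simulated data (the two error sources and their SDs are arbitrary choices):

import numpy as np

def sample_var(x):
    xbar = x.mean()
    return ((x - xbar) ** 2).sum() / (len(x) - 1)   # Var = sum((Xi - Xbar)^2)/(n-1)

rng = np.random.default_rng(1)
e1 = rng.normal(0.0, 2.0, size=100_000)   # source 1: variance ~ 4
e2 = rng.normal(0.0, 3.0, size=100_000)   # source 2: variance ~ 9, independent

total = e1 + e2
print(sample_var(e1) + sample_var(e2))    # ~ 13
print(sample_var(total))                  # also ~ 13: the variances add
# The corresponding SDs (2 and 3) do NOT add to the total SD (~3.6, not 5).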

\o/
/_\

Bob Jordan (jordan)
New member
Username: jordan

Post Number: 2
Registered: 3-2003
Posted on Sunday, December 19, 2010 - 3:01 pm:   

I looked up Wikipedia and they said:

The term variability, "the state or characteristic of being variable", describes how spread out or closely clustered a set of data is.

So to me that means an SD, not a variance.

When you talk about the scatter of a sample, you describe its mean and SD as a first cut. You do not describe its variance - almost EVER.

So to say that R^2 describes the variability explained is ... well, hogwash!

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 379
Registered: 9-2001
Posted on Sunday, December 19, 2010 - 2:45 pm:   

Bob - I could almost buy your arguments, but how would you define "variability"?

\o/
/_\

Bob Jordan (jordan)
New member
Username: jordan

Post Number: 1
Registered: 3-2003
Posted on Sunday, December 19, 2010 - 1:36 pm:   

One needs to think about what RPD is. It is in fact a very simple transform of the adjusted R^2 value - nothing more.

In fact

RPD = 1/sqrt(1 - Radj^2)

As long as you get the degrees of freedom correct.

But the RPD itself is defined as just a simple ratio of the SE of fit to the SD of the data, taken in an inverse sense (SD over SE).

From this one can see that it is hyperbolic in form, and is extremely non-linear.

I prefer here to use the inverse of the RPD, which seems to make more sense and gives a more linear relationship.

We call this variable Ya, written with the reversed Russian R character (Я). Unfortunately this character can be harder to type in some situations.

The use of this term makes sense when you see it defined as

Ya^2 = 1 - Radj^2 = (SEfit/SD)^2

OR simply

Я^2 = 1 - R^2

In fact the R to use must be the adjusted R to get the degrees of freedom correct. Now:

Ya = SEfit/SD

Its quality direction is the reverse of R^2's, in that smaller is better.

One nice feature of the Ya term is that one can plot Ya against the number of PLS factors on a pure percentage scale. With no factors, the SE equals the SD of the data; as factors are added, Ya tends to decrease. This makes such plots much simpler to interpret and compare than the equivalent RPD plots.

Another feature of Ya (Я) is that it is a measure of the "variability" (not variance) that is "unexplained".
Because it is effectively a standard deviation, one cannot add it algebraically; one must use vector addition (Pythagoras) and take into account any correlation components... But I always hated that definition of R^2 as the fraction of variability explained - who measures variability using variance?
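A minimal numerical check of the Ya/RPD identities for an ordinary least-squares fit (the data and the one-predictor model are hypothetical; the degrees-of-freedom corrections follow the definitions above):

import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 1
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 * x + rng.normal(0.0, 1.5, size=n)      # hypothetical data

resid = y - np.polyval(np.polyfit(x, y, deg=p), x)
ss_res = (resid ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()

se_fit = np.sqrt(ss_res / (n - p - 1))          # SE of fit, d.f.-corrected
sd = np.sqrt(ss_tot / (n - 1))                  # SD of the reference data
r2_adj = 1.0 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))

print(se_fit / sd, np.sqrt(1.0 - r2_adj))       # Ya both ways: identical
print(sd / se_fit, 1.0 / np.sqrt(1.0 - r2_adj)) # RPD both ways: identical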

One could go into a whole paper on this side of things.

Bob J.

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 378
Registered: 9-2001
Posted on Sunday, December 19, 2010 - 12:31 pm:   

Jeff - first, some truisms:

1) No single statistic is going to tell you everything you want to know about a calibration.

2) Just about any statistic you might calculate is going to depend on the distribution of the data.

3) Before the advent of Chemometrics, with its attendant tendency for everybody and his brother to create new statistics (and acronyms for them) at the drop of a hat, statistics were created by statisticians, who were very circumspect about any new statistic they might create. For starters, they would ordinarily specify any conditions the data had to meet in order for their new statistic to be valid, which might very well include the distribution of the data.

"Validity" (my term, BTW, not theirs) meant that the sampling distribution of the statistic had to be known, so that confidence limits for a given computed value of a statistic could be used to determine whether that statistic was "significant" (this is the proper statisticians' term) or not. "Statistically significant" is the formal term that means that the difference is larger than the random variations superimposed on the data can account for, and therefore allows you to tell objectively whether there is indeed a real difference between two values of that statistic, and by extension of the underlying data sets that gave rise to them.

In the absence of being able to show statistical significance, you cannot say anything meaningful about differences between statistics computed for data under different circumstances. But in order to determine statistical significance you need to determine the corresponding confidence intervals, which in turn means you need to know the sampling distributions of those statistics.

For most of the well-known statistics (Normal, t, chi-square, F, R, etc.) the mathematics underlying them are well-known and the sampling distribution and confidence intervals can be computed from analytic mathematical expressions.

For others, which could not be reduced to analytic mathematics, sampling distributions were determined empirically, sometimes using computer simulations and, especially in the early days, by doing actual physical experiments, repeated many (hundreds and thousands, even) times to ensure that the distributional properties of the statistic were well-characterized.
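A minimal sketch of that empirical approach applied to RPD (the error structure, reference distribution, and sample size are assumed for illustration): simulate many data sets, compute the statistic each time, and tabulate the results to characterize its sampling distribution.

import numpy as np

rng = np.random.default_rng(3)
rpds = []
for _ in range(10_000):
    y = rng.normal(10.0, 2.0, size=50)            # reference values
    y_hat = y + rng.normal(0.0, 1.0, size=50)     # predictions with known error
    sep = np.sqrt(np.mean((y - y_hat) ** 2))
    rpds.append(y.std(ddof=1) / sep)

# Empirical percentiles of RPD under these (assumed) conditions; with such a
# table one could begin to set the confidence limits that RPD currently lacks.
print(np.percentile(rpds, [2.5, 50.0, 97.5]))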

But no statistician worth the name would use a statistic that did not have a known sampling distribution, or without knowing the conditions (e.g., how was the data distributed? were the errors random and independent? how were the errors distributed?) under which that sampling distribution was applicable. If he could not satisfactorily answer those questions, and also show that the data at hand in fact met all the conditions specified, he would not use that statistic. He would use a different statistic that was sensitive to the same properties of the data he was trying to investigate. There is a large class of statistics called "non-parametric" statistics, which are relatively insensitive to the conditions of the data. Their corresponding downside is that they are generally less sensitive at detecting real differences between data sets.

The bottom line is that it's easy to compute a statistic, but it's not so easy to determine its meaning. Statisticians have spent thousands of man-years coming up with ways to ensure that a given statistic you calculate is meaningful.

In creating the RPD statistic Phil meant well, but by failing to follow up and determine its statistical properties, I'm afraid he left questions like yours unanswerable, at least in an objective and rigorous manner.

\o/
/_\

Jeff Kallestad (jkallestad)
New member
Username: jkallestad

Post Number: 1
Registered: 12-2010
Posted on Saturday, December 18, 2010 - 3:42 pm:   

I am trying to evaluate the usefulness of several PLS models I have generated. My question is about the assumptions behind the RPD value proposed by P.C. Williams.

To illustrate, assume three hypothetical sample sets for PLS analysis. The first set is generated by selecting a smaller population of samples from a much larger data set, so that the model development set has Y values evenly distributed across the range of analyte variation. The second set is collected randomly from a wild population where the Y values are distributed normally (extreme values under-represented). The third set is collected randomly from a controlled breeding population having opposite extremes of the analyte, where the distribution of Y values is bimodal, i.e. extreme values are over-represented and intermediate values are scarce. Also assume that the sample number, mean, and range of Y variation are the same for each set, and that the RMSEP of the PLS model generated from each set is equivalent.

Using the RPD computed for each of these sets, and applying the Williams ranking paradigm, the outcome could vary by 2 to 3 ranking levels simply because the different distributions of Y values produce different standard deviations of Y, and not because of the estimated error. Should we assume that RPD is most applicable to models developed from normally distributed Y values? Are there alternative ways to compare models and assess their predictive capacity for screening, generation of reference data for genetic analysis, quality control, etc.?
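A minimal simulation of the three hypothetical sets (all numbers are illustrative; the sets have roughly matched means and ranges, equal n, and identical RMSEP, differing only in Y distribution):

import numpy as np

rng = np.random.default_rng(4)
n, rmsep = 200, 0.5                                   # same error for all sets

y_sets = {
    "uniform": rng.uniform(5.0, 15.0, size=n),
    "normal":  np.clip(rng.normal(10.0, 1.7, size=n), 5.0, 15.0),
    "bimodal": np.concatenate([rng.normal(6.0, 0.5, size=n // 2),
                               rng.normal(14.0, 0.5, size=n // 2)]),
}

for name, y in y_sets.items():
    sd = y.std(ddof=1)
    print(f"{name:8s} SD = {sd:4.2f}  RPD = {sd / rmsep:4.1f}")
# Same RMSEP throughout, yet RPD (= SD/RMSEP) differs by roughly a factor of
# two or more across the sets - enough to shift 2-3 Williams ranking levels.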
