NIR Discussion Forum: Determination coefficient and non-centered data


Ciaccheri Leonardo (leonardo)
Junior Member
Username: leonardo

Post Number: 6
Registered: 5-2010

Posted on Monday, June 07, 2010 - 3:35 am:

Jerry,

Do you mean that the choice between centering and not centering has to be made only when applying regression or classification tools, while in exploratory analysis, like PCA, centering is always the best option?

Best Regards.

Leonardo

Jerry Jin (jcg2000)
Senior Member
Username: jcg2000

Post Number: 29
Registered: 1-2009

Posted on Friday, June 04, 2010 - 7:38 pm:

Leonardo,

The three ways you listed for evaluating R^2 are equivalent. They all serve as the same metric: the proportion of variation in y attributable to the prediction model.

R^2 is fixed for a model whether or not you mean-center your data. If you found that these three equations give you different R^2 values, I suspect you didn't calculate them properly.
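
A short derivation may help pin down when the three definitions coincide. It is only a sketch, and it assumes a least-squares fit that includes an intercept (or, equivalently, mean-centered x and y), with all variances and covariances divided by the number of samples n. The symbols \hat{y}_i (predicted value), e_i (residual) and \bar{y} (mean) are introduced here for illustration and do not appear in the posts.

```latex
% Assumptions: least-squares fit with an intercept, residuals e_i = y_i - \hat{y}_i,
% all variances and covariances divided by n.
\begin{align*}
\sum_i e_i = 0, \quad \sum_i e_i \hat{y}_i = 0
  &\;\Rightarrow\; \operatorname{cov}(e, \hat{y}) = 0, \quad \bar{\hat{y}} = \bar{y} \\
\operatorname{var}(y) &= \operatorname{var}(\hat{y}) + \mathrm{RMSEC}^2 \\
r^2(y, \hat{y})
  &= \frac{\operatorname{cov}(y, \hat{y})^2}{\operatorname{var}(y)\,\operatorname{var}(\hat{y})}
   = \frac{\operatorname{var}(\hat{y})^2}{\operatorname{var}(y)\,\operatorname{var}(\hat{y})}
   = \frac{\operatorname{var}(\hat{y})}{\operatorname{var}(y)}
   = 1 - \frac{\mathrm{RMSEC}^2}{\operatorname{var}(y)}
\end{align*}
```

If the model has no intercept and the data are not centered, the residuals need not average to zero, so the first line generally fails and the three expressions are no longer guaranteed to agree.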

Data-centering prior to PCA is the default because PCA stems from the eigenvalue decomposition of the variance-covariance matrix, where variance is defined by differences from the mean value. Data-centering serves two roles here: it makes the mathematical manipulation easier, and it makes the geometric illustration of the PCs more understandable. In a word, data-centering is simply for the convenience of computation.

Best,

Jerry Jin

Ciaccheri Leonardo (leonardo)
New member
Username: leonardo

Post Number: 5
Registered: 5-2010

Posted on Friday, June 04, 2010 - 2:59 am:

I have purchased Tom's article and it was very interesting. My doubts about R-squared, however, were just one example of the doubts that arose while working with non-centered data.

A friend who also works in chemometrics told me that in spectroscopy it is often better not to center the data, because this gives more easily interpretable loadings.

My experiments, however, highlighted a series of problems with doing so. Another example is that PCA on non-centered data no longer maximizes the data variance but the data sum of squares. In other words, if your spectra have a high peak with low variance and a low peak with high variance, it is the high peak that ends up on PC1 (I have tried it). This means that the PCs are not necessarily ordered by decreasing information content.
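
The effect described above can be reproduced with a small sketch. It assumes a toy data set with only two wavelengths, so that PC1 can be found in closed form from the 2x2 cross-product matrix; the function name `leading_direction` and the `spectra` values are made up for illustration and are not data from this thread.

```python
import math

def leading_direction(rows, center):
    """Unit vector of PC1 for two-variable data, with or without mean-centering."""
    n = len(rows)
    if center:
        means = [sum(r[j] for r in rows) / n for j in range(2)]
        rows = [(r[0] - means[0], r[1] - means[1]) for r in rows]
    # Entries of the symmetric 2x2 matrix X'X / n
    # (the covariance matrix when centered, the mean cross-product matrix otherwise)
    a = sum(r[0] * r[0] for r in rows) / n
    b = sum(r[0] * r[1] for r in rows) / n
    c = sum(r[1] * r[1] for r in rows) / n
    # Closed-form angle of the major axis of a symmetric 2x2 matrix
    theta = 0.5 * math.atan2(2 * b, a - c)
    return math.cos(theta), math.sin(theta)

# Hypothetical spectra at two wavelengths: (high peak, low peak).
# The high peak barely varies; the low peak varies a lot.
spectra = [(10.0, 0.5), (10.1, 1.5), (9.9, 1.0),
           (10.0, 1.5), (10.1, 0.5), (9.9, 1.0)]

pc1_raw = leading_direction(spectra, center=False)
pc1_centered = leading_direction(spectra, center=True)
print("PC1 loadings without centering:", pc1_raw)
print("PC1 loadings with centering:   ", pc1_centered)
```

With these made-up numbers, the non-centered PC1 loads almost entirely on the high, nearly constant peak (it mostly describes the mean spectrum), while the centered PC1 loads on the low peak that actually varies.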

I have tried to find something in the literature about when it is better to work with non-centered data and how to interpret the results in that case, but with little success. Most of the textbooks and tutorial articles I found assumed the data were centered.

Does anyone have some advice to give on this question?

Best Regards

Leonardo

Ian Michael (admin)
Board Administrator
Username: admin

Post Number: 27
Registered: 1-2006

Posted on Friday, May 28, 2010 - 2:53 am:

I sincerely hope not!! It would be illegal.

All articles can be bought online with a credit card for immediate access. It costs only £12. Just click on the "Buy article on-line" link.

Ciaccheri Leonardo (leonardo)
New member
Username: leonardo

Post Number: 4
Registered: 5-2010

Posted on Friday, May 28, 2010 - 2:21 am:

Thank you very much Tony.

Unfortunately my institute is not a subscriber of NIR news. Do you know if this article can be found somewhere on the web?

Best Regards.

Leonardo

Tony Davies (td)
Moderator
Username: td

Post Number: 231
Registered: 1-2001

Posted on Thursday, May 27, 2010 - 4:56 am:

Hello Leonardo,

I would go for your first method.

You might find it useful to read an NIR news article by Tom Fearn; the reference is: Fearn, T., NIR news 11/1, 14 (2000).

The most important message about R^2 is "Do not over-interpret". RMSEP is much more important!

Best wishes,

Tony

Ciaccheri Leonardo (leonardo)
New member
Username: leonardo

Post Number: 3
Registered: 5-2010

Posted on Wednesday, May 26, 2010 - 3:24 am:

One of the parameters used to assess goodness of fit is the so-called determination coefficient, R^2. I have found three ways to evaluate it in the literature:

1) The squared correlation coefficient between predicted and reference y-values.
2) The ratio of predicted-y variance over reference-y variance.
3) 1 - (RMSEC^2 / reference-y variance); this last one holds exactly only if RMSEC and the variances are both calculated by simply dividing by the number of samples.

As long as you work with mean-centered data, all three definitions are equivalent and give the same value. This is not true when you work with non-centered data, where you get three different values (I have tried it).
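
A minimal sketch of this observation, assuming "working on non-centered data" means fitting a model without an intercept to raw, non-centered x and y. The helper name `r2_three_ways` and the five data points are hypothetical; variances and RMSEC^2 are all divided by the number of samples, as in definition 3.

```python
def r2_three_ways(y_ref, y_pred):
    """Return R^2 by definitions 1, 2 and 3, dividing by n throughout."""
    n = len(y_ref)
    m_ref = sum(y_ref) / n
    m_pred = sum(y_pred) / n
    var_ref = sum((v - m_ref) ** 2 for v in y_ref) / n
    var_pred = sum((v - m_pred) ** 2 for v in y_pred) / n
    cov = sum((a - m_ref) * (p - m_pred) for a, p in zip(y_ref, y_pred)) / n
    msec = sum((a - p) ** 2 for a, p in zip(y_ref, y_pred)) / n  # RMSEC^2
    r2_corr = cov ** 2 / (var_ref * var_pred)   # 1) squared correlation
    r2_ratio = var_pred / var_ref               # 2) variance ratio
    r2_resid = 1 - msec / var_ref               # 3) 1 - RMSEC^2 / variance
    return r2_corr, r2_ratio, r2_resid

# Hypothetical non-centered data: one predictor x and reference values y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 3.9, 5.2, 5.8, 7.0]

# Least-squares fit WITHOUT an intercept: y_hat = b * x
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
y_hat = [b * xi for xi in x]

no_intercept = r2_three_ways(y, y_hat)
print("corr^2, variance ratio, 1 - RMSEC^2/var:", no_intercept)
```

For this toy set the three numbers come out at roughly 0.99, 2.50 and 0.57, so they clearly disagree, and the variance ratio even exceeds 1.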

My questions are: what is the true definition of R^2? Which one is best suited to non-centered data?

Thank you for your kind assistance.

Leonardo Ciaccheri