Author |
Message |
Ciaccheri Leonardo (leonardo)
Junior Member Username: leonardo
Post Number: 6 Registered: 5-2010
| Posted on Monday, June 07, 2010 - 3:35 am: | |
Jerry, do you mean that the choice between centering and not centering has to be made only when applying regression or classification tools, while in exploratory analysis such as PCA, centering is always the best option? Best Regards. Leonardo |
Jerry Jin (jcg2000)
Senior Member Username: jcg2000
Post Number: 29 Registered: 1-2009
| Posted on Friday, June 04, 2010 - 7:38 pm: | |
Leonardo, The three ways you listed for evaluating R^2 are equivalent. They all work as the same metric: the proportion of variation in y attributable to the prediction model. R^2 is fixed for a model whether or not you mean-center your data. If you found that these three equations give you different R^2 values, I suspect you didn't calculate them properly. Data centering prior to PCA is the default because PCA stems from the eigenvalue decomposition of the variance-covariance matrix, where the variance is defined by differences from the mean value. Data centering serves two roles here: it makes the mathematical manipulation easier, and it makes the geometric illustration of the PCs more understandable. In a word, data centering is simply for the convenience of computation. Best, Jerry Jin |
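The link between centering and the variance-covariance matrix mentioned in this reply can be checked numerically. The following is only a minimal sketch on made-up random data (the array `X`, its size, and the column offsets are assumptions, not data from this thread): the eigenvalues of the covariance matrix match the scaled squared singular values of the centered data, but not those of the raw data.

```python
import numpy as np

# Hypothetical data: 30 samples, 5 variables, with non-zero column means.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5)) + np.array([10.0, 0.0, -3.0, 5.0, 1.0])
n = X.shape[0]

# Eigenvalues of the variance-covariance matrix.
# np.cov subtracts the column means and divides by n - 1.
cov = np.cov(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# The same quantities from an SVD of the mean-centered data.
Xc = X - X.mean(axis=0)
svd_vals = np.linalg.svd(Xc, compute_uv=False) ** 2 / (n - 1)

# SVD of the raw (non-centered) data: this decomposes sums of squares,
# so the leading value is dominated by the column means.
raw_vals = np.linalg.svd(X, compute_uv=False) ** 2 / (n - 1)

print("covariance eigenvalues:", np.round(eigvals, 3))
print("centered SVD          :", np.round(svd_vals, 3))
print("non-centered SVD      :", np.round(raw_vals, 3))
```

In this sketch the first two printed rows agree, while the third does not, which is the sense in which centering is built into covariance-based PCA.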
Ciaccheri Leonardo (leonardo)
New member Username: leonardo
Post Number: 5 Registered: 5-2010
| Posted on Friday, June 04, 2010 - 2:59 am: | |
I have purchased Tom's article and it was very interesting. My doubts about R-squared, however, were just one example of the doubts that arose when I worked with non-centered data. A friend, who also works in chemometrics, told me that in spectroscopy it is often better not to center the data because it gives more easily interpretable loadings. My experiments, however, highlighted a series of problems in doing so. Another example is that PCA on non-centered data no longer maximizes the data variance but the data sum of squares. In other words, if your spectra have a high peak with low variance and a low peak with high variance, it is the high peak that goes into PC1 (I have tried it). This means that the PCs are not necessarily ordered by decreasing information content. I have tried to find something in the literature about when it is better to work with non-centered data and how to interpret the results in that case, but with little success. Most of the textbooks and tutorial articles I found assumed the data were centered. Does anyone have advice on this question? Best Regards Leonardo |
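The high-peak/low-peak observation in this post can be reproduced with a small simulation. This is a sketch under invented assumptions (Gaussian bands at channels 30 and 70, the amplitudes, and the noise levels are all hypothetical, not the poster's spectra): one tall band barely varies between samples, one small band varies strongly.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_channels = 40, 100
axis = np.arange(n_channels)

def gaussian_band(center, width=4.0):
    """Unit-height Gaussian band on the channel axis."""
    return np.exp(-0.5 * ((axis - center) / width) ** 2)

# Tall band with almost constant height, small band with strongly varying height.
amp_high = 10.0 + 0.05 * rng.standard_normal(n_samples)
amp_low = 1.0 + 0.30 * rng.standard_normal(n_samples)
X = np.outer(amp_high, gaussian_band(30)) + np.outer(amp_low, gaussian_band(70))

def first_loading(data):
    """First right singular vector, i.e. the PC1 loading of `data` as given."""
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return vt[0]

p_raw = first_loading(X)                        # no centering
p_centered = first_loading(X - X.mean(axis=0))  # column mean-centering

# Channel where each PC1 loading has its largest absolute weight.
peak_raw = int(np.argmax(np.abs(p_raw)))
peak_centered = int(np.argmax(np.abs(p_centered)))
print("PC1 loading maximum, non-centered:", peak_raw)
print("PC1 loading maximum, centered    :", peak_centered)
```

Without centering, PC1 essentially reproduces the mean spectrum and is dominated by the tall band; after centering, PC1 follows the band that actually varies between samples.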
Ian Michael (admin)
Board Administrator Username: admin
Post Number: 27 Registered: 1-2006
| Posted on Friday, May 28, 2010 - 2:53 am: | |
I sincerely hope not!! It would be illegal. All articles can be bought online with a credit card for immediate access. It costs only £12. Just click on the "Buy article on-line" link. |
Ciaccheri Leonardo (leonardo)
New member Username: leonardo
Post Number: 4 Registered: 5-2010
| Posted on Friday, May 28, 2010 - 2:21 am: | |
Thank you very much, Tony. Unfortunately, my institute does not subscribe to NIR news. Do you know if this article can be found somewhere on the web? Best Regards. Leonardo |
Tony Davies (td)
Moderator Username: td
Post Number: 231 Registered: 1-2001
| Posted on Thursday, May 27, 2010 - 4:56 am: | |
Hello Leonardo, I would go for your first method. You might find it useful to read an NIR news article by Tom Fearn; the reference is: Fearn, T., NIR news 11/1, 14 (2000). The most important message about R^2 is "Do not over-interpret"; RMSEP is much more important! Best wishes, Tony |
Ciaccheri Leonardo (leonardo)
New member Username: leonardo
Post Number: 3 Registered: 5-2010
| Posted on Wednesday, May 26, 2010 - 3:24 am: | |
One of the parameters used to assess goodness of fit is the so-called coefficient of determination, R^2. I have found three ways to evaluate it in the literature: 1) The squared correlation coefficient between predicted and reference y-values. 2) The ratio of predicted-y variance to reference-y variance. 3) 1 - (RMSEC^2 / reference-y variance); this last one is exactly true only if you calculate RMSEC and the variances by simply dividing by the number of samples. As long as you work on mean-centered data, all three definitions are equivalent and give the same value. This is not true when you work on non-centered data, where you get three different values (I have tried). My questions are: What is the true definition of R^2? Which is best suited to non-centered data? Thank you for your kind assistance. Leonardo Ciaccheri |
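The three definitions listed in this post can be compared side by side. The sketch below makes several assumptions that are not in the post: it uses a single synthetic predictor, takes a least-squares fit on mean-centered x and y as the "centered" model, and takes a least-squares fit through the origin on raw x and y as the "non-centered" model. All variances and RMSEC are computed by dividing by the number of samples, as the post requires.

```python
import numpy as np

def r2_three_ways(y, y_hat):
    """The three R^2 variants from the post, all with divide-by-n statistics."""
    r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2   # 1) squared correlation
    r2_var = np.var(y_hat) / np.var(y)           # 2) variance ratio
    rmsec_sq = np.mean((y - y_hat) ** 2)         # RMSEC^2, divided by n
    r2_resid = 1.0 - rmsec_sq / np.var(y)        # 3) 1 - RMSEC^2 / var(y)
    return r2_corr, r2_var, r2_resid

# Hypothetical data: y depends linearly on x, with a non-zero offset.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(5.0, 1.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, size=n)

# Model on mean-centered data (the mean is added back for prediction).
xc, yc = x - x.mean(), y - y.mean()
b_centered = (xc @ yc) / (xc @ xc)
y_hat_centered = y.mean() + b_centered * xc

# Model on non-centered data: least squares through the origin.
b_raw = (x @ y) / (x @ x)
y_hat_raw = b_raw * x

centered = r2_three_ways(y, y_hat_centered)
raw = r2_three_ways(y, y_hat_raw)
print("centered    :", np.round(centered, 4))
print("non-centered:", np.round(raw, 4))
```

In this setup the centered row shows one value three times, while the non-centered row shows three different numbers: the residuals of the through-origin fit no longer average to zero and are no longer uncorrelated with the predictions, so the variance decomposition behind definitions 2 and 3 breaks down.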