Mahalanobis distance vs Leverage

NIR Discussion Forum » Bruce Campbell's List » Chemometrics » Mahalanobis distance vs Leverage


Ciaccheri Leonardo (leonardo)
Junior Member
Username: leonardo

Post Number: 8
Registered: 5-2010
Posted on Wednesday, February 09, 2011 - 3:15 am:   

Dear All,

I have some difficulty understanding why the 1/n offset is introduced in the leverage formula.

This means that, in a PCA, a sample having null scores on every PC (a perfectly mean sample) has a nonzero Leverage equal to 1/n. What's its physical meaning?

Thank you for your help.

Leonardo

Dongsheng Bu (dbu)
New member
Username: dbu

Post Number: 2
Registered: 6-2006
Posted on Friday, June 30, 2006 - 12:47 pm:   

Dear All,

Mathematically, the Unscrambler leverage (L) is an uncentered version of the (PCA, PLS) scores' Mahalanobis distance:
L = 1/n + S*(S'S)^-1*S'
where S are the PC scores and n is the number of samples.

Mahalanobis distance:
D^2 = (x - y)*M^-1*(x - y)'
where x are the individual's values and y are the corresponding means of the model values (these could be PC scores);
M^-1 is the inverse covariance matrix of the model values.
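The relation between these two formulas can be sketched numerically. This is a minimal illustration with synthetic scores, not Unscrambler's actual code; the score matrix `T` below is made up for the demonstration:

```python
import numpy as np

# Leverage vs. Mahalanobis distance computed from mean-centred
# PCA scores T (n samples x k components). Synthetic data.
rng = np.random.default_rng(0)
n, k = 30, 3
T = rng.normal(size=(n, k))
T -= T.mean(axis=0)            # PCA scores are centred by construction

# Leverage: h_i = 1/n + t_i' (T'T)^-1 t_i  (row-wise quadratic form)
G = np.linalg.inv(T.T @ T)
h = 1.0 / n + np.einsum('ij,jk,ik->i', T, G, T)

# Squared Mahalanobis distance: D2_i = t_i' C^-1 t_i,
# with C the covariance of the scores (n-1 divisor)
C = np.cov(T, rowvar=False)
d2 = np.einsum('ij,jk,ik->i', T, np.linalg.inv(C), T)

# The two differ only by the 1/n offset and an (n-1) scale factor:
assert np.allclose(h, 1.0 / n + d2 / (n - 1))
```

Since centred scores have covariance S'S/(n-1), the leverage is just the squared Mahalanobis distance rescaled by n-1, plus the 1/n offset.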

Camo (Unscrambler) has a Method References document available.

Regards,
Dongsheng

Howard Mark (hlmark)
New member
Username: hlmark

Post Number: 16
Registered: 9-2001
Posted on Wednesday, May 17, 2006 - 4:21 am:   

Nieves - it's from the basic definition of the Multivariate Normal Distribution. I don't know that there's any special name for it, but the univariate equivalent would be the Z-test.

Of course, there's also the implicit underlying ASSUMPTION that your data in fact follows that distribution.

Howard

\o/
/_\
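The test Howard refers to can be sketched as follows. This assumes the large-sample case, in which the squared Mahalanobis distance of a new sample from a k-variate normal population is approximately chi-square distributed with k degrees of freedom; the numbers here are purely illustrative:

```python
from scipy import stats

# Under the multivariate-normality assumption, D^2 for a new sample is
# approximately chi-square with k degrees of freedom (large-sample case).
k = 3                                  # number of PCs retained (illustrative)
alpha = 0.01
d2_crit = stats.chi2.ppf(1 - alpha, df=k)
print(f"Flag as non-member at the {alpha:.0%} level if D^2 > {d2_crit:.2f}")

# Equivalently, convert an observed D^2 into a p-value:
d2_obs = 9.0                           # hypothetical observed squared distance
p = stats.chi2.sf(d2_obs, df=k)
```

Note that for small training sets the exact distribution involves an F (Hotelling-type) correction rather than the chi-square approximation.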

Nieves Núñez Romero (edurne)
New member
Username: edurne

Post Number: 2
Registered: 1-2006
Posted on Wednesday, May 17, 2006 - 2:25 am:   

I'm trying to understand your discussion. Normally, when you perform PCA, samples with an H bigger than 3 are classified as outliers. I read in a paper that those samples have a probability of 0.01 or less of being non-members of the group. Does anybody know which statistical test is used?

hlmark (Unregistered Guest)
Unregistered guest
Posted on Friday, May 12, 2006 - 10:29 am:   

Tony - you're probably right about the software writers, but we should be trying to promote standardization. We've got enough trouble with non-standard nomenclature already - let's "fight the good fight" to try to keep it from spreading.

\o/
/_\

Tony Davies (Td)
Moderator
Username: Td

Post Number: 128
Registered: 1-2001
Posted on Friday, May 12, 2006 - 10:17 am:   

Howard,

My comments were from Tom Fearn. The point he was trying to make is that some software writers just like to be different!

Tony

hlmark (Unregistered Guest)
Unregistered guest
Posted on Friday, May 12, 2006 - 8:38 am:   

Tony - I have to disagree with your comment that Mahalanobis Distance can be squared or not "according to taste". Mahalanobis' original paper defines it as the unsquared form.

More importantly, the difference between the squared and unsquared forms is exactly analogous to the difference between variance and standard deviation, and while variances are very important quantities in the proper circumstances, nobody says that you can quote standard deviation one way or the other "according to taste"! Mahalanobis Distance is the generalization of standard deviation to multidimensional space.

Bo - Rich Whitfield has published tables of critical values for Mahalanobis Distance: Appl. Spectrosc. 41(7), 1204-1213 (1987). As the extension of standard deviation, however, you can usually use the "rule of thumb" value of 3 M.D. as a decision threshold for when a sample is not part of the population of comparison - just as you can for standard deviation in the univariate case. That's one of the advantages of sticking to the defined form. It's true that you could also do the comparison using the squared form by squaring all the values in Whitfield's tables, but you shouldn't, just as nobody does univariate hypothesis testing using variances.

Using the value of 3 Mahalanobis Distances as the threshold value for the decision on single samples means that two groups should be separated by a minimum of six Mahalanobis Distances, in order to avoid overlap and potential misclassification. I don't know that the other formulations allow you to make these types of extensions to the classification problem.

Howard

\o/
/_\
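Howard's rule of thumb can be sketched as follows. This is a toy illustration with synthetic data, not any particular package's implementation:

```python
import numpy as np

# Flag a sample as outside the training population when its
# (unsquared) Mahalanobis distance from the centre exceeds 3.
def mahalanobis(x, X_train):
    """Unsquared Mahalanobis distance of x from the training-set centre."""
    mu = X_train.mean(axis=0)
    C_inv = np.linalg.inv(np.cov(X_train, rowvar=False))
    diff = x - mu
    return float(np.sqrt(diff @ C_inv @ diff))

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 4))    # synthetic training scores
inlier = X_train.mean(axis=0)          # the centre itself: distance 0
outlier = X_train.mean(axis=0) + 10.0  # far from every variable's mean

assert mahalanobis(inlier, X_train) < 3.0
assert mahalanobis(outlier, X_train) > 3.0
```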

(Unregistered Guest)
Unregistered guest
Posted on Friday, May 12, 2006 - 7:33 am:   

Great! The upcoming Chemometric Space column on this subject is indeed a coincidence. Please also explicitly mention Mahalanobis distance, as it may be more familiar to some readers than variations of the hat matrix. Another question is how to judge the value reported from any of these tests - what value is large enough to merit a warning? This is somewhat described in the literature but may need repeating and comment.
I am very grateful for your response.

Tony Davies (Td)
Moderator
Username: Td

Post Number: 127
Registered: 1-2001
Posted on Friday, May 12, 2006 - 6:33 am:   

A comment from Tom Fearn:

As usual there is lots of scope for different software using different definitions, but essentially leverage and Mahalanobis Distance are indeed measuring the same thing - distance of a particular sample (spectrum) from the centre of the training data, using a metric that is based on the covariance matrix of the training data.

There will usually be differences in scaling (the MDs being bigger by a factor of n or n-1, for example), and there may be a difference in whether the intercept is included in the model for the purposes of calculating leverage (including it gives an offset of 1/n compared with what you would get from centring the data and not including an intercept). Another possible difference is that leverage is a squared distance, whereas MD can be quoted squared or not, according to taste.

However, all these differences are essentially cosmetic; fundamentally, the two statistics are the same.

Tom

As it happens, Tom's next Chemometric Space column in NIR news 17.4 is titled "Diagnostics 2: leverage and the hat matrix".

Tony

Bo Allvin (Unregistered Guest)
Unregistered guest
Posted on Thursday, May 11, 2006 - 3:11 pm:   

Dear All,
The GRAMS software package provides Mahalanobis distance data for, say, a PLS calibration. The Unscrambler package provides leverage data but not Mahalanobis distances. We are having an argument about whether these two actually express the same information or not. I believe they are mathematically just slightly different - e.g. leverages by definition range between 0 and 1 in the calibration - but they can actually both be used to scrutinize your X-data and provide the same information regarding e.g. whether an NIR spectrum is different from the population or not. Any comments on this?
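The leverage bounds mentioned here can be checked numerically. This sketch uses an ordinary least-squares hat matrix with an intercept column, which is one common definition of leverage (whether a given package defines it exactly this way is an assumption):

```python
import numpy as np

# For a calibration with an intercept, the hat-matrix diagonal h_i lies
# between 1/n and 1, and the leverages sum to the number of parameters.
rng = np.random.default_rng(2)
n, k = 25, 4
X = rng.normal(size=(n, k))
X1 = np.hstack([np.ones((n, 1)), X])       # add the intercept column

H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T   # hat matrix
h = np.diag(H)                             # leverages

assert np.all(h >= 1.0 / n - 1e-12)        # lower bound from the intercept
assert np.all(h <= 1.0 + 1e-12)            # upper bound for any hat matrix
assert np.isclose(h.sum(), k + 1)          # trace of H = number of parameters
```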
