NIR Discussion Forum: Global H value

David Donald (ddonald)
New member
Username: ddonald

Post Number: 1
Registered: 11-2006

Posted on Tuesday, October 31, 2006 - 7:52 pm:

Here is a footnote in history regarding the origin and use of GH, the "standardized Mahalanobis" value used by WinISI (and no other).

The GH value is a modification of the Mahalanobis distance, H, in which H*H (H squared) is divided by the number of dimensions, p, used to derive H.
Obtaining a reference for GH is nigh on impossible since it has (amazingly) never been published and only been referred to as GH.
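
For anyone who wants to see the arithmetic, here is a minimal Python sketch of that definition. It assumes, purely for illustration, that H is computed from p principal-component scores that are mean-centred and uncorrelated, so that H*H reduces to a sum of squared scores divided by their calibration variances. The function names and the example numbers are hypothetical and are not taken from WinISI.

```python
def mahalanobis_sq(scores, score_variances):
    """H*H for one sample, assuming centred, uncorrelated scores (hypothetical setup)."""
    return sum(t * t / v for t, v in zip(scores, score_variances))


def global_h(scores, score_variances):
    """GH = H*H / p, where p is the number of dimensions used to derive H."""
    p = len(scores)
    return mahalanobis_sq(scores, score_variances) / p


# Made-up example: three scores and their variances over a calibration set.
scores = [2.0, -1.0, 3.0]
variances = [4.0, 1.0, 9.0]
h_sq = mahalanobis_sq(scores, variances)  # 4/4 + 1/1 + 9/9 = 3.0
gh = global_h(scores, variances)          # 3.0 / 3 = 1.0
print(h_sq, gh)
```

With correlated variables the sum would have to be replaced by the full quadratic form with the inverse covariance matrix; the division by p stays the same.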

The publication trail for the use of GH originates from the article: J.S. Shenk and M.O. Westerhaus, "Population Definition, Sample Selection and Calibration Procedures for Near Infrared Reflectance Spectroscopy," Crop Science, 1991, 31:469-474. However, this article does not give the formulation of GH and instead refers to another publication: Marten et al., "Forage analysis with near infrared reflectance spectroscopy (NIRS): Analysis of forage quality," a handbook published in 1989 by the United States Department of Agriculture. Strangely enough, that handbook makes absolutely no mention of the GH value that is referenced so heavily in the former article (the primary source of most references to GH).

This is not to say that GH is unfounded. Shenk and Westerhaus (founders of the WinISI software) were pretty smart guys to use GH, since in the late 80's and early 90's computation power was still fairly limited. Why is that important? Well, H*H (i.e. the original Mahalanobis distance squared) has a chi-squared distribution (asymptotically for large sample sizes) with p degrees of freedom (where p is the dimensionality used to calculate H, e.g. if twelve principal components were used, then p = 12). Now, the shape of a chi-squared distribution (CSD) changes with p, and one of those changes is the mean value. However, the mean of a CSD is simple to calculate, since it is equal to the degrees of freedom, p. Hence, if you divide H*H by p, you get a rescaled CSD whose mean value is equal to one.
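
The mean-of-one property is easy to check numerically. The sketch below is only a simulation under the idealised assumption that H*H is exactly chi-squared with p degrees of freedom (built here as a sum of p squared standard normals); the function names, the seed and the sample count are arbitrary choices for the illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def simulated_gh(p):
    """One GH draw: a sum of p squared standard normals (chi-squared, p d.f.) over p."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(p)) / p


def mean_gh(p, n_samples=20000):
    """Average of n_samples simulated GH values for a given p."""
    return sum(simulated_gh(p) for _ in range(n_samples)) / n_samples


means = {p: mean_gh(p) for p in (2, 12, 40)}
for p, m in means.items():
    print(p, round(m, 3))
```

Each printed mean should sit close to one whatever p is, while the spread around that mean (not shown) shrinks as p grows, which is why a single cutoff cannot mean the same thing for every p.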

So how does this help identify outliers and the like? Well, it doesn't really, but for p > 1 a GH value greater than 3 guarantees that there is only a small chance (under 5%, and shrinking as p grows) that the observed value (i.e. the spectrum) is the result of random chance alone, and hence it is most likely an "outlier." Is the magical value of 3 always the right cutoff for an "outlier," you might ask yourself? No.

The magical number (3 above, which I will refer to as M from here on) actually decreases with p. For p = 5, M = 2.214; p = 10, M = 1.83; p = 20, M = 1.57; p = 40, M = 1.39; p = 80, M = 1.27. Now, as p gets really, really large, M approaches one from above. A few tables with this information had been published at the time {Whitfield, R. G., Gerger, M. E., Sharp, R. L.; Applied Spectroscopy; 41 (7), p.1204-1213 (1987)}, but one number is easier to remember than two, and a whole lot easier to remember than lots of numbers.
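
The M values quoted here can be reproduced, to the precision given, as the 95th percentile of a chi-squared distribution with p degrees of freedom, divided by p. The 95% level is inferred from the numbers rather than stated anywhere, so treat it as an assumption. The sketch below uses only the Python standard library: a power series for the chi-squared CDF and bisection for the percentile. The names chi2_cdf and gh_cutoff are made up for this example.

```python
import math


def chi2_cdf(x, p):
    """P(X <= x) for chi-squared with p d.f., via the incomplete-gamma power series."""
    if x <= 0.0:
        return 0.0
    a, z = p / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def gh_cutoff(p, level=0.95):
    """M such that P(GH <= M) = level, found by bisection on the chi-squared CDF."""
    lo, hi = 0.0, p + 10.0 * math.sqrt(2.0 * p) + 20.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, p) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / p


cutoffs = {p: gh_cutoff(p) for p in (5, 10, 20, 40, 80)}
for p, m in cutoffs.items():
    print(p, round(m, 3))
```

Changing the level argument gives the cutoff for any other tail probability, which is the "lots of numbers" that a single GH > 3 rule was standing in for.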

The use of GH is intended to bring in a common scale to assess H values. However, the shape of the distribution of H*H (and hence of GH) changes with p; at the time, due to limited computation, GH > 3 was a safe bet for identifying outliers. Now that we live in a time where computation time is no longer a limiting factor (at least for the GH and H question), we should be looking at probability values instead of GH values. As a case in point, the Unscrambler (registered trademark and all that stuff) uses a Hotelling's T^2 test for outlier detection (note the author has no affiliation with CAMO, to the point that he would rather program in Matlab and C#).
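
As a rough illustration of what "looking at probability values" could mean, the sketch below turns a GH value back into H*H = GH * p and reports the upper-tail probability of a chi-squared distribution with p degrees of freedom. This is a plain chi-squared tail, not the Hotelling's T^2 test mentioned above, and it leans on the same large-sample assumption as the rest of this post; the function names are hypothetical.

```python
import math


def chi2_sf(x, p):
    """P(X > x) for chi-squared with p d.f. (one minus the incomplete-gamma series)."""
    if x <= 0.0:
        return 1.0
    a, z = p / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    # Subtracting from 1 loses precision for extremely small tails; fine for a sketch.
    return 1.0 - total * math.exp(a * math.log(z) - z - math.lgamma(a))


def gh_tail_probability(gh, p):
    """Chance of a GH at least this large if H*H really is chi-squared with p d.f."""
    return chi2_sf(gh * p, p)


tails = {p: gh_tail_probability(3.0, p) for p in (2, 5, 10, 20)}
for p, prob in tails.items():
    print(p, prob)
```

The same GH of 3 corresponds to a very different tail probability depending on p, which is the argument for reporting the probability rather than the GH value.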

So, the GH has been revealed. It has been another case of a useful and technically important innovation gently slipping into the publication stream without adequate reference and description. However, the time of GH has passed us by. Thus ends this footnote in history.

Ainhoa

Posted on Friday, July 23, 2004 - 2:15 am:

Dear all,
This is Ainhoa, an MS student working on her final thesis. I am using NIRS to evaluate feedstuff. My software uses the Global H to detect outliers. I have been looking around and could not find how the program obtains the GH value. Could someone tell me the formula used to get this value?
Many thanks for your help,
>Ainhoa<

hlmark

Posted on Friday, July 23, 2004 - 5:50 am:

Ainhoa - basically, h is the same as Mahalanobis Distance (it's a little hard to tell because it's defined in matrix terms instead of individual sample terms, but if you work through the calculations, they're the same). I know I used to have a reference for it; I'll see if I can still find that.

Howard

\o/
/_\