NIR Discussion Forum: Sample selection for calibration

Tony Davies (td)
Moderator
Username: td

Post Number: 141
Registered: 1-2001

Posted on Thursday, February 08, 2007 - 4:19 am:

Dear Christian,

The potential problem I wanted to warn you about is how you answer the question "Where does your validation set come from?" It is quite a common error for people to use the samples left over from the selection program. The problem is that these are no longer independent samples: every left-over sample will (if the selection program is any good!) already be represented among the selected training samples. The suggested way of using selection programs is to select enough samples to make up both the training set and the validation set, and then to assign the selected samples to the two sets at random. This may not make "perfect" calibration sets, but it should avoid over-optimistic SEPs.
Best wishes,
Tony
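
The random assignment Tony describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_selected`, the fixed seed, the sample IDs and the 8/4 split are all hypothetical, and the selection program that produces the IDs is taken as given.

```python
import random


def split_selected(selected_ids, n_calibration, seed=0):
    """Randomly assign already-selected samples to calibration and validation sets."""
    if not 0 < n_calibration < len(selected_ids):
        raise ValueError("n_calibration must leave at least one validation sample")
    shuffled = list(selected_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return shuffled[:n_calibration], shuffled[n_calibration:]


# Hypothetical example: a selection program returned these 12 sample IDs,
# enough for BOTH sets; the left-over samples are not used for validation.
selected = [3, 17, 42, 58, 91, 104, 230, 377, 512, 640, 801, 999]
calibration, validation = split_selected(selected, n_calibration=8)
print(sorted(calibration))
print(sorted(validation))
```

Because both sets are drawn from the same selected pool, neither set is made up of samples that the selection step rejected as redundant.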

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 71
Registered: 9-2001

Posted on Wednesday, February 07, 2007 - 5:37 pm:

Christian - there's a fairly simple sample-selection algorithm described in "Unique-Sample Selection via Near-Infrared Spectral Subtraction", Anal. Chem. 57, pp. 2299-2303 (1985). It's based on the use of Gauss-Jordan curve fitting to remove common spectral features, and at every step the largest residual feature identifies the sample with the "most different" spectrum. Strictly speaking it's not really a multivariate algorithm, but it is still fairly effective. It's easy to code in almost any language, and simple enough that even slow languages like BASIC will run it quickly on large data sets.

Howard

\o/
/_\
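
One possible reading of Howard's description, sketched in Python. This is not the published procedure: it assumes that "largest residual feature" means the largest absolute value left anywhere in the unselected spectra, and that the Gauss-Jordan step subtracts a scaled copy of the chosen spectrum from every other one so that this wavelength becomes zero. The name `select_unique_samples`, the `tol` cut-off and the toy data are hypothetical, and any pre-treatment used in the paper is left out.

```python
def select_unique_samples(spectra, n_select, tol=1e-12):
    """Pick up to n_select row indices by repeated pivot-and-subtract (a sketch)."""
    work = [list(row) for row in spectra]  # working copy; each row is one spectrum
    chosen = []
    while len(chosen) < n_select:
        # Locate the largest absolute residual among the unselected spectra.
        best_i, best_j, best_abs = None, None, tol
        for i, row in enumerate(work):
            if i in chosen:
                continue
            for j, value in enumerate(row):
                if abs(value) > best_abs:
                    best_i, best_j, best_abs = i, j, abs(value)
        if best_i is None:
            break  # nothing but noise-level residuals left
        chosen.append(best_i)
        pivot_row = work[best_i]
        pivot = pivot_row[best_j]
        # Remove the chosen spectrum's contribution from every remaining spectrum.
        for i, row in enumerate(work):
            if i in chosen:
                continue
            factor = row[best_j] / pivot
            for k in range(len(row)):
                row[k] -= factor * pivot_row[k]
    return chosen


# Toy data: rows 0 and 3 are scaled copies of row 1, row 2 is genuinely different.
toy_spectra = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [1.0, 0.0, 5.0],
    [0.5, 1.0, 1.5],
]
picked = select_unique_samples(toy_spectra, n_select=4)
print(picked)
```

In the toy data the scaled copies leave no residual after the first subtraction, so the loop stops early and fewer than four samples are returned.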

Dongsheng Bu (dbu)
Junior Member
Username: dbu

Post Number: 7
Registered: 6-2006

Posted on Wednesday, February 07, 2007 - 4:32 pm:

Dear Christian,

If you use MATLAB and Unscrambler, I would be very happy to discuss this with you.

I am not sure whether there is a commercial package for automatic selection of calibration samples.
If we are just talking about Mahalanobis distances, the MATLAB Statistics Toolbox has the mahal function (mahal.m). The page http://en.wikipedia.org/wiki/Hotelling's_T-square_distribution is also a good reference for coding.
I made a simple Mahalanobis script in MATLAB that calculates the distances from PCA or PLS scores (x holds the scores of the validation samples, X the scores of the calibration samples; for sample selection, you can set x = X):

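m = size(x, 1);                        % number of samples to evaluate
meanX = mean(X);                       % centroid of the calibration scores
covX = cov(X);                         % covariance of the calibration scores
d = x - repmat(meanX, m, 1);           % deviations from the calibration centroid
% Squared Mahalanobis distance per sample (row-wise, without forming the m-by-m product)
D = sum((d * pinv(covX)) .* d, 2);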
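
For readers without MATLAB, the same calculation can be sketched in plain Python. This is a rough port under assumptions, not Dongsheng's script: the helper names `invert` and `squared_mahalanobis` are hypothetical, the scores are passed as lists of rows, and an ordinary inverse replaces `pinv`, so the covariance of the score columns must be non-singular (normally the case for PCA scores).

```python
def invert(matrix, tol=1e-12):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < tol:
            raise ValueError("covariance matrix is singular; use fewer score columns")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def squared_mahalanobis(x, X):
    """Squared Mahalanobis distance of each row of x from the centroid of X."""
    n, k = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(k)]
    # Sample covariance with the (n - 1) divisor, as MATLAB's cov uses by default.
    cov = [[sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in X) / (n - 1)
            for b in range(k)] for a in range(k)]
    inv = invert(cov)
    distances = []
    for row in x:
        d = [row[j] - mean[j] for j in range(k)]
        distances.append(sum(d[a] * inv[a][b] * d[b] for a in range(k) for b in range(k)))
    return distances


# Hypothetical two-component scores; for sample selection x and X are the same set.
scores = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0], [0.0, 0.0]]
d2 = squared_mahalanobis(scores, scores)
# Rank samples from most to least extreme as a starting point for selection.
ranked = sorted(range(len(d2)), key=lambda i: d2[i], reverse=True)
print(d2, ranked)
```

A quick sanity check when x = X: the squared distances should sum to (n - 1) times the number of score columns.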

Computing the limit at a given confidence level could be tough to code. I remember that PLS_Toolbox has tsqlim.m.
Unscrambler has Hotelling T2 and Hotelling T2Lim matrices available after you save a PCA or PLS model. Hotelling T2 is closely related to the squared Mahalanobis distance. You can turn on plotting of the Hotelling T2 ellipse and re-calculate with the selected samples.
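
For orientation, the textbook form of the limit for a new sample judged against n calibration samples with k score columns is given below. The symbols are assumptions introduced here (t_i is the score vector of sample i, t-bar and S are the mean and covariance of the calibration scores, and F is the upper quantile of the F distribution at confidence level 1 - alpha); the packages mentioned may use a slightly different expression, so check their documentation before comparing numbers.

```latex
T_i^2 = (\mathbf{t}_i - \bar{\mathbf{t}})^{\top} \mathbf{S}^{-1} (\mathbf{t}_i - \bar{\mathbf{t}}),
\qquad
T^2_{\lim} = \frac{k\,(n^2 - 1)}{n\,(n - k)}\; F_{1-\alpha}(k,\; n - k)
```

The first expression is the squared Mahalanobis distance computed on the scores, which is why the two quantities are so closely related; the only part that needs a statistics library is the F quantile.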

We know that sample selection depends on other factors as well, such as sampling information, variable selection, outliers in the X-Y relation, how representative the y values are, and the sample residuals in the model. Even stdsslct, used in calibration transfer (PDS) and available in PLS_Toolbox and Unscrambler, would be a good source.

Please share your opinion and results.

Best regards,
Dongsheng

Christian Mora (cmora)
New member
Username: cmora

Post Number: 3
Registered: 2-2007

Posted on Monday, February 05, 2007 - 11:52 am:

Hello Tony;
The number of scanned samples can vary from 500 up to 2,500 (depending on the study). From those I have to select a subset for calibration modeling.
Christian

Tony Davies (td)
Moderator
Username: td

Post Number: 140
Registered: 1-2001

Posted on Monday, February 05, 2007 - 9:15 am:

Hello Christian,

I am pleased to know that you have the book (4Ts)! You didn't answer my question about the number of samples. There are some dangers that I wanted to warn you (and other readers) about.
Best wishes,
Tony

Christian Mora (cmora)
New member
Username: cmora

Post Number: 2
Registered: 2-2007

Posted on Saturday, February 03, 2007 - 7:43 pm:

Hi Tony;
Thanks for the reference (I have that book). What I'm doing is using different algorithms to get a final idea of the number of samples required for calibration models (for several datasets). So, basically, I'm exploring each algorithm piece by piece to get a better understanding of how the samples are selected.
Thanks
Christian

Tony Davies (td)
Moderator
Username: td

Post Number: 139
Registered: 1-2001

Posted on Saturday, February 03, 2007 - 2:43 pm:

Hello Christian,

This is advertising (but it is an NIRP book): you should read Chapter 15 of "A User Friendly Guide to Multivariate Calibration and Classification" (Naes et al.). It describes a method based on PCA.

I would be interested to know why you want to do this. Do you have a large number of un-analysed samples?

Best wishes,

Tony

Christian Mora (cmora)
New member
Username: cmora

Post Number: 1
Registered: 2-2007

Posted on Friday, February 02, 2007 - 9:58 pm:

Dear list members;

Dealing with the issue of sample selection for NIR calibration models, I'm wondering if somebody can point me to a reference or source for code (the language doesn't matter) for sample selection based on Mahalanobis distances, so as not to re-invent the wheel. I'm thinking of the algorithm proposed by Gerd Puchwein in 1988, or maybe the one published by Shenk and Westerhaus in 1991, but without using WinISI or similar software. I would like to test this selection algorithm on my own samples against others like Kennard-Stone or Duplex (for which I was able to figure out how to write a small program).

Thanks in advance

Christian Mora
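
For readers who want a baseline for the comparison Christian mentions, the Kennard-Stone max-min rule can be sketched as follows. This is not Christian's program: the function name `kennard_stone` is hypothetical, Euclidean distance on the raw coordinates is an assumption (scores or pre-treated spectra could be used instead), and the full distance matrix is held in memory, which is workable for a few thousand samples.

```python
import math


def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone max-min rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Start with the two points that are farthest apart.
    first, second = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                        key=lambda ij: dist[ij[0]][ij[1]])
    selected = [first, second]
    # nearest[i] = distance from point i to its closest already-selected point.
    nearest = [min(dist[i][first], dist[i][second]) for i in range(n)]
    while len(selected) < n_select:
        # Add the unselected point that is farthest from everything chosen so far.
        candidate = max((i for i in range(n) if i not in selected),
                        key=lambda i: nearest[i])
        selected.append(candidate)
        nearest = [min(nearest[i], dist[i][candidate]) for i in range(n)]
    return selected


# Hypothetical one-dimensional example: five samples on a line.
line_points = [[0.0], [1.0], [2.0], [3.0], [10.0]]
ks_order = kennard_stone(line_points, n_select=3)
print(ks_order)
```

The two extremes are taken first, then the sample that sits farthest from both, so the selection spreads over the occupied range instead of following the sample density.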