Division of sample set into calibration and validation sets

12. Division of sample set into calibration and validation sets.

The following question has generated four responses so far, so I thought I would send them out. If more responses come in, I will paste them on the end of these.

Original Question

When doing a calibration, the usual division of samples is two thirds in the calibration set and one third in the validation set. But why this division? Is there some statistical basis for doing this? And if one is using the spectra for more than one constituent, does this mean twice as many spectra should be in the validation and calibration sets? Or is there some cross-over between the two, such that the number of spectra should be about 1.5 times the number for a single constituent?

Responses in order received

From: Howard Mark

Bruce - It's mainly a matter of habit and convenience: you want as many samples as possible in the calibration set and, simultaneously, as many samples as possible in the validation set. Since the total number of samples is fixed, both desires cannot be satisfied at once, so the split is heuristic. As far as I know, nobody considers the 2:1 ratio to be written in stone; other divisions, such as 50-50, are also used. One solution, which I and others have recommended, is that after you've validated the model to whatever degree satisfies you, you combine the two sets and use all samples for calibration, using the same calibration conditions (factors, wavelengths, data transformations, etc.) as the model you arrived at.
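As a rough illustration of the split-validate-recombine idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: the synthetic data, the helper names (split_samples, fit_line, rmse), and the one-predictor least-squares line that stands in for whatever calibration method is actually used.

```python
import random
import statistics

def split_samples(n_samples, validation_fraction=1/3, seed=0):
    """Randomly assign sample indices to calibration and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * validation_fraction)
    return sorted(indices[n_val:]), sorted(indices[:n_val])

def fit_line(x, y):
    """Least-squares fit of y = a + b*x; a stand-in for the real calibration."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def rmse(model, x, y):
    """Root-mean-square prediction error of the fitted line on (x, y)."""
    a, b = model
    return (sum((a + b * xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)) ** 0.5

# Synthetic data (hypothetical): 30 samples, one absorbance value each.
rng = random.Random(1)
absorbance = [rng.uniform(0.0, 1.0) for _ in range(30)]
reference = [5.0 + 2.0 * a + rng.gauss(0.0, 0.05) for a in absorbance]

# Step 1: 2:1 random split, calibrate on two thirds, check on one third.
cal_idx, val_idx = split_samples(len(absorbance))
cal_model = fit_line([absorbance[i] for i in cal_idx],
                     [reference[i] for i in cal_idx])
validation_error = rmse(cal_model,
                        [absorbance[i] for i in val_idx],
                        [reference[i] for i in val_idx])

# Step 2: once satisfied, refit on ALL samples with the same model form.
final_model = fit_line(absorbance, reference)
print(len(cal_idx), len(val_idx), validation_error, final_model)
```

Using a seeded random generator also makes the selection reproducible, which may help if the randomness of the split ever has to be demonstrated.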

A related question, which we might want to throw out to the group, and which arose out of my work with Gary Ritchie, is: how can we verify independently that a given validation set of samples is a legitimate (I hesitate to use the redundancy "valid") validation set for the calibration set at hand? In theory, random selection tells us that this should be so, but first of all, how many people choose their validation samples in a verifiably random manner, and secondly, when dealing with regulatory bodies (such as the FDA), an INDEPENDENT test method boosts your case.



An addition to Howard's last paragraph, to be considered at the same time: How many researchers take spectra of the calibration and validation sets over the space of several days? Or do replications over days or weeks? Doing this gives a better statistical basis, because it includes instrument and condition variability. Also, it is easier to take validation and calibration spectra at the same time, but statistically, it may be better to separate the two into separate sessions.



From: "Jim Reeves, NCML, B-200"

I'm probably wrong, but I'd say it's based on carryover from old software, the number of samples available, statistical considerations, etc. In the original software I used, Westerhaus and Shenk, for the PDP-11 computers, you had to have something like 9 or 10 samples for each wavelength to be used, so you needed somewhere around 100 samples to use 9 wavelengths, the max allowed. Second, you chose the validation set as every nth sample: 2nd, 3rd, 4th, etc. Combining the two and wanting as large a validation set as possible, one often used 1/3. Hope this makes some sense. Using 1/2 as the validation set required more samples than people often had, and 1/4 gave too small a validation set. Finally, the number had nothing to do with the number of constituents, since each calibration was independent of the next. I also know statisticians who would argue that one should have a lot more than 10 samples for each wavelength or factor used, and more samples than wavelengths.

Jim Reeves
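The arithmetic in Jim Reeves's reply can be sketched in a few lines of Python. The function names and the figure of 135 total samples are hypothetical, chosen only so that a 1/3 validation set leaves enough calibration samples for 9 wavelengths at 10 samples per wavelength; this is not a reconstruction of the original PDP-11 software.

```python
def every_nth_split(sample_ids, n=3):
    """Send every nth sample (the nth, 2nth, ...) to the validation set."""
    validation = sample_ids[n - 1::n]
    calibration = [s for i, s in enumerate(sample_ids) if (i + 1) % n != 0]
    return calibration, validation

def max_wavelengths(n_calibration, samples_per_wavelength=10):
    """Largest number of wavelengths allowed by a samples-per-wavelength rule."""
    return n_calibration // samples_per_wavelength

sample_ids = list(range(1, 136))               # 135 hypothetical samples
calibration, validation = every_nth_split(sample_ids, n=3)
print(len(calibration), len(validation))       # sizes of the two sets
print(max_wavelengths(len(calibration)))       # wavelengths the rule permits
```

With n=2 the same pool would leave far fewer calibration samples per wavelength, and with n=4 the validation set shrinks, which is the trade-off described above.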

From: JIM HERMILLER

Bruce, here's my quick Monday morning pre-coffee response to your sample set size questions. I'm a strong proponent of cross-validation, as this procedure most efficiently uses the information in the samples. Cross-validation is like having all of your samples in the calibration set and all of your samples in the validation set, which is better than having two-thirds of your samples in the calibration set and one-third of your samples in the validation set. After cross-validation, the standard error of prediction (or predicted residual error sum of squares or whatever) is an excellent measure of the quality of the calibration, because all of the samples are included. Two cautions come to mind, though. First, if replicate spectra or analyses are present, then all replicates must be treated as a single sample during cross-validation. Second, before putting a calibration into the field, a second independent sample set should still be used to check the calibration.

Cross-validation is often associated with factor-based methods like partial least squares and principal components regression, though I don't know why. Cross-validation is applicable to any calibration technique, including the traditional simple and stepwise regressions.

Jim Hermiller
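A minimal sketch of the cross-validation described above, with the replicate caution built in: all spectra that share a sample ID are held out together. The data, the function names, and the one-predictor line fit are assumptions made for illustration; the same loop would apply to any calibration technique. The standard error here divides PRESS by the total number of spectra, which is one of several conventions in use.

```python
import statistics
from collections import defaultdict

def fit_line(x, y):
    """Least-squares fit of y = a + b*x; a stand-in for any calibration method."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def grouped_cross_validation(x, y, sample_ids):
    """Leave one physical sample out at a time, replicates included.

    Returns (PRESS, standard error of cross-validation).
    """
    groups = defaultdict(list)
    for i, sid in enumerate(sample_ids):
        groups[sid].append(i)

    press = 0.0
    for held_out in groups.values():
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        # Accumulate squared prediction errors for the held-out replicates.
        press += sum((a + b * x[i] - y[i]) ** 2 for i in held_out)

    secv = (press / len(x)) ** 0.5  # assumption: divide by number of spectra
    return press, secv

# Hypothetical data: five samples, each scanned in duplicate.
x = [0.10, 0.12, 0.30, 0.29, 0.52, 0.50, 0.71, 0.69, 0.90, 0.91]
y = [5.2, 5.2, 5.6, 5.6, 6.1, 6.1, 6.4, 6.4, 6.8, 6.8]
ids = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5"]

press, secv = grouped_cross_validation(x, y, ids)
print(press, secv)
```

If replicates were instead left out one spectrum at a time, the twin of each held-out spectrum would remain in the calibration and the error estimate would tend to look better than it should.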

From: Karl Norris


I am one who often uses two thirds for calibration, but I don't know of any statistical justification for this choice. I choose two thirds because I wish to get enough samples into the calibration, so that I will limit the problem of overfitting. If I have several hundred samples then I can use a 50/50 split. I use the same split for each constituent, and I don't see that this requires any additional samples.


From: "Emil W. Ciurczak"

Subject: Re: Sample set size

In the case of pharmaceutical samples, we have a slightly different problem: too many samples, each nearly identical to its neighbor. To get a reasonable set of calibration samples, we have to scan hundreds (dare I say thousands?) of tablets, then use some scatter-plot software to give us an indication of what might be a broad enough set for equation generation. Seldom does one production lot produce enough variation to build a robust model, but one we do build. The interim model is then used to sort a second, then a third, up to an Nth lot for a spread of values. These different tablets are then assayed by the official method and a new equation generated.

Over time, redundant samples are weeded out to give the desired boxcar distribution of values. Outliers are treasured in our industry: they are immediately added to the calibration set and a new equation generated. Bottom line: we use 100 for a calibration, then 100 of another set for cross-validation.
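One way to picture the weeding-out step that leads to a boxcar (roughly uniform) distribution is to bin the samples by value and cap the number kept per bin. The sketch below is only an assumption about how this could be done; the function name boxcar_select, the bin count, the per-bin cap, and the simulated tablet values are all hypothetical.

```python
import random
from collections import defaultdict

def boxcar_select(values, n_bins=10, per_bin=10, seed=0):
    """Keep at most per_bin samples in each equal-width bin of values.

    Returns the sorted indices of the samples that are kept.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # avoid zero width if all values match
    bins = defaultdict(list)
    for i, v in enumerate(values):
        b = min(int((v - lo) / width), n_bins - 1)
        bins[b].append(i)

    rng = random.Random(seed)
    keep = []
    for b in sorted(bins):
        members = bins[b]
        if len(members) <= per_bin:
            keep.extend(members)        # sparse bins (e.g. outliers) kept whole
        else:
            keep.extend(rng.sample(members, per_bin))
    return sorted(keep)

# Hypothetical predicted values for 1000 nearly identical tablets.
rng = random.Random(42)
predicted = [rng.gauss(100.0, 2.0) for _ in range(1000)]

selected = boxcar_select(predicted, n_bins=10, per_bin=10)
print(len(selected))                    # at most 10 bins x 10 samples
```

Crowded bins near the target value are thinned at random, while sparsely populated bins at the extremes are kept in full, which matches the idea that unusual tablets are the valuable ones.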