NIR Discussion Forum » Bruce Campbell's List » Chemometrics » Model selection without direct error prediction calculations?

Forrest Stout (forrest)
Junior Member
Username: forrest

Post Number: 8
Registered: 7-2006
Posted on Tuesday, July 04, 2006 - 3:01 pm:   

Thanks for the feedback, Gabi. I'll look into whether my calibration samples capture sufficient variability.

Gavriel Levin (levin)
Junior Member
Username: levin

Post Number: 7
Registered: 1-2006
Posted on Monday, July 03, 2006 - 11:30 pm:   

Hi,

Forrest, I read your response to my comments. Yes, I suspected that you don't have enough samples, but not being sure, I didn't raise it directly. In my comments I stress the importance of representing the variability with a large enough sample base; for LWR you also need a large "library" from which to choose the nearest neighbours.
I will give you an example: in some distillations the number of low-level contaminants can be as high as 20, and their concentrations vary from batch to batch. This introduces "everlasting" changes in the spectrum that have no relation to, say, the two constituents you are trying to measure. I now have a calibration and a validation set of over 150 data points each, and periodically we "update" the model and test it on the original validation set to verify that it still predicts old compositions correctly should one appear again.

In this regard I don't know what the optimal number of PC's is for any given sample.

I can say the following - and this is an observation I have seen many times in feasibility studies:

You have a set of, say, 50 samples; you run PLS1 and get a reasonable or even nice fit. The number of PC's (if using a 1st derivative) is, say, 4. Then, in order to better evaluate the prediction capability, you remove 5, build a calibration from the remaining 45, and predict them. You repeat this procedure 10 times until you have predicted all 50.

It does happen that the number of PC's for different groups of 45 samples will differ. This is only because the set is too small: some groups of 45 samples contain less variability and require a smaller number of PC's. The trouble is that such a model, when it predicts its 5 "unknowns", usually predicts worse than the models for the other groups.
But as soon as we add samples to the base, this phenomenon disappears, and once you have over a hundred, you can remove groups of 8 samples, perform PLS1 on each remaining 92-sample set, and the number of PC's will be the same for all groups.
When I do a feasibility study and the above variability in PC's occurs, the first thing I check is the loading weights for the first important 3 to 4 PC's (unless the model uses fewer, of course). If the loadings remain the same, i.e. the peaks still relate to the important information, then I am not worried; but if the loadings for the different groups of 45 samples change, then I am really worried, and I usually ask for more samples.
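Gabi's remove-5-of-50 procedure can be sketched as a simple partition of the calibration set. This is a hypothetical illustration of the bookkeeping only (function and variable names are mine, not any package's API); the PLS1 fit itself would come from your own software:

```python
def leave_group_out_splits(n_samples, group_size):
    """Partition sample indices into disjoint held-out groups: remove a group,
    calibrate on the rest, predict the group, and repeat until every sample
    has been predicted exactly once."""
    indices = list(range(n_samples))
    splits = []
    for start in range(0, n_samples, group_size):
        held_out = indices[start:start + group_size]
        calibration = indices[:start] + indices[start + group_size:]
        splits.append((calibration, held_out))
    return splits

# Gabi's example: 50 samples in groups of 5 -> 10 calibrations of 45 samples each
splits = leave_group_out_splits(50, 5)
```

Running PLS1 on each calibration subset and comparing the chosen number of PC's (and the loading weights) across the splits is exactly the stability check described above.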

One important piece of information regarding the MFI I mentioned (MFI being the melt flow index of polypropylene): the value of MFI depends on more than just the length of the polymer chains. Thus, when you collect spectra from, say, 200 different samples of PP, you find that you can get a very nice R value of .98 or .99, and when you check the loadings you do see a very strong contribution from the C-H stretches, which is nice. However, because there can be a large difference in MFI values due to other factors while the spectrum stays the same, there is no way to truly use NIR for MFI measurements. The only way to do it is to sort the different grades (a grade is a definition of a set of properties determined mostly by the processing parameters used in the polymerization) into groups of grades whose other MFI-influencing factors are similar, and build an individual model for each group.

However, for people who produce hundreds of different grades this becomes a horrendous job, and it is usually not done. It also introduces an operational hurdle: normally, when we build on-line prediction profiles, we allow the software to choose the PLS1 model by itself, based on the spectrum, but here we cannot do that. The operators would have to manually select the correct model every time they change to a different grade, which can happen in a continuous mode of operation. In short, an operational nightmare.
This is also why LWR doesn't work in such situations: the nearest neighbours you choose can be associated with large variability in MFI because of the influence of the other factors.

Well, what can we do - life is a ...


Thanks,

Gabi Levin
Brimrose

Forrest Stout (forrest)
Junior Member
Username: forrest

Post Number: 7
Registered: 7-2006
Posted on Monday, July 03, 2006 - 2:25 pm:   

Well, I looked at some old data sets that I've worked on in the past (corn moisture and gas octane), and they both show fairly similar trends:
1) The optimal number of factors based on individual sample prediction error (vs. SEP over all prediction/cross val samples simultaneously) varies wildly.
2) SEPs are dramatically smaller (e.g. ~1/3 of original SEP) if the optimal model for each sample is employed (versus one model for all predictions).

This leads me to think that my current data is not abnormal in this regard. There may be some way of performing unknown-spectrum-dependent factor selection, but this may be more of a broad and complex chemometric problem than simply a quirk of my current data set. So, I doubt I will do any further active investigation, but I do encourage curious parties to look at their validation/cross-validation results after such per-sample factor selection, even if current methods make this impossible for true unknown predictions. This may be an important research area in chemometrics.
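For anyone who wants to reproduce the comparison Forrest describes, here is a minimal sketch. The error matrix is assumed to come from your own PLS runs; the function names are illustrative, not from any package:

```python
import math

def sep(errors):
    """Root-mean-square of a list of prediction errors (an SEP/RMSEP)."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def global_vs_per_sample(err):
    """err[i][k] = absolute prediction error of validation sample i using the
    (k+1)-factor model. Compares the best single-model SEP over all samples
    with the SEP obtained when each sample uses its own best factor count."""
    n_factors = len(err[0])
    best_global = min(sep([row[k] for row in err]) for k in range(n_factors))
    per_sample = sep([min(row) for row in err])
    return best_global, per_sample

# toy error curves for three validation samples and three factor counts
err = [[0.9, 0.2, 0.5],
       [0.1, 0.8, 0.4],
       [0.6, 0.3, 0.1]]
g, p = global_vs_per_sample(err)
```

By construction the per-sample figure can never be worse than the global one; the open question in this thread is how to get that gain without using the reference values.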

For now, I think I'll give that locally weighted regression a shot.

Dongsheng Bu (dbu)
New member
Username: dbu

Post Number: 3
Registered: 6-2006
Posted on Monday, July 03, 2006 - 11:47 am:   

Hi Forrest,

When new spectra are projected against a model, the Unscrambler provides results such as prediction uncertainty, spectral leverage (distance to the model), spectral residuals, etc. Reference values are not required, and those result matrices can be used to estimate prediction reliability. Of course, they can also be used for data QC, model selection, or PLS factor selection. There are Matlab scripts from people outside Camo that take a similar approach; I have also tried it myself in Matlab.
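The reference-free diagnostics Dongsheng mentions can be sketched in a few lines. This assumes mean-centred data and orthonormal PCA loadings; all names are illustrative, not the Unscrambler's or anyone's actual API:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def diagnostics(spectrum, mean, loadings, score_var):
    """Project a new (unlabelled) spectrum onto orthonormal PCA loadings and
    return (leverage, spectral_residual). score_var[a] is the variance of the
    calibration scores on PC a, so the leverage is a Mahalanobis-type
    distance of the new scores to the model centre. No reference value is
    needed for either quantity."""
    centered = [x - m for x, m in zip(spectrum, mean)]
    scores = [dot(centered, p) for p in loadings]
    leverage = sum(t * t / v for t, v in zip(scores, score_var))
    recon = [sum(t * p[j] for t, p in zip(scores, loadings))
             for j in range(len(centered))]
    residual = sum((c - r) ** 2 for c, r in zip(centered, recon))
    return leverage, residual
```

A spectrum that lies inside the model subspace gives a near-zero residual; a large residual or leverage flags an unreliable prediction.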

Regards,
Dongsheng

Forrest Stout (forrest)
Junior Member
Username: forrest

Post Number: 6
Registered: 7-2006
Posted on Monday, July 03, 2006 - 10:35 am:   

I actually read Gabi's last post before Howard's last post but I didn't post a comment because of the late hour.

Howard, yep, I caught the misunderstanding.

Gabi, I'm familiar with examining SECV/SEP plots and such to find the point at which it is assumed that noise is being fitted into the modeling process. I'm also aware of the dangers of essentially overfitting a method/algorithm for a data set and then going out into the real world and getting terrible/unpredictable results due to modeling that is too specific to the exact calibration/cross-val samples used. At the same time, I am curious about a technique to pick factors based on unknown spectra, if this is even possible and even if it is not tried and true enough to use in the near future.

Gabi, in response to your statement:
"Build a quantitative model PLS1, use enough samples to represent the possible process variability so that the loadings for each PC of the optimal number of PC's will be adjusted for the process variability; then you will find that the number of PC's doesn't jump all over the place."

Have you ever looked at plots of prediction error versus number of factors used for a single validation sample? Then, have you ever checked to see which factor number model would give the best prediction for each individual validation sample? (I.e. rather than a sum error calc, like SECV or SEP.) The reason I ask is that I had not, until last week in an unrelated exercise, because I just assumed all individual sample optimal factor numbers would be in a tight range, except of course for outlier samples, which as you noted can be qualitatively filtered out. (I've done previous analysis on a variety of different data sets and I'm going to quickly revisit some of those to look at this problem.)

Just to double check your initial solution, would you guess that I have too few samples in my calibration set?

Howard and Gabi, thanks for your input!

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 32
Registered: 9-2001
Posted on Monday, July 03, 2006 - 9:10 am:   

I guess I need to expand on what I said before, since Gabi, at least, seems to have at least partially misunderstood me.

First, regarding "routine" versus "research": if anything I agree with Gabi and was trying to reinforce the idea that for your routine analysis, you should use the tried-and-true methods, whether in a regulated environment or not.

What I was trying to suggest was that AT THE SAME TIME as you are using the safe methods for your routine analysis, you can, if you have the inclination, time and resources, try to develop new approaches to analysis. However, these new approaches are risky and should not be used for your routine analysis until they have been validated very well by yourself and preferably also by other scientists.

My second point, which Gabi misinterpreted, is that the reconstruction I spoke about is the reconstruction of the data spectra from which the Principal Components are calculated; Gabi is 100% correct in stating that adding PCs does not endlessly improve the prediction of the constituent composition.

The reason for the difference is that the first few PCs that are calculated, model the systematic "real" spectral changes in the samples due to compositional (and physical) variations. At some point, usually fairly early in the process, all that systematic variation has been modeled, and all that is left in the spectra is the random variations, i.e., the noise. That is the point at which you will not get any further improvement in the model's ability to give accurate results, and indeed, the performance for calculating accurate constituent information will normally start to get worse beyond that. During calibration, this point can be detected by monitoring the usual calibration statistics: SEC, SECV, etc.

Continuing to calculate PC's after that point, however, will indeed continue to model the noise better and better, until, as I stated, when you've calculated as many PC's as you have spectra (or wavelengths in the spectra, whichever is less) you will get a perfect fit to the spectra, with zero error. That type of model, however, will almost certainly be useless for predicting composition.

Since your original question was predicated on the assumption of not using constituent information, however, fitting already-calculated PCs to the spectra of unknowns is almost the only route available to you. When doing this, you have a guideline: the number of PC's you used for the original calibration (for which you used concentration information) can serve as an upper limit to the number of PC's you use for modelling the spectra of the unknowns; by doing that you ensure that you are not trying to model the noise.
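Howard's guideline can be sketched as follows: fit already-calculated (orthonormal, mean-centred) PC loadings to an unknown spectrum and track the reconstruction RMSE, capping the count at the calibration rank. The names are mine, chosen for illustration:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reconstruction_rmse(spectrum, loadings, max_pcs):
    """Return the RMSE of reconstructing an unknown (mean-centred) spectrum
    from the first 1..max_pcs orthonormal calibration loadings. Capping
    max_pcs at the rank used in the original calibration is the guard
    against starting to model the noise."""
    rmses = []
    for n in range(1, max_pcs + 1):
        used = loadings[:n]
        scores = [dot(spectrum, p) for p in used]
        recon = [sum(t * p[j] for t, p in zip(scores, used))
                 for j in range(len(spectrum))]
        err = [x - r for x, r in zip(spectrum, recon)]
        rmses.append(math.sqrt(dot(err, err) / len(err)))
    return rmses
```

As the mathematics in this thread says, the RMSE can only fall as PCs are added; the cap, not the fit itself, is what keeps the exercise meaningful.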

Your problem then would be to find some measure, based on those PCs, that is related to the composition. Of course we already know one: the dot product between the PC scores and the calibration coefficients obtained from the original calibration. This is what is ordinarily used when doing a prediction with a PC calibration, although sometimes that fact is hidden by having some of the calculations pre-computed.
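The dot product Howard refers to is literally a one-liner; the intercept covers the usual mean-centred case (names are illustrative):

```python
def predict(scores, coefficients, intercept=0.0):
    """A PC/PLS prediction is the dot product of the sample's scores with the
    regression coefficients from the original calibration, plus an intercept."""
    return intercept + sum(t * b for t, b in zip(scores, coefficients))

# e.g. scores [1.0, 2.0], coefficients [0.5, 0.25], intercept 10.0
value = predict([1.0, 2.0], [0.5, 0.25], 10.0)
```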

Your task, Forrest, should you choose to accept it, is to find OTHER ways to utilize the PC loadings to estimate the sample composition from the results of the PC analysis. Just watch out for the caveat that both Gabi and I warned you about, using a new method for routine analysis before it has been sufficiently tested.

Gabi wasn't in the NIR community in the early days, when the only calibration method we had was MLR. If we were all afraid then of trying out new algorithms and calibration methods, we wouldn't enjoy the benefits of PLS and PCR now. But it took a long time, and a long learning curve to get from that point to where we are now. So I think we needn't be afraid to try new things, but we do need to be careful in learning about both the benefits and downsides of a new technique.

Howard

\o/
/_\

Gavriel Levin (levin)
Junior Member
Username: levin

Post Number: 6
Registered: 1-2006
Posted on Sunday, July 02, 2006 - 10:42 pm:   

Hi Forrest,

I am sorry to disagree with Howard about sticking to the safe. It is not just that; it is the level of trust you need that you will always have a correct prediction, and that you will always be able to cope with the small changes in low-level contaminants that are always part of life. In a pure lab environment, where you create samples from perfect, always-the-same compounds from the same lots, it is all nice to play, but the process out there does not play by the same rules.

If you have a model based on a large number of spectra, and there are differences that do not relate to changes in your analyte, the coefficients in the PLS1 "learn" how to "ignore" them and still predict well, within the established SEP.
In what you are trying to do, always building a different calibration based on what you have now, you allow the model to forget what it knows and create a calibration again. This is basically LWR (locally weighted regression):
You have a reservoir of spectra; you collect a new one; you first find, say, the 30 nearest ones, by Mahalanobis distance, by least squares, or by PCA; and you use those, with their Y values, to create a model, predict, throw it away, and start all over again.
As I said, this did not work well for us, and I am not aware that it really works.
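The LWR loop Gabi describes can be sketched as below. To keep the sketch self-contained, the throwaway local model is a distance-weighted mean of the neighbours' Y values, the simplest stand-in for a local regression; a real LWR would fit a weighted PLS or linear model on the neighbours instead. All names are hypothetical:

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def lwr_predict(new_spectrum, library_spectra, library_y, k=30):
    """'Disposable calibration': rank the library by squared Euclidean
    distance to the new spectrum, keep the k nearest, build a throwaway
    local model (here an inverse-distance-weighted mean), predict, and
    discard. A fresh model is built for every unknown."""
    ranked = sorted(range(len(library_spectra)),
                    key=lambda i: sq_dist(new_spectrum, library_spectra[i]))
    nearest = ranked[:k]
    weights = [1.0 / (1e-12 + sq_dist(new_spectrum, library_spectra[i]))
               for i in nearest]
    return sum(w * library_y[i] for w, i in zip(weights, nearest)) / sum(weights)
```

The failure mode for MFI noted above shows up directly here: if near neighbours in spectral space carry widely different Y values, no choice of local model can rescue the prediction.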

The question is always - does it really cut my SEP? Our experience was that it did not.

To the point made by Howard: the addition of PC's does not endlessly improve the ability to predict. In fact, the way we determine how many PC's to use is from the cross validation in the Unscrambler: you add them until the variance of the cross validation starts to grow again and distance itself from the calibration variance. You can see when this happens: the SECV (or SEP) increases as PC's are added, the slopes of the SEC and SECV curves diverge more and more, and the intercept difference grows. Usually this happens when you start modeling noise and are overfitting.
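The stopping rule in the last paragraph amounts to picking the first minimum of the cross-validated error curve. A minimal sketch, with a hypothetical function name and the SECV values assumed to come from your own cross-validation run:

```python
def pick_n_pcs(secv):
    """secv[k] = SECV obtained with k+1 PC's. Add PC's while the
    cross-validated error keeps falling; stop at the first minimum,
    i.e. just before SECV starts to rise (the onset of overfitting).
    Returns the chosen number of PC's."""
    for k in range(1, len(secv)):
        if secv[k] > secv[k - 1]:
            return k          # k PC's gave the minimum before the rise
    return len(secv)          # no rise seen: use all available PC's

# SECV falls, bottoms out at 4 PC's, then rises again
chosen = pick_n_pcs([0.90, 0.60, 0.45, 0.44, 0.50, 0.61])
```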


Thanks,

Gabi

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 31
Registered: 9-2001
Posted on Sunday, July 02, 2006 - 7:17 pm:   

Well, research is always risky, that way: you never know what you'll learn - or if you'll learn anything at all!

Howard

\o/
/_\

Forrest Stout (forrest)
New member
Username: forrest

Post Number: 4
Registered: 7-2006
Posted on Sunday, July 02, 2006 - 6:17 pm:   

That is exactly my goal in attempting this unknown-spectrum-dependent model selection: to push the envelope on the tried-and-true methods, because I see the potential for cutting my SEP down ~66%. At the same time, it might be a futile effort. As someone who is not familiar with the chemometric literature out there, I was curious whether such methods have been successfully explored.

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 29
Registered: 9-2001
Posted on Sunday, July 02, 2006 - 5:57 pm:   

Forrest - re #1: since PCA (and PLS also) is by definition a least-squares curve-fitting method, you will of necessity improve the fit with each successive factor you add, up to the point where you have included as many factors as you have spectra, at which point the fit becomes exact. That's all known from the mathematics of the algorithm. The idea was to come up with a way of selecting a finite number of factors that could relate some aspect of that procedure to the sample composition. But mainly the point of my making this recommendation, as with all the recommendations I made, was to spark some "thinking outside the box" (as they call it), both on your part and on the part of everyone monitoring the discussion; maybe someone can come up with a reasonable way to implement it.

Re #3: Marbach published his work in JNIRS 13, p. 241-254 (2005), with a correction published in JNIRS 13, p. 377 (2005). In it he refers to an earlier paper he published with a much more extensive development of the mathematical underpinnings.

I tend to agree with Gabi: for commercial application to a routine analysis, sticking with the "tried and true" is the safest way to proceed. On the other hand, if you have the inclination and resources to pursue alternate methods of analysis as well, then maybe we can all learn new methods. After all, this is what research is all about, and that's how Science proceeds.

Howard

\o/
/_\

Gavriel Levin (levin)
New member
Username: levin

Post Number: 5
Registered: 1-2006
Posted on Sunday, July 02, 2006 - 4:13 pm:   

Dear All,

In many ways this sounds to me like shooting the dart, then drawing the circle around it.
I am sorry, but fooling around with the number of factors (Principal Components from the Unscrambler?) is something I am not used to doing.

Locally weighted regression has not carved great inroads. We tried it in the past for MFI in polypropylene; it is not something you put to quantitative work in a pharma environment. Maybe others have.

My suggestion: get the Unscrambler and build a PCA model from real process samples (you can start off-line for feasibility, but then you want real samples for the calibration and the validation).

Build a quantitative model PLS1, use enough samples to represent the possible process variability so that the loadings for each PC of the optimal number of PC's will be adjusted for the process variability; then you will find that the number of PC's doesn't jump all over the place. Then you need predictor software that first looks at the new spectrum and calculates its distance to the PCA model (the prediction mode of the Unscrambler does this, based on the PCA model). If the spectrum meets your criterion of match to the model, it is accepted; if not, it is declared an outlier. Only then is the value calculated from the PLS1 model.

If you do that, you know that you are always applying the correct coefficients to the relevant wavelengths in the identification stage (the PCA distance calculation), and you know that you are using the correct coefficients in the quantitative model, the PLS1 model.
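Gabi's two-stage scheme (PCA distance gate, then PLS1 prediction) can be sketched as below. The two model functions and the threshold are placeholders for whatever your predictor software supplies; nothing here is the Unscrambler's actual API:

```python
def gated_predict(spectrum, pca_distance, pls1_predict, threshold):
    """First score the new spectrum's distance to the PCA model; only if it
    matches (distance at or below the chosen criterion) is the quantitative
    PLS1 value computed. Otherwise the sample is declared an outlier and no
    prediction is issued."""
    d = pca_distance(spectrum)
    if d > threshold:
        return None, d        # outlier: distance exceeds the match criterion
    return pls1_predict(spectrum), d

# hypothetical stand-ins for the two calibrated models
value, dist = gated_predict([1.0, 2.0],
                            pca_distance=lambda s: 0.5,
                            pls1_predict=lambda s: 42.0,
                            threshold=1.0)
```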

I am not truly aware of short cuts to the rigorous method.

I hope I helped a little bit.

Gabi Levin
Brimrose
[email protected]
If you write to me I will be able to provide more information.

Forrest Stout (forrest)
New member
Username: forrest

Post Number: 2
Registered: 7-2006
Posted on Sunday, July 02, 2006 - 2:30 pm:   

In reply to Michel:
I'm using Matlab.
My end desired result is a quantitative prediction.
I have been looking at this "disposable" calibration approach, which led me to this problem of attempting to pick the optimal number of PLS factors for each individual validation sample (while the calibration samples used remain static). (I know this sounds crazy, but my goal in posting was to see whether others have looked at this in the past.)

In reply to Howard:
End goal -> quantitative (analyte concentration from NIR spectra).

Well, I DO have the reference values. That is how I noticed that if I pick the PLS model (already calculated from the calibration data) yielding the lowest prediction error for each validation sample, I see two interesting things:
First, my SEP (RMSEP) is cut to ~1/3 of its value versus a single model for all validation samples. (Of course I suspected an SEP drop, but not to this degree.)
Second, the optimal PLS factor number for each sample varies wildly, with many excellent predictions both at 1 factor and at full-factor models. I expected the validation-sample-dependent optimal factor number to lie in a fairly tight range around the optimal factor number from the RMSECV, etc., calculations. (E.g., if the 20-factor model gives the minimum RMSECV, then I'd expect the model giving the minimum prediction error for each sample to be in the range of 20 +/- 10 factors. Instead I see a very wide distribution of optimal factor numbers.)

So, I'm seeing if I can perform this same sort of process without model selection based on a direct calculation of prediction error, because that is impossible for true unknowns.

1) That was my initial thought. So far, validation-spectrum reconstruction is best achieved with all calibration factors (loading vectors), so I'd consistently be picking the full-factor model. ...I think I can change what I'm doing to make this approach work, but I haven't yet figured out exactly how.

2) This is another possibility, but I'm hoping to keep the calibration samples static and only choose which of the available PLS models to apply, rather than which calibration samples to use.

3) Do you have a suggested reference for this?

Thanks for the feedback!

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 28
Registered: 9-2001
Posted on Sunday, July 02, 2006 - 6:46 am:   

Forrest - Michael makes some good points.

The first one is to bring up the question of whether you want to do quantitative or qualitative analysis. Your mention of reference values implies that quantitative analysis is what you have in mind, but I think we need to make that explicit first.

A second question that comes to mind (assuming that the answer to the first question is "quantitative") is: IF you had the reference values, what criteria would you use to select the factors? This comes from thoughts of trying to come up with ways to do that same selection procedure without the reference values.

Hopefully answers to those questions will spark some ideas in my (or someone else's) head. In the meanwhile there are some approaches I can think of:

1) Use the spectra of the validation samples as surrogates for the sample compositions, and use the accuracy with which you can reconstruct those spectra (solely from the factors based on the calibration results) as an indicator of how well you could have predicted the composition. I'd suggest the mean square error or the root-mean-square error of the reconstruction process as the indicator. This is applicable only to full-spectrum calibration methods, such as PCR, PLS, FT, wavelets, etc.

2) I gave a talk at the 2004 Eastern Analytical Symposium where I showed how you can use qualitative analysis methods to do quantitative analysis. The idea was to group samples together based on their constituent concentration, then use a qualitative analysis method (Mahalanobis Distance) to identify which group an "unknown" sample belongs to. The more groups you divide the calibration data into, the closer the results approach those of a quantitative method.

3) You might also think of a way to apply Rolf Marbach's calibration algorithm (which needs at most one reference sample) to the validation process. Rolf only presented his idea in terms of calibration, but there may be a way to apply it to validation as well.
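Suggestion 2 above can be sketched in a few lines: bin the calibration samples by concentration, represent each bin by its centre in score space, and assign an "unknown" to the nearest centre. Plain Euclidean distance stands in for Mahalanobis Distance to keep the sketch self-contained, and all names are hypothetical:

```python
def group_means(scores, concentrations, n_groups):
    """Bin calibration samples by constituent concentration, then represent
    each group by (mean score vector, mean concentration)."""
    order = sorted(range(len(concentrations)), key=lambda i: concentrations[i])
    size = len(order) // n_groups
    groups = []
    for g in range(n_groups):
        idx = order[g * size:(g + 1) * size] if g < n_groups - 1 else order[g * size:]
        centre = [sum(scores[i][j] for i in idx) / len(idx)
                  for j in range(len(scores[0]))]
        groups.append((centre, sum(concentrations[i] for i in idx) / len(idx)))
    return groups

def classify(score, groups):
    """Assign an 'unknown' score vector to the nearest group centre
    (Euclidean distance as a stand-in for Mahalanobis) and report that
    group's mean concentration as the quantitative estimate."""
    best = min(groups, key=lambda g: sum((a - b) ** 2 for a, b in zip(score, g[0])))
    return best[1]
```

As Howard notes, the more groups you use, the finer the concentration steps and the closer this gets to a genuinely quantitative answer.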

In any case, however, as Michael said: if you're in a regulated environment you should expect the regulatory authorities to look askance at any approach that doesn't at least compare the validation results to "those of a well-characterized method, the accuracy of which is known and stated"

Howard

\o/
/_\

Michel Coene (michel)
Senior Member
Username: michel

Post Number: 35
Registered: 2-2002
Posted on Sunday, July 02, 2006 - 4:40 am:   

From your description it looks like you wish to do a qualitative classification (to which group does this spectrum belong?) followed by a quantitative prediction. Classification is normally done without looking at reference values. Both are quite common, though you might need to do some programming to automate the steps. CAMO makes both an online classifier and an online predictor, which you could use in a relatively simple Visual Basic program. MATLAB will obviously allow you to do this too. I am not familiar enough with the other packages, but I guess most will have some scripting capabilities. You also might want to look at the concept of Locally Weighted Regression, where you drop the calibration part altogether and make a "disposable" calibration for each unknown sample, based on a large database. This method is a little more "fuzzy" and might be harder to validate in a pharma environment.

Forrest Stout (forrest)
New member
Username: forrest

Post Number: 1
Registered: 7-2006
Posted on Sunday, July 02, 2006 - 1:46 am:   

Are there any methods out there for selecting a model (e.g. the number of PLS factors) based on the validation/cross-validation spectra but not the reference values? More specifically, I'm looking to build models as usual with calibration spectra and calibration reference analyte concentration values, but then pick which model to use based on some characteristic of the validation spectra. (Of course, this eliminates the common method of examining RMSECV, SEP, etc. Also, I want the model selection to be sensitive to the validation spectra.)

Any ideas are appreciated.
