Single vs duplicate scans in PLS

Niklas Warne
Posted on Wednesday, October 05, 2005 - 4:30 am:   

Hi,

Thanks for all comments and suggestions.

Regards
Niklas

Tony Davies (Td)
Posted on Monday, September 19, 2005 - 2:11 pm:   

Hi Dave!

You just beat me to it (I should have made the dog wait for her walk a little longer!).

It is pleasing to see how much we agree.

Best wishes,

Tony

Tony Davies (Td)
Posted on Monday, September 19, 2005 - 11:50 am:   

Hi Niklas,

Comments and Questions

Comments
1) You really are close to the limit with only 36 samples for a PLS calibration.
2) The large increase in the RMSEP (3 times the RMSEC) indicates over-fitting.
3) Twenty factors (in spite of Unscrambler's naming they are not PCs) is a large number; twenty factors with 36 samples is too many.
4) I assume you are using cross-validation. You need to make sure that you exclude both duplicates of a sample together when you run the 72 individual spectra (see the sketch after this list).
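
A minimal sketch of the cross-validation in point 4, assuming scikit-learn in place of Unscrambler (the array sizes mirror the 36-samples-in-duplicate design; the data here are synthetic stand-ins):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 700))          # stand-in for the 72 duplicate spectra
y = np.repeat(rng.normal(size=36), 2)   # one lab value shared by each pair

# Group labels 0,0,1,1,...,35,35 so that both scans of a sample are
# always excluded from the calibration together.
groups = np.repeat(np.arange(36), 2)

pls = PLSRegression(n_components=10)
y_cv = cross_val_predict(pls, X, y, groups=groups,
                         cv=GroupKFold(n_splits=36)).ravel()
rmsecv = np.sqrt(np.mean((y - y_cv) ** 2))
print(f"RMSECV: {rmsecv:.2f}")

With GroupKFold, each segment holds out one whole sample (both scans), which is the leave-out pattern recommended here.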

Questions
1) Do you know the repeatability of your reference method?
2) What answers do you get using 72 spectra and 10 factors?
3) What sort of range do you have compared to the error of the reference method? Is it 10 times that error, or more?

Best wishes,

Tony

David W. Hopkins (Dhopkins)
Posted on Monday, September 19, 2005 - 11:44 am:   

Niklas,

It appears that you have a lot of sampling error in the NIR, and it is likely that you have a lot of sampling error in the lab values too. Do you have an idea what the SEL for the lab procedure is?

If you still have the samples and they are stable, it might be a good idea to measure an additional 3 or 4 replicates of each, to get a better value of the repeatability of the NIR measurements. You may want to revisit your sampling procedure to be sure you are doing the best possible job. Maybe a grinding or stirring step would yield a measurement good enough to justify the added handling?

You may find that you will obtain more robust calibrations if you do not use the entire 400-2500 nm range. Usually the visible region just adds trouble to calibrations, unless you are measuring a variable that is strongly related to color (like chlorophyll content).
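
For instance, restricting the range is just a column mask; a sketch (the 2 nm grid and the 1100 nm cut-off are assumptions for illustration, not anything from Niklas's setup):

import numpy as np

wl = np.arange(400, 2502, 2)                  # hypothetical 2 nm wavelength grid
X = np.random.default_rng(0).normal(size=(72, wl.size))  # stand-in spectra
X_nir = X[:, (wl >= 1100) & (wl <= 2500)]     # drop the visible region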

It seems to me that the RMSEP is rather higher than I'd expect from the SEC, so I'd recommend looking at whether the validation sample set is representative of the calibration set. How many samples did you select for each set?

Did you leave out both duplicates of a sample at a time during your cross-validation? If you leave out just one scan at a time, its duplicate remains in the calibration set, and you will certainly obtain a better RMSECV than you should expect.

Hope these thoughts will help you improve your feasibility evaluation.

Best wishes,
Dave

hlmark
Posted on Monday, September 19, 2005 - 11:40 am:   

Niklas - I had some more thoughts, by way of explanation:

All the available software packages use the standard formulas, which are based on the statistical assumption that the Y-values (reference lab values) corresponding to each reading were measured independently. In that case there are indeed n degrees of freedom (DF) for the reference data, and the formulas used in the software packages, with n-1, or n-m-1, or whatever, in the denominator, rest on that assumption being true. But if there are only n/2 independent reference values, as in the case of your message, then the correct denominator should be (n/2)-1, or (n/2)-m-1, etc.

So with 20 PCs, the software is calculating, for example:

SEC = sqrt( sum (y - yhat)^2 / (n - m - 1) )
    = sqrt( sum (y - yhat)^2 / (72 - 20 - 1) )
    = sqrt( sum (y - yhat)^2 / 51 )

when it should be calculating:

SEC = sqrt( sum (y - yhat)^2 / ((n/2) - m - 1) )
    = sqrt( sum (y - yhat)^2 / (36 - 20 - 1) )
    = sqrt( sum (y - yhat)^2 / 15 )

where yhat is the value predicted by the model. So the results will be smaller than they should be by a factor of sqrt(15/51), which seems about right.
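
In code, that correction is just a rescaling of the reported value; a sketch using Niklas's 72-spectrum RMSEC of 0.29 and the degrees of freedom above:

import math

n, m = 72, 20                  # 72 duplicate spectra, 20 factors
df_used = n - m - 1            # 51: what the software divides by
df_true = n // 2 - m - 1       # 15: one DF per independent lab value

rmsec_reported = 0.29
rmsec_corrected = rmsec_reported * math.sqrt(df_used / df_true)
print(f"corrected RMSEC: {rmsec_corrected:.2f}")   # about 0.53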

\o/
/_\

hlmark
Posted on Monday, September 19, 2005 - 10:21 am:   

Niklas - the statistics of calibration using duplicate readings of the spectra but the same reference values are very tricky, and I don't think the statistical characterization of this situation has been analyzed.

There are two countervailing effects going on:

On the one hand, by including duplicate spectra in the calibration, there is an opportunity for the model to accommodate whatever physical (or other) effects are present that create the spectral differences, and to include some corrections for them in the model. There is also another opportunity, this one for you: to calculate, from the duplicate predictions of each sample, the amount of error due to the uncorrected parts of the physical effects. Ideally, if the model is indeed correcting for those effects, then the standard deviation due to this error should be small compared to the SEP and SEC. See the recent exchanges in a different thread for a discussion of this - the formulas are the same as for computing reference lab error from duplicate readings.
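
A sketch of that duplicate-error calculation (the function name is mine; the formula is the standard one for estimating a standard deviation from duplicate pairs):

import numpy as np

def sd_from_duplicates(pred_a, pred_b):
    # SD estimated from k duplicate pairs: sqrt( sum(d_i^2) / (2k) ),
    # where d_i is the difference between the two predictions of sample i.
    d = np.asarray(pred_a, dtype=float) - np.asarray(pred_b, dtype=float)
    return np.sqrt(np.sum(d ** 2) / (2 * d.size))

If the model really is correcting the physical effects, this SD should come out small compared to the SEC and SEP.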

On the other hand, using the same reference laboratory values for both spectra means that there are not as many degrees of freedom in the reference lab values as there are spectra - only as many as you have independent lab values. The consequence is that the residuals from the calibration do not conform to one of the fundamental requirements of regression: that the errors be random and independent. In this case the residuals (and the errors) are not independent because they are exactly correlated with the predicted Y values. So you don't have as many true degrees of freedom in the data as the formulas used to compute SEC and SEP assume, and the resulting computation will definitely be optimistic (in statistic-speak, the statistics have become "biased").

Overall, I would say that your calibration results based on the averaged spectra will be more reliable. On the other hand, if you also averaged the prediction data, then the value of SEP you computed will also be optimistic compared to what you can expect to achieve when running unknowns in routine use, unless you also measure two spectra and average them the same way you're doing during the calibration.

Howard

\o/
/_\

Niklas Warne
Posted on Monday, September 19, 2005 - 9:42 am:   

Hi everybody,

I'm conducting a feasibility study based on 36 samples, each scanned in duplicate. IP considerations prevent me from disclosing too much information, but it is a quantitative application related to what we can call "overall quality" in a certain food matrix.

If I use all 72 sample scans and perform a PLS regression on 400-2500 nm with MSC, I get RMSEC and RMSEP of 0.29 and 0.90, respectively. The software suggests 20 PCs, which does seem a lot, but for each added PC both RMSEC and RMSEP decrease. However, if I average the duplicate scans and use 36 samples in the PLS model, I get the following figures: RMSEC 0.97 and RMSEP 1.82, based on 10 PCs; after this, RMSEP starts to increase. In both cases, the plot of regression coefficients looks a bit spiky.

Is it just a case of gross overfitting using duplicate scans? Or something else? I'm using Unscrambler software. Any suggestions are welcome.

Regards
Niklas
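
For readers who want to reproduce the comparison Niklas describes outside Unscrambler, here is a minimal sketch in NumPy/scikit-learn; the spectra below are synthetic stand-ins, and only the 36-samples-in-duplicate structure is taken from the post:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def msc(X):
    # Multiplicative scatter correction: regress each spectrum on the
    # mean spectrum and remove the fitted offset and slope.
    ref = X.mean(axis=0)
    out = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        b, a = np.polyfit(ref, x, 1)      # x ~ a + b * ref
        out[i] = (x - a) / b
    return out

rng = np.random.default_rng(1)
band = np.sin(np.linspace(0.0, 3.0, 700))               # stand-in band shape
X72 = (1.0 + rng.uniform(0.8, 1.2, size=(72, 1)) * band
       + rng.normal(scale=0.01, size=(72, 700)))        # 72 duplicate spectra
y72 = np.repeat(rng.normal(size=36), 2)                 # shared lab values

Xc = msc(X72)
pls = PLSRegression(n_components=10).fit(Xc, y72)
rmsec = np.sqrt(np.mean((y72 - pls.predict(Xc).ravel()) ** 2))

# Averaging the duplicates before modelling (rows assumed to come in pairs):
X36 = Xc.reshape(36, 2, -1).mean(axis=1)
y36 = y72.reshape(36, 2).mean(axis=1)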
