Prediction of protein in soybean meal

jhonattan.baez's picture

Hi everybody


I have built a calibration model for protein in soybean meal.


In the calibration, I have obtained R^2 = 94.75, RMSEE = 0.223, RPD = 4.36

In the cross-validation, I have obtained R^2 CV = 93.72, RMSECV = 0.241, RPD = 3.99

The calibration contains 120 samples of soybean meal, and the lab values for protein were obtained by LECO analysis.
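The three statistics quoted above are not independent. A minimal sketch, assuming R^2 is defined as 1 - SSE/SST and RPD as SD/RMSE with matching denominators (software packages differ slightly in the degrees of freedom they use), shows that R^2 is then roughly 1 - 1/RPD^2, and that RPD times RMSE gives the standard deviation of the reference values. The function names are illustrative only.

```python
def r2_from_rpd(rpd):
    """Approximate R^2 (as a fraction) implied by RPD, assuming R^2 = 1 - 1/RPD^2."""
    return 1.0 - 1.0 / rpd ** 2


def implied_sd(rpd, rmse):
    """Standard deviation of the reference values implied by RPD = SD / RMSE."""
    return rpd * rmse


# Values quoted in the post above.
for label, rpd, rmse in [("calibration", 4.36, 0.223),
                         ("cross-validation", 3.99, 0.241)]:
    r2_percent = 100 * r2_from_rpd(rpd)
    print(f"{label}: R^2 ~ {r2_percent:.2f} %, implied SD ~ {implied_sd(rpd, rmse):.3f}")
```

Under these assumptions the reported R^2 and RPD values agree with each other, so they carry essentially the same information. The RMSE values are the figures to compare against the accuracy the application needs.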

I have some questions:

- According to the values of R^2, RMSEE, etc., can it be said that this is a good calibration?

- I started with 198 samples, but with the aim of achieving a high correlation I removed 78 samples. Is that bad?

- I have tested my calibration with some samples, and their predictions show a difference of approximately 1%. What could be the problem?

- Do you think that I should do a test set validation rather than cross-validation? If so, should I do the test set validation with all the samples (198), or should I do the calibration with the same samples that I used in cross-validation?

I apologize for my bad English.


Uploaded Images: 
caporasonicola's picture

To comment on the quality of a calibration you should look at the range: in your case it seems that there are only a few points (2? 4?) at the lower end of the range. I have a question about that: do you have duplicate points? Those points seem to be in the same position.
Even excluding those points, you still have a range of ~5%, with an error of 0.24, which doesn't seem very high, but it depends on the application you need. This is a cross-validation error, so to get a more realistic error you should split your sample set into a calibration set and an independent validation set.

  • Samples removed: you removed almost 40% of the samples. Why? How did you choose them? Are they outliers for sure?
  • I guess you treated the new samples the same way as the ones you used for the calibration, e.g. grinding to the same particle size etc. I think the error is due to the calibration...
  • Yes, the number of samples is quite high, so you can try an external validation instead of cross-validation. Also look again at the samples you have removed.

Best regards

jhonattan.baez's picture

Hi Nick, thanks for your answer.
- Yes, each sample has been measured three times.
- I removed these samples according to their difference from the lab protein values: I removed the samples with a residual of more than 0.48 %. That made the R^2 increase.
- The grinding size is the same for all samples, and the application of the calibration is the prediction of protein percentage in soybean meal.
- On the other hand, I don't understand how I should choose the samples for test set validation. Are there specific criteria?
- I have read papers about calibration, and most use test set validation rather than cross-validation, but nobody says why. Is test set validation better than cross-validation? Is it possible that if I use test set validation, I will obtain better predictions?
Regards from Colombia.

gabiruth's picture

Your range is really 4.5 if we ignore the two low points. However, if you removed so many samples to get to the current R^2, there must be a problem with some parameter. If your range is 4.5, the maximum acceptable uncertainty in the reference data must not exceed 4.5 multiplied by 0.05, i.e. 0.225 in total, so in this case the reference error shall not exceed (+/-)0.112; otherwise the calibration isn't going to be very good. So please check the quality of the reference data before running too many chemometric trials. In practice, it is recommended that the error not exceed 0.04 times the range.
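The rule of thumb above is simple arithmetic. A small sketch, assuming the (+/-) figure is half of the total band (which is how 4.5 x 0.05 = 0.225 becomes +/-0.112); the function name is illustrative:

```python
def max_reference_error(value_range, fraction=0.05):
    """Return (total band, +/- half band) allowed for the reference method."""
    total = fraction * value_range
    return total, total / 2.0


protein_range = 4.5  # range quoted in the post, ignoring the two low points

for fraction in (0.05, 0.04):
    total, half = max_reference_error(protein_range, fraction)
    print(f"fraction {fraction}: total {total:.3f}, +/- {half:.4f}")
```

With the stricter 0.04 factor the same calculation gives a total band of 0.18, or about +/-0.09.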
Normally the ground rule for removing OUTLIERS when doing PLS1 or any other regression is that you remove at most 10% of the samples. If you have to remove 40%, there is a problem with either the reference data or the spectra. I would check whether the samples are really chemically stable and that no enzymatic activity changed the composition between the reference analysis and the collection of the spectra. I have seen that samples from a live organic source change with time, in particular if they are exposed to air, moisture and temperature and the surface area is large after grinding.
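For reference, the fraction removed in this thread can be checked against the 10% guideline with a couple of lines. This is only a sketch; the function name is illustrative:

```python
def removed_fraction(n_start, n_removed):
    """Fraction of the original samples that were discarded."""
    return n_removed / n_start


fraction = removed_fraction(198, 78)  # counts given in the original post
print(f"removed {100 * fraction:.1f} % of the samples")
print("exceeds the 10 % guideline:", fraction > 0.10)
```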
I hope this will help you
Gabi Levin, Ph.D.
Brimrose Corporation
Home of the Luminar AOTF spectrometers 

dwhopkins's picture

Hi Nick,
Welcome to the real world of NIR calibration. It is an ongoing activity. You should be continually testing your models, at a frequency that depends on your sense of risk. I agree with Gabi's comments, and I have several comments to add.
Cross-validation is best used to decide how many factors to use in your PLS; it is not a good validation method. It is not clear how the CV was performed, but a valid CV should leave out all 3 of the NIR reps of a sample in each loop, or you will essentially have the same samples in the calibration set as in the validation set. You should check whether your software will allow you to do that.
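If the software cannot do this directly, the fold assignment can be prepared outside it. Below is a minimal sketch in plain Python, assuming each spectrum carries a sample ID; the IDs and the function name are hypothetical. All replicate spectra of a sample always end up in the same fold.

```python
from collections import defaultdict


def grouped_folds(sample_ids, n_folds):
    """Yield (train_idx, test_idx); all spectra of one sample share a fold."""
    groups = defaultdict(list)
    for idx, sid in enumerate(sample_ids):
        groups[sid].append(idx)
    unique_ids = sorted(groups)
    for k in range(n_folds):
        # Every n_folds-th sample, starting at k, is held out in this loop.
        held_out = set(unique_ids[k::n_folds])
        test_idx = [i for sid in unique_ids if sid in held_out for i in groups[sid]]
        train_idx = [i for sid in unique_ids if sid not in held_out for i in groups[sid]]
        yield train_idx, test_idx


# Hypothetical layout: 6 samples, each scanned 3 times.
sample_ids = [f"S{n}" for n in range(1, 7) for _ in range(3)]
folds = list(grouped_folds(sample_ids, n_folds=3))
for train_idx, test_idx in folds:
    print("held out:", sorted({sample_ids[i] for i in test_idx}))
```

For those working in Python, scikit-learn's GroupKFold implements the same idea.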
So, the best idea is to randomly select samples for a calibration set (C) and a validation set (V), in your case maybe 22 for validation and 44 for calibration. Keep all 3 reps together. Since you only have 2 samples at low protein, I'd keep both of those in the calibration set and accept that the calibration will not be well validated at that end. You really need to gather more samples, and focus on selecting more at the low range.
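A random split of this kind can be sketched as follows, assuming 66 distinct samples (44 + 22) identified by hypothetical IDs, with the two low-protein samples forced into the calibration set. The split is made per sample, so the 3 replicate spectra of a sample follow their sample ID into the same set.

```python
import random


def split_samples(sample_ids, n_validation, force_calibration=(), seed=0):
    """Randomly pick validation samples; forced samples stay in calibration."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    forced = set(force_calibration)
    candidates = sorted(set(sample_ids) - forced)
    validation = set(rng.sample(candidates, n_validation))
    calibration = set(sample_ids) - validation
    return sorted(calibration), sorted(validation)


# Hypothetical IDs; S01 and S02 stand in for the two low-protein samples.
all_samples = [f"S{n:02d}" for n in range(1, 67)]
calibration, validation = split_samples(
    all_samples, n_validation=22, force_calibration=["S01", "S02"]
)
print(len(calibration), "calibration samples,", len(validation), "validation samples")
```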
Having 3 reps is a wonderful idea, as it can tell you the reproducibility of the NIR method. Use Pooled Std Dev of Replicates = sqrt([Sum(SD^2)]/Nrep); you may have to use Excel. To find out the Std Error of the Lab method (SEL), you could use the same calculation on triplicate results from the Leco. With soybean meal, you may have considerable sampling error. I hope you are using samples of 1 gram for the Leco. The use of 0.1 g would be bad, because of sample non-homogeneity at that level. Generally speaking, you may get a better return for your effort by running your samples only in duplicate, for both the lab and NIR. Then calculate SDD = sqrt([Sum(diff^2)]/(2*N)) for both lab and NIR, where diff means the difference between each pair of readings and N is the number of paired readings.
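Both formulas are easy to compute outside Excel as well. The sketch below assumes that SD is the sample standard deviation of each replicate set and that the divisor in the pooled formula is the number of replicate sets, with the same number of reps in every set; that reading of "Nrep" is an assumption. The protein values are made up purely for illustration.

```python
import math
import statistics


def pooled_sd(replicate_sets):
    """sqrt(sum(SD^2) / number of sets), for sets with equal replicate counts."""
    variances = [statistics.variance(reps) for reps in replicate_sets]
    return math.sqrt(sum(variances) / len(variances))


def sdd(pairs):
    """Standard deviation of differences: sqrt(sum(diff^2) / (2 * N))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs)))


# Made-up triplicate and duplicate protein readings (%), for illustration only.
triplicates = [(47.1, 47.3, 47.2), (45.8, 46.1, 45.9), (48.4, 48.2, 48.5)]
duplicates = [(47.1, 47.3), (45.8, 46.1), (48.4, 48.2)]

print(f"pooled SD of triplicates: {pooled_sd(triplicates):.3f}")
print(f"SDD of duplicates:        {sdd(duplicates):.3f}")
```

For duplicate readings the two formulas give the same number, since the variance of a pair is diff^2/2.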
Best wishes,
Dave Hopkins