15. Is overfitting based only on noise?

Here is a theoretical question. I was talking with a statistician on staff, and I wondered whether overfitting could ever occur if the calibration spectra had absolutely no noise. The statistician thought it could, since regression mathematics are basically approximations; eventually, he said, there would be some overfitting. I am not entirely convinced: in the strict sense, with perfect data, I cannot see overfitting. Perhaps there may be a situation where additional factors, say in principal component regression, would not add useful information, but at the same time that information would not be harmful; it would be more of a waste of computation time.

The statistician said he would put some money on his conclusion. I didn't take the bet, but if anyone can prove him wrong, I wouldn't be against taking the money and spending it on refreshments at our November meeting. So, any comments?

Bruce Campbell
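Bruce's point about harmless extra factors can be illustrated numerically. A minimal numpy sketch, assuming hypothetical noise-free "spectra" built from only three underlying components: the principal components beyond the true rank carry essentially zero variance, so including them adds nothing to the model (and costs nothing beyond computation time).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free spectra: 20 samples, 50 wavelengths, generated
# from only 3 underlying components, so the data matrix has exact rank 3.
scores = rng.normal(size=(20, 3))
loadings = rng.normal(size=(3, 50))
X = scores @ loadings

# Singular values play the role of principal-component magnitudes.
s = np.linalg.svd(X, compute_uv=False)

print(float(s[2]) > 1e-3)   # True: the third component carries real variance
print(float(s[3]) < 1e-10)  # True: the fourth and beyond are numerically zero
```

With noisy data the trailing singular values would instead be small but nonzero, and a regression could latch onto them; with noise-free data there is simply nothing there to fit.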

My only comment would be that most regression packages do not process perfect data (i.e., absolutely no noise) well. I don't think it could overfit, because the higher-order factors would quickly approach zero. Besides, define overfit. If I predict a constituent to the 6th decimal place and there is uncertainty in the 7th, even if my data are accurate and precise to that degree, is this overfitting? This would be worth a beer to solve. Howard Mark is probably the best person to answer, but maybe I will corral one of our statisticians.

I like Richard Kramer's answer, because it points out that it may be possible that both he and the statistician might be correct. There are other sources of error than the spectrometer, and frequently the most trouble is caused by errors in the reference method. At least there has to be round-off error in the reference data and NIR instrument measurements. Sometimes the errors are increased by inappropriate rounding (use of too few significant figures).

Richard, do you (or anyone else) think it is possible to use so many factors in a PLS regression that the calibration error (SEC or SECV) is less than the round-off error in the reference data? That would be overfitting the data.

Dave Hopkins

Well, if you accept a definition of overfitting as "counter-productive fitting by the regression (of whatever type) to the noise in the data," then it becomes immediately apparent that it is not possible to overfit noise-free calibration data.

Regards,

Richard Kramer

On Wed, 27 Aug 1997, David W. Hopkins wrote:

Richard, do you (or anyone else!) think it is possible to use so many factors in a PLS regression that the calibration error (SEC or SECV) is less than the round-off error in the reference data? That would be overfitting the data.

Simply, yes. But this would not be a case of noise-free data.

Regards,

Richard Kramer
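Richard's "simply, yes" is easy to demonstrate with made-up numbers. A sketch in numpy (principal-component regression standing in for PLS, with a hypothetical constituent): the reference values are exact except for rounding to two decimals, and with enough factors the calibration residual falls below the round-off error, i.e., the model has begun fitting the rounding noise itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noise-free spectra: 12 samples, 30 wavelengths.
n, p = 12, 30
X = rng.normal(size=(n, p))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The true constituent depends on only the first 4 principal components.
y_true = U[:, :4] @ np.array([5.0, 2.0, 1.0, 0.5])

# Reference values reported to 2 decimals, as a lab might report them;
# the s.d. of that uniform rounding error is 0.01 / sqrt(12), about 0.0029.
y_ref = np.round(y_true, 2)
roundoff_sd = 0.01 / np.sqrt(12.0)

def rms_fit_error(k):
    """RMS calibration residual using the first k principal-component scores."""
    T = U[:, :k] * s[:k]                          # PC scores for k factors
    coef, *_ = np.linalg.lstsq(T, y_ref, rcond=None)
    r = y_ref - T @ coef
    return float(np.sqrt(r @ r / n))

# With all n factors the rounded reference values are fitted exactly:
# the calibration error is below the round-off error in the reference data.
print(rms_fit_error(n) < roundoff_sd)   # True
```

As Richard says, though, this is no longer noise-free data: the rounding of the reference values is exactly the noise being overfitted.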

I would say the answer depends on your definition of overfitting. If overfitting is defined as fitting the set of calibration samples while producing an equation which does not predict the validation or test set correctly, then I'd say it is definitely possible under any circumstances. If for X samples, X wavelengths are used, the calibration data will be fitted perfectly, but the calibration could be useless. This is how I think of overfitting: does it work on other samples? Another possibility is to think in terms of perfect spectra. Even with no noise, spectra would not be likely to be perfect (absolutely representative of the sample). There would still be differences in consecutive runs of the same sample due to packing differences, particle size variations, sampling, etc. These are not noise, but they can be taken into account in the equation, and I believe they would result in overfitting.

Jim Reeves
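Jim's X-samples, X-wavelengths case is easy to reproduce. A minimal sketch with invented data: ten "spectra" of pure random numbers fit ten reference values exactly, yet the resulting equation predicts new samples no better than guessing.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: 10 calibration samples, 10 wavelengths of pure random
# numbers carrying no information whatsoever about the constituent y.
X_cal = rng.normal(size=(10, 10))
y_cal = rng.normal(size=10)

# As many wavelengths as samples: the square system is solved exactly,
# so the calibration fit is perfect.
b = np.linalg.solve(X_cal, y_cal)
print(bool(np.allclose(X_cal @ b, y_cal)))   # True: zero calibration error

# ...but the "calibration" is useless on samples it has not seen.
X_val = rng.normal(size=(200, 10))
y_val = rng.normal(size=200)
sep = float(np.sqrt(np.mean((X_val @ b - y_val) ** 2)))
print(sep > 0.5)   # True: the prediction error is large, nowhere near zero
```

The calibration error is identically zero while the validation error is on the order of the spread of y itself, which is exactly the predicts-the-calibration-set-but-nothing-else behavior Jim describes.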