OPLS-DA - good separation but bad Q2

NIR Discussion Forum » Bruce Campbell's List » Chemometrics » OPLS-DA - good separation but bad Q2


Tony Davies (td)
Moderator
Username: td

Post Number: 154
Registered: 1-2001
Posted on Tuesday, May 22, 2007 - 2:24 pm:   

Hi Bernard,

Just two minor comments:
26 and 22 objects is not great, but you should be very careful about deleting outliers. Can you recognise them as outliers in the original spectra? If yes, probably OK; if no, worry!
Cross-validation is NOT the same as using separate training and validation sets. All sorts of things can go wrong, but I agree that you may have too few samples for separate sets.
I'm not very keen on SIMCA, but it should be OK for two groups. How many PCs are you using? A common error is not to use enough.
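Tony's point that cross-validation is not the same as a held-out validation set can be made concrete. A minimal numpy sketch (toy one-variable data with the thread's 26 + 22 group sizes, classes coded +/-1 - all numbers hypothetical) comparing leave-one-out Q2 with the fit on a truly separate split:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 26 objects around -1, 22 around +1, one predictor variable
X = np.vstack([rng.normal(-1, 1, (26, 1)), rng.normal(1, 1, (22, 1))])
y = np.r_[np.full(26, -1.0), np.full(22, 1.0)]

def fit_predict(Xtr, ytr, Xte):
    # Least-squares fit with an intercept column
    A = np.c_[Xtr, np.ones(len(Xtr))]
    coef, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return np.c_[Xte, np.ones(len(Xte))] @ coef

# Leave-one-out cross-validation -> Q2 = 1 - PRESS/SS
press = 0.0
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    pred = fit_predict(X[mask], y[mask], X[~mask])
    press += (y[i] - pred[0]) ** 2
q2 = 1 - press / np.sum((y - y.mean()) ** 2)

# Separate training/validation split (2/3 vs 1/3, shuffled):
# a genuinely different test of the model
idx = rng.permutation(len(y))
tr, te = idx[:32], idx[32:]
pred = fit_predict(X[tr], y[tr], X[te])
r2_val = 1 - np.sum((y[te] - pred) ** 2) / np.sum((y[te] - y[te].mean()) ** 2)
print(round(q2, 2), round(r2_val, 2))
```

The two numbers estimate different things: Q2 reuses every sample for both fitting and testing (one at a time), while the split keeps 16 samples the model never saw - which is why they can disagree, and why with only 48 objects the split becomes unreliable.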

Note to David:
Hello Dave,
EMF (Enhanced Windows Metafile) is a standard Windows graphic format, but a very ancient one! It will insert OK into a Word file.

Best wishes,

Tony

David W. Hopkins (dhopkins)
Senior Member
Username: dhopkins

Post Number: 112
Registered: 10-2002
Posted on Tuesday, May 22, 2007 - 11:38 am:   

Hi Bernard,

I thought that you were doing a process I know as PLS-DA, and you may be. This appears to be a scores plot, and there is considerable overlap of the 2 populations: if you draw circles around the red points and the blue points, the circles have a considerable area in common. For a good separation they would not overlap at all. A PCA plot of the combined data for both populations would look substantially like this, so there is no point in doing that, as I suggested earlier.

I am largely unfamiliar with the terminology you are using. What software package are you using?

Please upload a plot of the original scans; that might help to suggest possible spectral pretreatments or wavelength selections.

For others who may like to observe the data, I had no idea what a *.emf file might be. I found that by renaming the file *.jpg, I could view it.

Best regards,
Dave

Bernard North (bnorth)
New member
Username: bnorth

Post Number: 2
Registered: 5-2007
Posted on Tuesday, May 22, 2007 - 8:12 am:   

Dear Gavriel and David,
Many thanks for your prompt suggestions to my query, which were very helpful.
Sorry not to get back earlier.
I do have 26 and 22 objects in each category, which I thought might be enough. I'm not using a separate validation set - I thought that the cross-validation of Q2 was a similar idea?
Perhaps our data are numerous enough to split into training and test sets; I'll see if SIMCA does that, I know R does.
The suggestions re a truth table and separate PCA models are very good, and I'll also try to find out what our reference method error is.
I'm uploading an example - I thought I had a better plot before where the x axis does, as you say, seem to separate into 2 groups. There is overlap on the y axis, but I thought that was for the OPLS-DA "explaining x" score rather than the x axis "discriminating y" score, and so was to be expected?
Many thanks again
Score Scatter Plot [M2].emf (19.1 k)

Gavriel Levin (levin)
Senior Member
Username: levin

Post Number: 35
Registered: 1-2006
Posted on Friday, May 18, 2007 - 6:44 am:   

Hi Guys,

I must admit, I have never tried to do anything with an R2 of 0.3 (R approximately 0.55), or even with an R2 of 0.5, because in my efforts to provide a user with a reliable, long-standing tool for controlling his process, we believe that an R2 of 0.8 is practically a must.

My immediate suggestion would be to check the following:

1. The number of samples should be >20.
2. The range between the minimum and maximum value should be at least 10 times the reference method error.
3. Do the loading weights show a high contribution to the regression at the wavelengths where the measured species strongly absorbs?

If all these conditions are met, then it is worthwhile to continue the effort to improve the results - and they should improve.
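The first two checks are purely numerical, so they are easy to script. A sketch with hypothetical numbers (the 26 + 22 objects from the thread; the reference-method error is an assumed placeholder - check 3, on the loading weights, is a visual inspection and is not coded here):

```python
# Gabi's pre-calibration sanity checks (thresholds from his post;
# the example values below are hypothetical)
n_samples = 26 + 22            # objects in the two categories
y_min, y_max = 0.0, 1.0        # reference value range (assumed)
ref_error = 0.05               # reference-method error (assumed)

checks = {
    "enough samples (>20)": n_samples > 20,
    "range >= 10x reference error": (y_max - y_min) >= 10 * ref_error,
}
print(checks)
```

With these placeholder numbers both conditions pass; with a reference error above 0.1 the second check would fail and, on this advice, the calibration effort would not be worth continuing.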

Since we have no knowledge about answers to these questions, I prefer not to speculate.

Thanks,

Gabi Levin
Brimrose


David W. Hopkins (dhopkins)
Senior Member
Username: dhopkins

Post Number: 111
Registered: 10-2002
Posted on Friday, May 18, 2007 - 12:38 am:   

Hi Bernard,

I have been thinking more about your question, and I'm not sure I got to the heart of it. It seems you expect better results, since your plot looks like you have good separation.

What I was saying is, with an R2 of 0.3, I don't understand why you say you apparently have good discrimination. Perhaps your eye is playing tricks because the x-axis shows 2 clear groups; you have to check whether the 2 groups are also non-overlapping in the y-axis. Can you upload a plot of your results, so we can see what you are seeing?

Another question, how many factors are you employing for your prediction equation?

If you do a PCA of your data set, can you see 2 groups in the early factors? If not, you may need to look into some useful data pretreatment.
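Dave's quick PCA check can be done in a few lines of numpy (SVD on the mean-centred data matrix). A sketch on synthetic spectra with the thread's dimensions - 26 + 22 objects, 250 variables, and a hypothetical spectral offset between the groups:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, p = 26, 22, 250
offset = np.sin(np.linspace(0, 3, p))        # hypothetical group difference
X = np.vstack([rng.normal(0, 1, (n1, p)),
               rng.normal(0, 1, (n2, p)) + offset])

Xc = X - X.mean(axis=0)                      # mean-centre
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                               # PCA scores, one column per factor

# Do the group means differ along the first factor?
g1, g2 = scores[:n1], scores[n1:]
sep_pc1 = abs(g1[:, 0].mean() - g2[:, 0].mean())
print(round(sep_pc1, 2))
```

If `sep_pc1` (or the same statistic on the second or third column) is small relative to the within-group spread, the class difference is not in the early factors, which is exactly the situation where Dave suggests trying a data pretreatment before the discriminant model.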

Best wishes,
Dave

David W. Hopkins (dhopkins)
Senior Member
Username: dhopkins

Post Number: 110
Registered: 10-2002
Posted on Thursday, May 17, 2007 - 11:54 am:   

Hi Bernard,

There is nothing wrong with 250 spectral variables, as long as they encompass a useful spectral range. More to the point may be the question, how many samples do you have? You need to have enough so that you can trust your calibrations. Do you have a separate set of validation samples?

You can have a pretty bad R-squared value and still have a useful calibration, if the validation sample set statistics are consistent with those of the calibration set. I'd say the worse the R2, the more samples I'd want to have in the calibration (C) and validation (V) sets.

An R2 of 0.3 is not very good. Perhaps you can get an idea of whether you have a useful calibration by constructing a truth table: the percent predicted + that were +, predicted + that were -, predicted - that were -, and predicted - that were +. If your discrimination gives as good a separation as you think, you should obtain error rates of only about 5%. You have to decide what error is acceptable.
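The truth table Dave describes is a 2x2 confusion matrix. A minimal sketch of the bookkeeping, using hypothetical predictions just to show the tally:

```python
# Actual and predicted class labels (hypothetical example data)
actual    = ["+", "+", "+", "-", "-", "-", "-", "+"]
predicted = ["+", "+", "-", "-", "-", "+", "-", "+"]

# 2x2 truth table keyed by (actual, predicted)
table = {("+", "+"): 0, ("+", "-"): 0, ("-", "+"): 0, ("-", "-"): 0}
for a, p in zip(actual, predicted):
    table[(a, p)] += 1

# Off-diagonal cells are the misclassifications
errors = table[("+", "-")] + table[("-", "+")]
error_rate = errors / len(actual)
print(table, f"error rate = {error_rate:.0%}")   # error rate = 25%
```

With a genuinely good separation the off-diagonal counts should be near zero; an error rate well above the ~5% Dave mentions suggests the apparent separation in the scores plot is not carrying over to prediction.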

I hope this helps. Perhaps others will have a different take on your question.

Best wishes,
Dave

Bernard North (bnorth)
New member
Username: bnorth

Post Number: 1
Registered: 5-2007
Posted on Thursday, May 17, 2007 - 10:43 am:   

Sorry if this is a silly question.
I am trying to discriminate two groups on the basis of 250-ish spectroscopic variables.
I have a small Q2 of 0.04 (R2 of 0.3), but the predictive score vs orthogonal score plot seems to show good separation of the two groups according to the predictive score (-ve for one group, +ve for the other).
Why is this, please?

I should say that I have, rightly or wrongly, already deleted 2 outliers (one well outside the Hotelling's T² ellipse, and the other sitting in the wrong group in the scores plot).
Before I deleted these two points the Q2 was negative (-0.04).
Many thanks in advance for your views ...
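The negative Q2 Bernard mentions is worth unpacking: assuming the usual definition Q2 = 1 - PRESS/SS used by SIMCA-style software, Q2 goes below zero whenever the cross-validated predictions are worse than simply predicting the mean. A toy numpy sketch with hypothetical numbers:

```python
import numpy as np

# Actual class values (coded +/-1) and hypothetical cross-validated predictions
y    = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
yhat = np.array([0.2, 0.1, -0.3, 0.0, 0.1])

press = np.sum((y - yhat) ** 2)        # prediction error sum of squares
ss    = np.sum((y - y.mean()) ** 2)    # total sum of squares about the mean
q2 = 1 - press / ss
print(round(q2, 2))                    # → -0.11
```

So a Q2 near zero (or negative) means the model has essentially no cross-validated predictive ability, however clean the scores plot looks - which is the apparent paradox the replies above address.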
