NIR Discussion Forum: Normal distribution or not? Please help


Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 498
Registered: 9-2001

Posted on Thursday, October 25, 2012 - 8:43 am:

Tony - I suspect that Marijana got a bit overwhelmed by all the arguments in the comments, and decided to just write her paper. But still, why don't you just chime in with your comments, instead of asking permission, and then we can all have fun arguing about them, too!!

\o/
/_\

Tony Davies (td)
Moderator
Username: td

Post Number: 285
Registered: 1-2001

Posted on Thursday, October 25, 2012 - 6:40 am:

This thread appears to have gone cold. (Sorry I didn't have time to comment earlier).

Are you still interested Marijana?

Best wishes,

Tony

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 496
Registered: 9-2001

Posted on Wednesday, September 05, 2012 - 8:35 am:

Venky - I'm not familiar with that book, so I don't know which t, chi-square, or F-tests they are comparing. But the fundamental basis underlying ALL statistical testing procedures is the set of results expected from the different probability distributions, using any specified test.

Since each test can only determine whether the data at hand did NOT come from a specified distribution, or that it is consistent with that distribution (which does NOT automatically mean that it did come from that distribution, only that it COULD have), it would be necessary to show the following:

The data is consistent with a Normal distribution.
The data did not come from a uniform distribution.
The data did not come from a t distribution.
The data did not come from an f distribution.
The data did not come from a binomial distribution.
The data did not come from a chi-square distribution.
--etc for all distributions other than Normal---

When there's not enough data, then you can't demonstrate that it could not have come from all those other distributions (hate the double negative, but that's how these things work). If you fail to show that it's not, say, t-distributed, or uniformly distributed, then at best your result is ambiguous: it COULD be Normal, but it also COULD be uniform, or it COULD be t-distributed, etc.

Conceivably one of the tests the book recommends might weed out one or two alternative distributions, if the test really is better, but when the data is insufficient it seems most unlikely that you'll come up with an unambiguous result no matter what you do.

That's where the Bayesian approach comes in. It can tell you which distribution is most likely to be the source of the data, but that's about the best you can do for an objective, mathematically rigorous, answer.
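A minimal sketch of that idea in Python, under loud assumptions: only two candidate distributions (Normal and uniform), equal prior probabilities, and maximised likelihoods standing in for the marginal likelihoods a full Bayesian treatment would integrate over. The function names and the sample values are made up for illustration; the result is a crude ranking of candidates, not a rigorous posterior.

```python
import math
from statistics import NormalDist, mean, pstdev

def log_likelihoods(values):
    """Maximised log-likelihood of the data under two candidate distributions."""
    n = len(values)
    normal = NormalDist(mean(values), pstdev(values))       # ML fit of a Normal
    ll_normal = sum(math.log(normal.pdf(v)) for v in values)
    ll_uniform = -n * math.log(max(values) - min(values))   # ML fit of a uniform
    return {"normal": ll_normal, "uniform": ll_uniform}

def posterior_weights(log_liks, priors=None):
    """Turn log-likelihoods into normalised weights (equal priors by default)."""
    names = list(log_liks)
    priors = priors or {name: 1 / len(names) for name in names}
    top = max(log_liks.values())          # subtract the maximum to avoid underflow
    raw = {m: priors[m] * math.exp(log_liks[m] - top) for m in names}
    total = sum(raw.values())
    return {m: w / total for m, w in raw.items()}

# Made-up protein values, NOT the data from the attached histograms.
sample = [41.2, 42.5, 43.1, 43.8, 44.0, 44.3, 44.9, 45.4, 46.2, 47.7]
print(posterior_weights(log_likelihoods(sample)))
```

Other candidates (t, chi-square, and so on) could be added as further dictionary entries, each with its own fitted log-likelihood.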

\o/
/_\

venkatarman (venkynir)
Senior Member
Username: venkynir

Post Number: 157
Registered: 3-2004

Posted on Wednesday, September 05, 2012 - 12:07 am:

Hi Howard;
Kindly refer to the book "Data Handling in Science and Technology 20A": Handbook of Chemometrics and Qualimetrics, Part A. Here the object is the "protein content distribution within calibration range". Marijana wants to say that the "distribution is poor for sunflower meal". For that, a t-test, chi-square test or F-test, or a judgement method by ranking, is better.
I hope you will agree with me.

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 495
Registered: 9-2001

Posted on Tuesday, September 04, 2012 - 8:38 am:

Marijana - I'm afraid I have to respectfully disagree with Venky. Given the small amount of data, I still don't think you can tell unambiguously what the distribution of the parent data is. What you possibly might be able to do, though, is, given the data, to tell which distributions are more likely (probable) than others.

From the wording of that sentence, then, you have to know that if there's any solution at all, it's going to involve Bayesian statistics, if you think your command of Bayesian statistics is up to snuff.

\o/
/_\

venkatarman (venkynir)
Senior Member
Username: venkynir

Post Number: 156
Registered: 3-2004

Posted on Monday, September 03, 2012 - 10:46 pm:

Dear Marijana;
I have seen your PDF. If the data came from a process, we could extract some meaning from a normal distribution, such as a process capability index and other things, but here it would not carry much meaning. As David suggested, you can apply t, z and other tests, so that we can find the distribution and whether your assumption is true or not, including the type of error.

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 494
Registered: 9-2001

Posted on Monday, September 03, 2012 - 11:39 am:

Marijana - David makes a good point, which I'll restate a little more generally. You haven't said WHY you want to know if those values are Normally distributed.

I think David did a bit of jumping to a conclusion, that you plan to use these samples for calibrating an NIR instrument to do analysis. If his assumption is correct then his conclusion is correct: a Normal distribution is NOT the best one to try for; in that case you want a uniform distribution of values.

If you plan to do something else with these samples (and the data therefrom), however, then a normal distribution may be the best one, or it still may not be; you may be best off with a still different distribution of values. But as with many questions of this type, the "right" answer depends on the application.

Howard

\o/
/_\

David W. Hopkins (dhopkins)
Senior Member
Username: dhopkins

Post Number: 219
Registered: 10-2002

Posted on Monday, September 03, 2012 - 10:22 am:

Hi Marijana,

I'd just like to add some observations to the discussion. I think that concerns about whether the distribution is normal or not are ill-directed. To me, the 55 samples of SBM show that there are 3 classes of protein content, and the samples may be biassed a bit more than ideal toward the high range for generating a protein calibration that will apply across the range of protein content. I am interested that your samples show roughly the same range for SBM products as are produced here in the USA.

On the other hand, the SFM product shows less diversity than the SBM, and the histogram suggests that the 38 samples really don't have the flat range we'd like to have to obtain a good calibration for proceeding onward in measuring samples. It would be desirable to fill out the distribution with high and low protein samples. These 38 samples would be difficult to split into a calibration and validation set of samples. I think you have a bare minimum for a feasibility study. On the other hand, these feed applications are well established, and these data should be sufficient to establish the capability of your instrumentation to make the measurements.
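One rough way to see where a calibration set needs filling out is to bin the reference values across their range and flag bins that hold far fewer samples than an even spread would give. The sketch below assumes that approach; the function name, the bin count and the 50% tolerance are arbitrary illustrative choices.

```python
def sparse_bins(values, n_bins=6, tolerance=0.5):
    """Return (low, high, count) for bins holding fewer than tolerance * the even share."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; there is no range to bin")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)   # the maximum goes in the last bin
        counts[idx] += 1
    even_share = len(values) / n_bins                  # count per bin if perfectly flat
    return [
        (lo + i * width, lo + (i + 1) * width, c)
        for i, c in enumerate(counts)
        if c < tolerance * even_share
    ]

# Made-up protein values bunched in the middle of the range.
protein = [30.0] + [35.0] * 10 + [40.0]
print(sparse_bins(protein, n_bins=2))
```

The flagged intervals are the protein ranges where extra samples would do the most to flatten the distribution.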

Did you have a reasonable range in moisture contents to attempt moisture calibrations for the two meals?

Best wishes,
Dave Hopkins

Howard Mark (hlmark)
Senior Member
Username: hlmark

Post Number: 493
Registered: 9-2001

Posted on Saturday, September 01, 2012 - 2:40 pm:

Marijana - I looked at the histograms. Presumably you want an objective test of Normality, so that you can say whether those data could have been selected (at random) from a Normal distribution. The main finding I get from them is that there are 55 soybean samples and 52 sunflower samples.

This is important. 50+ samples is a good amount for doing a Z-test or t-test of means, a marginal amount if you're doing a chi-square or f-test of variances, but hopelessly small if you want to test distributions.

Otherwise, Julio's idea is basically sound but founders on the reef of too-few samples.

A good objective statistical test for distributions is the Contingency Table test for equality of distributions. The details can be found in chapter 23 of "Statistics in Spectroscopy", but it is based on comparing the distribution of the actual data with a theoretical distribution having the same mean and standard deviation. A requirement of this test is that each bar of the histogram has to have a minimum of 5 samples in it.

With only 50 samples there can be at most ten bars (and probably less, since the data bunches up in certain bars), and that makes for a very weak test. In your histograms, there are only 11 (max) bars, even allowing only two samples under the bar in several cases.
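The arithmetic behind that limit, and the general shape of such a comparison, can be sketched as a chi-square goodness-of-fit calculation against a Normal with the same mean and standard deviation. This is an assumption-laden illustration, not the exact procedure from the book: it uses equal-probability bins so that every bin has the same expected count, and the function name and data are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

def chi_square_normal_fit(values, min_expected=5):
    """Chi-square statistic comparing binned data with a fitted Normal."""
    n = len(values)
    k = n // min_expected              # most bins the sample can support
    if k < 4:
        raise ValueError("too few samples for a meaningful test")
    fitted = NormalDist(mean(values), stdev(values))
    # k - 1 interior cut points, giving each bin probability 1/k under the fit
    edges = [fitted.inv_cdf(i / k) for i in range(1, k)]
    observed = [0] * k
    for v in values:
        observed[sum(v > e for e in edges)] += 1   # index = number of edges below v
    expected = n / k
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    dof = k - 1 - 2                    # two parameters were estimated from the data
    return statistic, dof, observed

random.seed(0)
# Simulated values, NOT the data from the attached histograms.
protein = [random.gauss(44.0, 2.0) for _ in range(55)]
stat, dof, counts = chi_square_normal_fit(protein)
print(f"bins={len(counts)} dof={dof} chi2={stat:.2f}")
```

With 55 samples this allows 11 bins and leaves 8 degrees of freedom, so the statistic would be compared with a tabulated chi-square critical value (15.51 at the 5% level for 8 degrees of freedom).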

The problem with having a weak test comes out of the inherent nature of statistical testing. You can never show that a data set is Normal, you could only show that it's NOT Normal (assuming that's the case). Otherwise you can only show that the data is CONSISTENT with being drawn from a Normal distribution.

When the test is weak, whether from too-few samples or any other cause, then almost ANY set of data will be consistent with being Normal. The other side of that coin is the problem you then run into: the data set will then also be consistent with a uniform, t, binomial, Poisson or chi-square distribution, or almost any other distribution you would test it against. You can't tell the difference; it could have come from ANY distribution. And that's not useful.
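A small simulation can make the weakness concrete. The sketch below is an illustration under stated assumptions, not the test described above: it swaps in a Kolmogorov-Smirnov-style distance to a fitted Normal (a Lilliefors-type test) with a Monte Carlo critical value, draws data that is truly uniform, and counts how often the test rejects Normality at two sample sizes. All function names and trial counts are arbitrary choices.

```python
import random
from statistics import NormalDist, mean, stdev

def ks_to_fitted_normal(values):
    """Largest gap between the empirical CDF and a Normal fitted to the data."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        c = fitted.cdf(x)
        gap = max(gap, (i + 1) / n - c, c - i / n)
    return gap

def critical_value(n, alpha, trials, rng):
    """Monte Carlo critical value: the (1 - alpha) quantile under Normal data."""
    stats = sorted(
        ks_to_fitted_normal([rng.gauss(0, 1) for _ in range(n)])
        for _ in range(trials)
    )
    return stats[int((1 - alpha) * trials)]

def rejection_rate(draw, n, critical, trials, rng):
    """Fraction of samples of size n from `draw` that the test rejects."""
    hits = sum(
        ks_to_fitted_normal([draw(rng) for _ in range(n)]) > critical
        for _ in range(trials)
    )
    return hits / trials

rng = random.Random(1)
for n in (50, 500):
    crit = critical_value(n, 0.05, 1000, rng)
    power = rejection_rate(lambda r: r.uniform(0, 1), n, crit, 300, rng)
    print(f"n={n}: uniform data rejected as non-Normal in {power:.0%} of trials")
```

Every sample that is not rejected is uniform data being declared "consistent with Normal", so comparing the two printed rates shows how much the sample size matters.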

\o/
/_\

Gabi Levin (gabiruth)
Senior Member
Username: gabiruth

Post Number: 75
Registered: 5-2009

Posted on Saturday, September 01, 2012 - 11:09 am:

Marijana,

None look like normal, but the sunflower seems somewhat better than the soybean.
The high occurrence in the soybean is on the high side - now if this is a special genetic version that tends to have high protein it can be understood, and the low occurrence of the low protein is not unusual - could be the result of some cross pollination or similar.

Gabi Levin

Marijana Maslovaric (vidra)
Member
Username: vidra

Post Number: 15
Registered: 5-2011

Posted on Saturday, September 01, 2012 - 8:30 am:

Thank you very much.
I'm just writing a short paper and I would like to say something about the protein content distribution within the calibration range. Maybe I should just say that the distribution is poor for sunflower meal, and point out the range within which the largest number of samples lies.

Julio Trevisan (lascanter2010)
Member
Username: lascanter2010

Post Number: 15
Registered: 8-2010

Posted on Saturday, September 01, 2012 - 8:10 am:

Although the soybean does look normal and the sunflower does not, in a scientific context you should instead perform a normality hypothesis test.

http://en.wikipedia.org/wiki/Normality_test

Why do you need to determine whether they are normal distributions?

Marijana Maslovaric (vidra)
Member
Username: vidra

Post Number: 14
Registered: 5-2011

Posted on Saturday, September 01, 2012 - 7:56 am:

I'm sending two charts, and I'm wondering if any of these distributions can be considered as a normal distribution.

Attachments:
histogram_soybean meal.pdf (39.1 k)
histogram_sunflower meal.pdf (37.7 k)