MATH 376 -- Probability and Statistics II
How can we tell if sampled data are coming from a normal population?
 

February 25, 2010 

 

In deriving the small-sample formulas for confidence intervals for 

population means and differences of population means, recall that 

we had to make some rather restrictive additional hypotheses in  

order to apply the facts we know about  t-distributions: 

 

   *  we need to assume that the samples are drawn from  

      normal populations 

 

   *  for the difference of means case, we need to assume  

      the two population variances are the same. 

 

We have seen how to derive some evidence about whether the  

equality of population variances is reasonable, via our confidence 

intervals for $\sigma_1^2/\sigma_2^2$ (derived using the F-distribution).

 

But this still leaves some questions:   

 

    1)  Can we test whether it is reasonable to assume the samples 

         are coming from a normal population? 

 

    2)  How much of a difference does it make if they are not? 

 

The answer to the second question is basically that it can make 

quite a bit of difference depending on how much the distribution 

deviates from normality.   The distribution of $(\bar{Y} - \mu)\sqrt{n}/s$

can be very different from a t-distribution if the underlying distribution  

is not normal.  Trying to use the formulas we have derived in a case like  

that can yield misleading and/or meaningless results! 
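
To make this concrete, here is a small simulation sketch in Python (not part of the Maple worksheet). It draws many samples of size n = 5 and records how often the statistic $(\bar{Y} - \mu)\sqrt{n}/s$ falls outside $\pm 2.776$, the tabulated $t_{0.025}$ value for 4 degrees of freedom. The exponential population, the sample size, the number of repetitions, the seed, and the names `t_statistic` and `miss_rate` are all assumptions chosen for illustration.

```python
import math
import random


def t_statistic(sample, mu):
    """Return (ybar - mu) * sqrt(n) / s for one sample."""
    n = len(sample)
    ybar = sum(sample) / n
    s = math.sqrt(sum((y - ybar) ** 2 for y in sample) / (n - 1))
    return (ybar - mu) * math.sqrt(n) / s


def miss_rate(draw, mu, n=5, reps=20000, t_crit=2.776, seed=376):
    """Fraction of simulated samples with |T| > t_crit.

    t_crit = 2.776 is the table value t_{0.025} with n - 1 = 4 degrees
    of freedom, so it only matches the default n = 5 (an assumption).
    """
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        if abs(t_statistic(sample, mu)) > t_crit:
            misses += 1
    return misses / reps


# Normal population with mean 0: the rate should be close to 0.05.
normal_rate = miss_rate(lambda rng: rng.gauss(0.0, 1.0), mu=0.0)

# Exponential population with mean 1 (strongly skewed, illustrative choice).
skewed_rate = miss_rate(lambda rng: rng.expovariate(1.0), mu=1.0)

print("normal population:     ", normal_rate)
print("exponential population:", skewed_rate)
```

If the t-distribution really described the statistic in both cases, both rates would be near 0.05. A noticeably larger rate for the skewed population means a nominal 95% interval would miss the true mean more often than advertised.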

 

Thus, it becomes important to know that there are ways--some more 

intuitive, some more precise--to get some indication whether it is 

reasonable to make the assumption that a collection of sampled data
has been drawn from a normal population.

The first way we will discuss is a graphical method called the
 

Normal Probability Plot.  The idea here is to compare the distribution 

of the sampled data values with the distribution we would expect 

from a normal population in a particular way -- we order the 

sample and compare those ordered values to the  

``normal scores'' -- the expected values of the order statistics for
a sample of the same size drawn from a normal population.
Here are the necessary calculations.

The standard normal pdf  and  cdf  first:
 

f := y -> exp(-y^2/2)/sqrt(2*Pi);

        y -> exp(-y^2/2)/sqrt(2*Pi)                    (1)

F := y -> int(f(t), t = -infinity .. y);

        y -> int(f(t), t = -infinity .. y)             (2)
 

The following computes the expected values for the order statistics
of a sample of size n from a standard normal population, using our
formulas for the pdfs of the order statistics:

 

NormalScores := n -> [seq(evalf(Int(y*n!*F(y)^(k-1)*(1 - F(y))^(n-k)*f(y)/((k-1)!*...

        n -> [seq(evalf(Int(y*n!*F(y)^(k-1)*(1 - F(y))^(n-k)*f(y)/((k-1)!*...      (3)
 

NS12 := NormalScores(12);

[-1.629227640, -1.115732184, -.7928381991, -.5368430214, -.3122488787, -.1025896798, .1025896798, .3122488787, .5368430214, .7928381991, 1.115732184, 1.629227640]    (4)
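
As a cross-check outside Maple, the normal scores can be recomputed with a short Python sketch. It uses the standard formula for the expected value of the k-th order statistic, $E[Y_{(k)}] = \int y \, \frac{n!}{(k-1)!\,(n-k)!} F(y)^{k-1} (1 - F(y))^{n-k} f(y) \, dy$, and evaluates the integral with Simpson's rule. The integration range [-10, 10], the number of steps, and the function names are assumptions of this sketch, not taken from the worksheet.

```python
import math


def pdf(y):
    """Standard normal density f(y)."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)


def cdf(y):
    """Standard normal cdf F(y), written with erfc."""
    return 0.5 * math.erfc(-y / math.sqrt(2.0))


def sf(y):
    """Upper tail 1 - F(y), computed directly to avoid cancellation."""
    return 0.5 * math.erfc(y / math.sqrt(2.0))


def normal_score(k, n, lo=-10.0, hi=10.0, steps=4000):
    """Expected value of the k-th order statistic (k = 1..n) of a
    standard normal sample of size n, by Simpson's rule on [lo, hi].
    steps must be even; the range and step count are assumptions."""
    coef = math.factorial(n) / (math.factorial(k - 1) * math.factorial(n - k))

    def g(y):
        return y * coef * cdf(y) ** (k - 1) * sf(y) ** (n - k) * pdf(y)

    h = (hi - lo) / steps
    total = g(lo) + g(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3.0


def normal_scores(n):
    """List of the n normal scores, smallest first."""
    return [normal_score(k, n) for k in range(1, n + 1)]


ns12 = normal_scores(12)
print([round(v, 4) for v in ns12])
```

The printed values can be compared entry by entry with the Maple list (4).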
 


Let's apply this to the following collection of 12 net incomes from
a random sample of tax returns:

Incomes := [7.8, 9.7, 10.6, 12.7, 12.8, 18.1, 21.2, 33.0, 43.5, 51.1, 81.4, 93.1];

[7.8, 9.7, 10.6, 12.7, 12.8, 18.1, 21.2, 33.0, 43.5, 51.1, 81.4, 93.1]    (5)
 

read  

 

 

The normal probability plot is just the scatter plot of the points
with the income as x and the corresponding normal score as y:

ScatterPlot(Incomes, NS12);

[Plot: scatter plot of Incomes (x) against the normal scores NS12 (y)]
 


What are we looking for?  Well, if the data are coming from a normal
population, then we would expect the scatter plot to be nearly a straight line.
Some ``random up and down variation'' from that line would also be expected
because of chance errors coming from the sampling process.

In this example, though, notice that there is an apparent systematic
curvature in the scatter plot.  If you see this kind of pattern, it is a strong
indication that assuming normality of the underlying population is
probably not justified.
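
One rough way to put a number on ``how straight'' the plot is would be the ordinary correlation coefficient between the ordered data and the normal scores. The Python sketch below does this for the income data, with the scores copied from the Maple output (4). It is only an informal companion to looking at the plot: the function name `correlation` and the artificial comparison data set `linear` are assumptions of this sketch, and no cutoff value is implied.

```python
import math

# Ordered incomes and the normal scores for n = 12 (from the worksheet).
incomes = [7.8, 9.7, 10.6, 12.7, 12.8, 18.1, 21.2, 33.0, 43.5, 51.1, 81.4, 93.1]
ns12 = [-1.629227640, -1.115732184, -0.7928381991, -0.5368430214,
        -0.3122488787, -0.1025896798, 0.1025896798, 0.3122488787,
        0.5368430214, 0.7928381991, 1.115732184, 1.629227640]


def correlation(xs, ys):
    """Sample correlation coefficient of the paired values (xs[i], ys[i])."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Sort the data first, so the k-th smallest value pairs with the k-th score.
r_incomes = correlation(sorted(incomes), ns12)

# Artificial data lying exactly on a line in the plot, for comparison only.
linear = [30.0 + 20.0 * z for z in ns12]
r_linear = correlation(linear, ns12)

print("incomes:       r =", round(r_incomes, 3))
print("exactly linear: r =", round(r_linear, 3))
```

Points lying exactly on a line give r = 1. For the incomes the value comes out at about 0.91, which is lower and consistent with the curvature visible in the plot.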
 

For now, we will leave this discussion at this somewhat intuitive level.  We will 

return to these questions later on, in our discussion of regression (finding a 

line that ``best fits a collection of data'')  and the correlation coefficient that 

we discussed briefly last semester in connection with covariances.   

For your general ``statistical literacy,'' it is also important to be aware that 

there are a number of other much more sophisticated statistical normality 

tests that are commonly used to get information in this situation.   Maple has
one very good one (called the Shapiro-Wilk W Test) as part of its Statistics
package.

 

with(Statistics):

infolevel[Statistics] := 1; ShapiroWilkWTest(Incomes, level = 0.05):

 

1
Shapiro and Wilk's W-Test for Normality
---------------------------------------
Null Hypothesis:
Sample drawn from a population that follows a normal distribution
Alt. Hypothesis:
Sample drawn from population that does not follow a normal distribution

Sample size:             12
Computed statistic:      0.817741
Computed pvalue:         0.0131943

Result: [Rejected]
There exists statistical evidence against the null hypothesis
 


We will discuss some of the language involved here (``null hypothesis,'' ``alternative hypothesis,''
``p-value,'' etc.) after Spring Break.  For now, let us just close by interpreting the results here:
the computed p-value (about 0.013) is below the 0.05 level we specified, so there is fairly strong
evidence against the assumption that this sample came from a normal population.