
MATH 376 -- Probability and Statistics II
How can we tell if sampled data are coming from a normal population?

February 25, 2010

In deriving the small-sample formulas for confidence intervals for population means and differences of population means, recall that we had to make some rather restrictive additional hypotheses in order to apply the facts we know about t-distributions:

* we need to assume that the samples are drawn from normal populations
* for the difference of means case, we need to assume the two population variances are the same.

We have seen how to derive some evidence about whether the equality of population variances is reasonable, via our confidence intervals for the ratio $\sigma_1^2/\sigma_2^2$ of the population variances (derived using the F-distribution).

But this still leaves some questions:

1) Can we test whether it is reasonable to assume the samples are coming from a normal population?

2) How much of a difference does it make if they are not?

The answer to the second question is basically that it can make quite a bit of difference, depending on how much the distribution deviates from normality. The distribution of the statistic $T = (\bar{Y} - \mu)/(S/\sqrt{n})$ can be very different from a t-distribution if the underlying distribution is not normal. Trying to use the formulas we have derived in a case like that can yield misleading or even meaningless results!
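To get a rough feel for the size of this effect, here is a small simulation sketch in Python (it is not part of the Maple worksheet; the sample size 5, the exponential population, the seed, and the name `miss_rate` are all choices made here for illustration). It estimates how often $|T|$ exceeds the tabled value $t_{.025} = 2.776$ for 4 degrees of freedom. For a normal population the rate should be close to 0.05; for a strongly skewed population it need not be.

```python
import random
from math import sqrt

N = 5           # sample size (an arbitrary small value chosen for this sketch)
T_CRIT = 2.776  # tabled upper 0.025 point of the t-distribution with N - 1 = 4 d.f.

def miss_rate(draw, true_mean, reps=20000, seed=376):
    """Estimate P(|T| > T_CRIT), where T = (Ybar - mu) / (S / sqrt(N)).

    `draw` takes a random.Random instance and returns one observation
    from the population being studied.
    """
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(N)]
        ybar = sum(sample) / N
        s2 = sum((y - ybar) ** 2 for y in sample) / (N - 1)  # sample variance
        t = (ybar - true_mean) / sqrt(s2 / N)
        if abs(t) > T_CRIT:
            misses += 1
    return misses / reps

# Normal population with mean 0: the t-theory applies exactly.
normal_rate = miss_rate(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)
# Exponential population with mean 1: strongly skewed, so the theory does not apply.
skewed_rate = miss_rate(lambda rng: rng.expovariate(1.0), true_mean=1.0)

print("normal population:     ", normal_rate)
print("exponential population:", skewed_rate)
```

Comparing the two printed rates shows how far the nominal 95% confidence level can drift when the normality assumption fails.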

Thus, it becomes important to know that there are ways--some more intuitive, some more precise--to get some indication whether it is reasonable to make the assumption that a collection of sampled data has been drawn from a normal population.

The first way we will discuss is a graphical method called the Normal Probability Plot. The idea here is to compare the distribution of the sampled data values with the distribution we would expect from a normal population in a particular way -- we order the sample and compare those ordered values to the ``normal scores'' -- the expected values of the order statistics for a sample of the same size drawn from a normal population. Here are the necessary calculations.

The standard normal pdf and cdf first:

y -> exp(-y^2/2)/sqrt(2*Pi)   (1)

[Maple output (2), the standard normal cdf, is not preserved in this copy.]

The following computes the expected values for the order statistics of a sample of size n from a standard normal population, using our formulas for the pdfs of the order statistics:

[Maple outputs (3) and (4) are not preserved in this copy.]
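Since the Maple commands are not reproduced above, here is a minimal Python sketch of the same calculation (an illustration under stated assumptions, not the worksheet's own code; the function names, the integration range $[-10, 10]$, and the use of Simpson's rule are choices made here). It uses the pdf of the $k$-th order statistic of a sample of size $n$,

$g_k(y) = \dfrac{n!}{(k-1)!\,(n-k)!}\, F(y)^{k-1} (1 - F(y))^{n-k} f(y),$

where $f$ and $F$ are the standard normal pdf and cdf, and approximates the normal score $E(Y_{(k)}) = \int y\, g_k(y)\, dy$ numerically.

```python
from math import comb
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def order_stat_pdf(y, k, n):
    """pdf of the k-th smallest of n independent standard normal values."""
    F = STD_NORMAL.cdf(y)
    f = STD_NORMAL.pdf(y)
    # k * C(n, k) equals n! / ((k-1)! (n-k)!)
    return k * comb(n, k) * F ** (k - 1) * (1.0 - F) ** (n - k) * f

def normal_score(k, n, lo=-10.0, hi=10.0, steps=4000):
    """Approximate E[Y_(k)] by composite Simpson's rule (steps must be even)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        if i == 0 or i == steps:
            weight = 1
        elif i % 2 == 1:
            weight = 4
        else:
            weight = 2
        total += weight * y * order_stat_pdf(y, k, n)
    return total * h / 3.0

def normal_scores(n):
    """Normal scores for a sample of size n, smallest first."""
    return [normal_score(k, n) for k in range(1, n + 1)]

for k, score in enumerate(normal_scores(12), start=1):
    print(k, round(score, 4))
```

One quick sanity check: the scores should be symmetric about 0, and for $n = 2$ the larger score should agree with the known value $1/\sqrt{\pi}$.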

Let's apply this to the following collection of 12 net incomes from
a random sample of tax returns:

[Maple outputs (5) and (6) are not preserved in this copy.]

The normal probability plot is just the scatter plot of the points
with the income as x and the corresponding normal score as y:

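[Figure: normal probability plot of the 12 incomes (x-axis) against the corresponding normal scores (y-axis)]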
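The construction of the plotted points can be sketched in Python as follows. Two things here are assumptions rather than material from the lecture: the income values are made up (the original data are not reproduced in this text), and the normal scores are computed with Blom's approximation, the standard normal quantile at $(k - 3/8)/(n + 1/4)$, instead of the exact expected values of the order statistics.

```python
from statistics import NormalDist

def approx_normal_scores(n):
    """Blom's approximation to the expected standard normal order statistics."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]

def probability_plot_points(data):
    """Pair the k-th smallest data value with the k-th normal score."""
    xs = sorted(data)
    return list(zip(xs, approx_normal_scores(len(xs))))

# Made-up net incomes (thousands of dollars), for illustration only.
incomes = [23.5, 31.0, 18.2, 74.9, 27.6, 41.3, 20.8, 36.4, 152.0, 25.1, 58.7, 29.9]

points = probability_plot_points(incomes)
for x, score in points:
    print(f"{x:8.1f}  {score:7.3f}")
```

Feeding `points` to any scatter-plot routine gives the normal probability plot.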

What are we looking for? Well, if the data are coming from a normal population, then we would expect the scatter plot to be nearly a straight line. Some ``random up and down variation'' from that line would also be expected because of chance errors coming from the sampling process.

In this example, though, notice that there is an apparent systematic curvature in the scatter plot. If you see this kind of pattern, it is a strong indication that assuming normality of the underlying population is probably not justified!
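One rough way to put a number on ``nearly a straight line'' is the correlation coefficient of the plotted points. The Python sketch below is only an illustration: both data sets are artificial (one exactly linear in the normal scores, one bent by an exponential), the scores again use Blom's approximation rather than exact expected values, and the function name `correlation` is chosen here.

```python
from math import exp, sqrt
from statistics import NormalDist

def correlation(xs, ys):
    """Sample correlation coefficient of the paired values (xs[i], ys[i])."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

n = 12
# Approximate normal scores (Blom's approximation, an assumption of this sketch).
scores = [NormalDist().inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]

# Artificial "ideal" data: exactly linear in the scores, so the plot is a line.
straight = [50.0 + 10.0 * z for z in scores]
# Artificial skewed data: an exponential bend produces systematic curvature.
curved = [exp(z) for z in scores]

r_straight = correlation(straight, scores)
r_curved = correlation(curved, scores)
print("straight:", r_straight)
print("curved:  ", r_curved)
```

The exactly linear data give a correlation of 1 up to rounding, and the curved data give a noticeably smaller value. How small is ``too small'' depends on the sample size, which is part of what the formal tests make precise.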

For now, we will leave this discussion at this somewhat intuitive level. We will return to these questions later on, in our discussion of regression (finding a line that ``best fits a collection of data'') and the correlation coefficient that we discussed briefly last semester in connection with covariances.

For your general ``statistical literacy,'' it is also important to be aware that there are a number of other much more sophisticated statistical normality tests that are commonly used to get information in this situation. Maple has one very good one (called the Shapiro-Wilk W Test) as part of its Statistics package.

Shapiro and Wilk's W-Test for Normality
---------------------------------------
Null Hypothesis:
Sample drawn from a population that follows a normal distribution
Alt. Hypothesis:
Sample drawn from population that does not follow a normal distribution

Sample size:             12
Computed statistic:      0.817741
Computed pvalue:         0.0131943

Result: [Rejected]
There exists statistical evidence against the null hypothesis

We will discuss some of the language involved here (``null hypothesis,'' ``alternative hypothesis,'' ``p-value,'' etc.) after Spring Break. For now, let us just close by interpreting the results here: the small p-value (about 0.013) says that there is actually pretty strong evidence against the assumption that this sample came from a normal population!