
Mathematics 375 -- Probability and Statistics I

Descriptive Statistics -- Maple probability and statistics package demo

September 5, 2003

> read "/home/fac/little/public_html/ProbStat/MaplePackage/MSP.map";

The whole class height data, entered as a Maple list data structure.

The first 15 entries are the 15 men in the class; the next 20 are the

20 women.

> AllHeights:=evalf([68,73,69,68,72,69,71,73,68,72,78,70,70,75,71,67,68,62,64,62,60,66,66,68,69,68,68,68,65,64,72,64,64,64,66]);

Maple note: AllHeights is a Maple list with real number entries. The evalf

takes the integers on the input line and turns them into decimal numbers.

You could also enter all (or some) of the numbers with decimal points and

get the same results. If we didn't do that, Maple would try to express

the mean, standard deviation, etc. exactly (as fractions with square

roots, rather than in decimal form).
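
As a small illustration of that point, here is a sketch using only built-in Maple commands (add and nops, not the course package). The lists L and Lf are made up for this example and are not the class data.

```maple
# A short list of integers (illustrative values only)
L := [68, 73, 70]:

# With integer entries, Maple keeps the average as an exact fraction
add(x, x = L)/nops(L);

# Applying evalf to the list first gives a decimal (floating-point) average
Lf := evalf(L):
add(x, x = Lf)/nops(Lf);
```

The first average comes out as the exact fraction 211/3; the second is a decimal approximation to the same number.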

> nops(AllHeights);

The relative frequency histogram for the whole-class height data:

> Hist(AllHeights,60,79,20);

> Hbar:=Mean(AllHeights);

> s:=StandardDeviation(AllHeights);
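
For reference, both summary statistics can also be computed directly from their definitions with built-in Maple commands. This is only a sketch: the names n, m, SS, sdSample and sdPop are introduced here for illustration, and this worksheet does not say whether the package's StandardDeviation procedure divides by n - 1 or by n, so both versions are shown.

```maple
# The class height data, re-entered so this block runs on its own
AllHeights := evalf([68,73,69,68,72,69,71,73,68,72,78,70,70,75,71,67,68,62,64,62,60,66,66,68,69,68,68,68,65,64,72,64,64,64,66]):
n := nops(AllHeights):

# Mean: sum of the entries divided by the number of entries
m := add(x, x = AllHeights)/n;

# Sum of squared deviations from the mean
SS := add((x - m)^2, x = AllHeights):

# Standard deviation with divisor n - 1 (the usual "sample" version)
sdSample := sqrt(SS/(n - 1));

# Standard deviation with divisor n (the "population" version)
sdPop := sqrt(SS/n);
```

Comparing these two values with the output of StandardDeviation above shows which convention the package follows.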

Normal distributions

Many quantities such as heights in a population tend to be distributed ``normally,''

with a distribution described by what is called a normal probability density

function, which includes parameters $\mu$ = population mean, and

$\sigma$ = population standard deviation. We'll study the normal density functions

in great detail later this semester. For now, here is the normal probability density

with mean Hbar and standard deviation s (the values computed above) plotted

together with the relative frequency histogram of the heights:

> with(plots):

Warning, the name changecoords has been redefined

> HH:=Hist(AllHeights,60,79,20):

> DF:=plot(x->NormalPDF(Hbar,s,x),55..84,color=blue):

> display(HH,DF);

Maple note: This is the standard method for combining two or more plots

into a single plot. The display procedure is contained in the plots

package, which is loaded with the with command. Execute the separate

plotting commands one at a time and assign the values to names. Then

display the plots together using display.
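
The same pattern can be tried without the course package. The sketch below writes out the usual formula for a normal density with mean mu and standard deviation sigma; it is an assumption that NormalPDF evaluates this same formula. The parameter values and the names f, P1 and P2 are illustrative only.

```maple
with(plots):

# Illustrative parameter values (assumptions, not the class statistics)
mu := 68: sigma := 4:

# Normal probability density with mean mu and standard deviation sigma
f := x -> exp(-(x - mu)^2/(2*sigma^2))/(sigma*sqrt(2*Pi)):

# Two separate plots, each assigned to a name (the colons suppress output)
P1 := plot(f(x), x = 55..84, color = blue):
P2 := plot(f(x + 3), x = 55..84, color = red):

# Combine them into a single picture
display(P1, P2);
```

Here f(x + 3) is the same bell curve shifted 3 units to the left, so the combined picture shows two normal densities with the same spread and different means.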

The "empirical rule"

A "quick and dirty" approximation known as the "empirical rule" says that

for a quantity that is normally distributed with mean $\mu$ and standard

deviation $\sigma$, there is about a 68% chance that a randomly selected sample

will be within 1 standard deviation of the mean (i.e. between $\mu - \sigma$

and $\mu + \sigma$) and about a 95% chance that a randomly selected sample

will be within 2 standard deviations of the mean (i.e. between $\mu - 2\sigma$

and $\mu + 2\sigma$). How well does that fit our data?

> OneSD:=op(Frequencies(AllHeights,Hbar-s,Hbar+s,1));

> TwoSD:=op(Frequencies(AllHeights,Hbar-2*s,Hbar+2*s,1));

> evalf(OneSD/nops(AllHeights));

> evalf(TwoSD/nops(AllHeights));

Not too bad!! Even if a data set is not exactly normally distributed,

the empirical rule can often yield reasonable estimates.
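
The 68% and 95% figures are rounded values of normal probabilities. As a sketch using Maple's built-in error function erf (not part of the course package), the probability that a normally distributed quantity lies within k standard deviations of its mean is erf(k/sqrt(2)). The name PWithin is introduced here for illustration.

```maple
# Probability that a normal quantity is within k standard deviations of its mean
PWithin := k -> evalf(erf(k/sqrt(2))):

PWithin(1);   # approximately 0.6827
PWithin(2);   # approximately 0.9545
```

These are the values to compare with the two proportions computed from the class data.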

Maple note: The Frequencies procedure counts how many entries of a

list fall in each of n "bins" on any interval

[a,b]. The general format is Frequencies(list,a,b,n);

The output of this is a list with n entries. Here, I made n = 1,

so to get that one number alone (not as a list with one entry), I

used the built-in Maple function op. "op" of a list is the sequence

of entries in the list, separated by commas if there is more than

one.
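
A short sketch of what op does, together with one way to count the entries of a list in an interval using only the built-in select and nops commands. The list L is made up for this example. The count here includes both endpoints of the interval; whether Frequencies treats endpoints the same way is not stated in this worksheet.

```maple
# op turns a list into the sequence of its entries
op([12]);          # the single number 12, not the list [12]
op([3, 4, 5]);     # the sequence 3, 4, 5

# Count the entries of a list lying in the closed interval [64, 72]
L := [60., 64., 66., 68., 72., 78.]:
nops(select(x -> x >= 64 and x <= 72, L));   # 4 entries: 64., 66., 68., 72.
```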

A "look ahead" to topics we will study later in this semester and in the spring

Often, we will consider problems where a random sample, like the height

data for the whole class, is selected from a whole population (like the whole

US population, or all 19-22 year-olds, etc.). The mean of the data (that

is, our Hbar ) would be called the sample mean. Now, of course, the actual

population mean height and standard deviation are almost certainly not exactly

Hbar and s . However, under some assumptions, it

is possible to make predictions about the population mean height based

on the sample mean. For instance, using a technique appropriate for

relatively large samples, we can determine an interval about the

sample mean that contains the population mean with probability .95 (i.e.

intuitively, we have a 95% chance that the population mean is contained in

the interval) -- this is called a (large sample) 95% confidence interval:

> MeanLSCI(AllHeights,.05);

Maple note: The MeanLSCI procedure takes two inputs -- the data list,

and a number $\alpha$ (the .05 in the example). It returns the

two endpoints of the $100(1-\alpha)$% confidence interval

for the population mean.
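
The worksheet does not show how MeanLSCI is implemented. A common large-sample recipe, sketched below with built-in commands only, takes the sample mean plus or minus z times the sample standard deviation divided by sqrt(n), where z is the standard normal value that leaves probability alpha in the two tails combined. It is an assumption that MeanLSCI follows this recipe and uses the n - 1 divisor; the names alpha, n, m, sd and z are illustrative.

```maple
# The class height data, re-entered so this block runs on its own
AllHeights := evalf([68,73,69,68,72,69,71,73,68,72,78,70,70,75,71,67,68,62,64,62,60,66,66,68,69,68,68,68,65,64,72,64,64,64,66]):
alpha := 0.05:

# Sample size, sample mean, and sample standard deviation (divisor n - 1)
n := nops(AllHeights):
m := add(x, x = AllHeights)/n:
sd := sqrt(add((x - m)^2, x = AllHeights)/(n - 1)):

# Solve P(-z < Z < z) = 1 - alpha for a standard normal Z (z is about 1.96 here)
z := fsolve(erf(t/sqrt(2)) = 1 - alpha, t = 0..10):

# Endpoints of the interval; evalf forces sqrt(n) to a decimal
evalf([m - z*sd/sqrt(n), m + z*sd/sqrt(n)]);
```

If the assumptions above match the package, the two endpoints should agree with the output of MeanLSCI(AllHeights,.05).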

The same kind of thinking would also give a way to answer a question like:

How many heights would we need to measure (selected at random from the

population) in order to be 95% sure that the population mean and the sample

mean were within .5 inch, or .25 inch, etc., of each other?
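
One standard way to approach this question, assuming the large-sample normal approximation, is to look for the smallest n with z*sd/sqrt(n) no bigger than the desired margin, that is, n at least (z*sd/margin)^2. In the sketch below the values z = 1.96 and sd = 4.0 are illustrative assumptions (sd stands in for an estimate of the population standard deviation), and SampleSize is a name introduced for this example.

```maple
# Assumed inputs: z for the 95% level and an illustrative standard deviation
z := 1.96:
sd := 4.0:

# Smallest whole number n with z*sd/sqrt(n) <= margin
SampleSize := margin -> ceil((z*sd/margin)^2):

SampleSize(0.5);    # 246
SampleSize(0.25);   # 984
```

Because the margin enters the formula squared, cutting it in half roughly quadruples the number of measurements needed.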

Comparing subpopulations

Another very typical kind of question that statisticians deal with is this:

Within our sample, we have actually selected individuals from two

different subpopulations, the male and the female subpopulations.

From this data (and from experience), it seems "reasonable" that those

subpopulations have somewhat different mean heights. Does our data

support that hypothesis?

> MHeights:=AllHeights[1..15];

> WHeights:=AllHeights[16..35];

> Hist(MHeights,60,79,10);

> Hist(WHeights,60,79,10);

Question: Does this data support the hypothesis that the average male is taller than

the average female?

> MM:=Mean(MHeights);

> MW:=Mean(WHeights);

> SM:=StandardDeviation(MHeights);

> SW:=StandardDeviation(WHeights);

Using the same confidence interval method as above (which is at least somewhat

questionable now, since there are only 20 women and 15 men in the subsamples,

but we'll ignore that for the moment!), we have:

> MCI:=MeanLSCI(MHeights,.05);

> WCI:=MeanLSCI(WHeights,.05);

(There are other methods for deriving confidence intervals for the subpopulation

means that would be more reliable with relatively small samples. But the results are

almost the same in this case!)

What conclusion can we draw here?
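
One simple check that can inform the answer is whether the two confidence intervals overlap. The sketch below assumes each interval is available as a two-element list [lower, upper]; the worksheet does not show the exact form of the MeanLSCI output, and the numbers used here are illustrative, not the class results. The names A, B and Disjoint are introduced for this example.

```maple
# Two intervals given as [lower, upper] lists (illustrative numbers only)
A := [70.0, 72.5]:
B := [64.5, 67.0]:

# Two intervals are disjoint when one lies entirely above the other
Disjoint := (U, V) -> evalb(U[1] > V[2] or V[1] > U[2]):

Disjoint(A, B);   # true
```

If the two intervals do not overlap, that is consistent with the subpopulations having different means. If they do overlap, this simple check alone does not settle the question.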