Mathematics 375 -- Probability and Statistics I

Descriptive Statistics -- Maple probability and statistics package demo

September 2, 2005

>    read "/home/fac/little/public_html/ProbStat0506/MSP.map";

Warning, the name changecoords has been redefined

49399506

The whole class height data, entered as a Maple list data structure.

The first 6 entries are the 6 men in the class; the next 22 are the

22 women.

>    AllHeights:=evalf([65,68,68,71,74,74,58,62,62,62,63,63,63,64,64,65,65,65,65,65,66,66,66,66,66,68,68,69]);

AllHeights := [65., 68., 68., 71., 74., 74., 58., 62., 62., 62., 63., 63., 63., 64., 64., 65., 65., 65., 65., 65., 66., 66., 66., 66., 66., 68., 68., 69.]

Maple note:   AllHeights is a Maple list with real number entries.   The evalf

command converts the integers on the input line to decimal (floating-point)

numbers.  You could also enter all (or some) of the numbers with decimal points

and get the same results.  Without evalf or decimal points, Maple would try to

express the mean, standard deviation, etc. exactly (as fractions with square

roots, rather than in decimal form).

>    nops(AllHeights);

28

The relative frequency histogram for the whole-class height data:

>    Hist(AllHeights,58,74,8);

[Maple Plot]

>    Hbar:=Mean(AllHeights);

Hbar := 65.75000000

>    s:=StandardDeviation(AllHeights);

s := 3.492054473
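
Maple note:   As a cross-check, the mean and standard deviation can also be
computed with base Maple commands only.  The sketch below assumes that
StandardDeviation uses the sample formula with divisor n-1, which is
consistent with the value printed above.  The names HbarCheck and sCheck
are introduced here for illustration and are not part of the course package.

```maple
# Base-Maple check of Mean and StandardDeviation (assumes the n-1 divisor).
n := nops(AllHeights):
# Sample mean: the sum of the entries divided by the number of entries.
HbarCheck := add(AllHeights[i], i = 1..n)/n;
# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n-1.
sCheck := sqrt(add((AllHeights[i] - HbarCheck)^2, i = 1..n)/(n - 1));
```

If the assumption is right, the two results should agree with Hbar and s.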

Normal distributions

Many quantities such as heights in a population tend to be distributed ``normally,''  

with a distribution described by what is called a normal probability density

function, which includes the parameters mu  = population mean, and

sigma  = population standard deviation.  We'll study the normal density functions

in great detail later this semester.   For now, here is the normal probability density

with mean Hbar  = 65.75 and standard deviation s  = 3.4920... plotted

together with the relative frequency histogram of the heights:

>    with(plots):

>    HH:=NormHist(AllHeights,58,74,8):

>    DF:=plot(x->NormalPDF(Hbar,s,x),55..84,color=blue):

>    display(HH,DF);

[Maple Plot]

Maple note:   This is the standard method for combining two or more plots

into a single plot.  The display procedure is contained in the   plots

package, which is loaded with the   with   command.  Execute the separate

plotting commands one at a time and assign the values to names.  Then

display the plots together using   display.
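
Maple note:   For reference, the normal density with mean mu and standard
deviation sigma is exp(-(x-mu)^2/(2*sigma^2))/(sigma*sqrt(2*Pi)).  The sketch
below writes that formula out in base Maple, on the assumption that
NormalPDF(Hbar,s,x) computes the same thing.  The name f is hypothetical.

```maple
# Normal density with mean Hbar and standard deviation s, written out.
# Assumption: this is the formula that NormalPDF(Hbar, s, x) evaluates.
f := x -> exp(-(x - Hbar)^2/(2*s^2))/(s*sqrt(2*Pi)):
# Height of the curve at its peak, which occurs at x = Hbar.
evalf(f(Hbar));
# Total area under the curve; for a probability density this should be 1.
evalf(Int(f(x), x = -infinity..infinity));
```

Plotting f on the same range in place of the NormalPDF expression should
give the same blue curve if the assumption holds.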

The "empirical rule"

A "quick and dirty" approximation known as the "empirical rule" says that

for a quantity that is normally distributed (or perhaps just approximately

normal) with mean   mu   and standard deviation   sigma   there is about a 68% chance

that a randomly selected value will be within 1 standard deviation of

the mean (i.e. between   mu-sigma   and   mu+sigma) and about a 95% chance that a

randomly selected value will be within 2 standard deviations of the

mean (i.e. between   mu-2*sigma   and   mu+2*sigma).   How well does that fit our data?

>    OneSD:=op(Frequencies(AllHeights,Hbar-s,Hbar+s,1));

OneSD := 21

>    TwoSD:=op(Frequencies(AllHeights,Hbar-2*s,Hbar+2*s,1));

TwoSD := 25

>    evalf(OneSD/nops(AllHeights));

>    evalf(TwoSD/nops(AllHeights));

.7500000000

.8928571429

We'll see later that N = 28 is a relatively small sample size for these

purposes.  Even so, 21/28 = .75 is reasonably close to .68, and 25/28 (about .89)

isn't too far off from .95.
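
Maple note:   The 68% and 95% figures in the empirical rule are rounded
values of exact normal probabilities.  For a normal quantity, the chance of
falling within k standard deviations of the mean is erf(k/sqrt(2)), where
erf is Maple's built-in error function.  A minimal sketch:

```maple
# Probability that a normal quantity lies within k standard deviations
# of its mean, for k = 1 and k = 2.
evalf(erf(1/sqrt(2)));    # about .6827
evalf(erf(2/sqrt(2)));    # about .9545
```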

Maple note:   The Frequencies procedure counts how many entries of a

list fall in each of   n   "bins" on any interval

[a,b] .   The general format is   Frequencies(list,a,b,n);  

The output of this is a list with   n   entries.  Here, I made   n = 1,

so to get that one number alone (not as a list with one entry), I

used the Maple   op   built-in function.  "op" of a list is the sequence

of entries in the list, separated by commas if there is more than

one.
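
Maple note:   The same counts can be obtained without the course package by
selecting the entries that lie in the interval and counting them with nops.
This is only a sketch; WithinSD is a hypothetical helper name, and it
assumes Hbar, s, and AllHeights are defined as above.

```maple
# Count the heights within k standard deviations of the mean.
# select keeps the entries for which the test is true; nops counts them.
WithinSD := k -> nops(select(x -> x >= Hbar - k*s and x <= Hbar + k*s, AllHeights)):
WithinSD(1);
WithinSD(2);
```

Since no height in the list falls exactly on an interval endpoint, these
two counts should agree with OneSD and TwoSD.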

A "look ahead" to topics we will study later in this semester and in the spring

Often, we will consider problems where a random sample, like the height

data for the whole class, is selected from a whole population (like the whole

US population, or all 19-22 year-olds, etc.).  The mean of the data (that

is, our Hbar )  would be called the sample mean.   Now, of course, the actual

population mean height and standard deviation are almost certainly not exactly

Hbar  and s .    However, under some assumptions,  it

is possible to make predictions about the population mean height based

on the sample mean.  For instance, using a technique appropriate for

relatively large  samples, we can determine an interval about the

sample mean that contains the population mean with probability  .95  (i.e.

if we sampled 100 times, the computed interval would contain the

population mean about 95 times) -- this is called a (large sample) 95%

confidence interval:

>    MeanLSCI(AllHeights,.05);

64.45654869, 67.04345131

Maple note:   The MeanLSCI procedure takes two inputs -- the data list,

and a number alpha   (the .05 in the example).  It returns the

two endpoints of the   (1-alpha)*100 %  confidence interval

for the population mean.
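
Maple note:   The usual large-sample interval for a population mean is
Hbar plus or minus z*s/sqrt(n), where z is the 97.5th percentile of the
standard normal distribution when alpha = .05.  The sketch below assumes
that MeanLSCI uses this formula (the package source is not shown here).
The value of z is typed in by hand, and halfwidth is a hypothetical name.

```maple
# Large-sample 95% confidence interval written out in base Maple.
# Assumption: MeanLSCI computes Hbar +/- z*s/sqrt(n).
z := 1.959964:                  # 97.5th percentile of the standard normal
n := nops(AllHeights):
halfwidth := z*s/sqrt(evalf(n)):
Hbar - halfwidth, Hbar + halfwidth;
```

If the assumption is right, the two endpoints should come out close to the
ones printed by MeanLSCI above.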

The same kind of thinking would also give a way to answer a question like:

How many  heights  would we need to measure (selected at random from the

population) in order to be 95% sure that the population mean and the sample

mean were within  .5 inch,  or .25 inch, etc., of each other?
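
Maple note:   Under the large-sample normal approximation, the sample mean is
within a distance err of the population mean with probability about .95
when 1.96*sigma/sqrt(n) is at most err, that is, when n is at least
(1.96*sigma/err)^2.  The sketch below uses our s as a stand-in for the
unknown sigma, which is an assumption.  SampleSize and err are hypothetical
names.

```maple
# Smallest n with z*s/sqrt(n) <= err, using s in place of sigma.
z := 1.959964:                  # 97.5th percentile of the standard normal
SampleSize := err -> ceil((z*s/err)^2):
SampleSize(.5);                 # 188 with s = 3.492054473
SampleSize(.25);                # 750 with s = 3.492054473
```

Halving the allowed error multiplies the required sample size by about 4.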

Comparing subpopulations

Another very typical kind of question that statisticians deal with is this:

Within our sample, we have actually selected individuals from two

different subpopulations, the male and the female subpopulations.

From this data (and from experience), it seems "reasonable" that those

subpopulations have somewhat different mean heights.  Does our data

support that hypothesis?

>    MHeights:=AllHeights[1..6];

MHeights := [65., 68., 68., 71., 74., 74.]

>    WHeights:=AllHeights[7..28];

WHeights := [58., 62., 62., 62., 63., 63., 63., 64., 64., 65., 65., 65., 65., 65., 66., 66., 66., 66., 66., 68., 68., 69.]

>    NormHist(MHeights,65,74,4);

[Maple Plot]

>    NormHist(WHeights,58,69,5);

[Maple Plot]

Question:  Does this data support the hypothesis that the average male is taller than

the average female?  

>    MM:=Mean(MHeights);

MM := 70.00000000

>    MW:=Mean(WHeights);

MW := 64.59090909

>    SM:=StandardDeviation(MHeights);

SM := 3.633180425

>    SW:=StandardDeviation(WHeights);

SW := 2.442853425

Using the same confidence interval method as above (which is at least somewhat

questionable now, since there are only 22 women in the subsample, but we'll

ignore that for the moment!), we have:

>    WCI:=MeanLSCI(WHeights,.05);

WCI := 63.57012437, 65.61169381

(There are other methods for deriving confidence intervals for the subpopulation

means that would be more reliable with relatively small samples.)
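
Maple note:   One possible next step is to apply the same procedure to the
men's heights and to look at the observed difference between the two sample
means.  With only 6 men, the large-sample method is even more questionable
than it is for the women, so treat this as a rough sketch only.  MCI is a
hypothetical name.

```maple
# Same large-sample method applied to the men's heights (only 6 values).
MCI := MeanLSCI(MHeights, .05);
# Observed difference between the men's and women's sample means.
MM - MW;
```

Comparing MCI with WCI shows whether the two intervals overlap.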

What conclusion can we draw here?