Mathematics 375 -- Probability and Statistics I
Descriptive Statistics -- Maple probability and statistics package demo
September 5, 2003
> | read "/home/fac/little/public_html/ProbStat/MaplePackage/MSP.map"; |
The whole class height data, entered as a Maple list data structure.
The first 15 entries are the 15 men in the class; the next 20 are the
20 women.
> | AllHeights:=evalf([68,73,69,68,72,69,71,73,68,72,78,70,70,75,71,67,68,62,64,62,60,66,66,68,69,68,68,68,65,64,72,64,64,64,66]); |
Maple note: AllHeights is a Maple list with real number entries. The evalf
takes the integers on the input line and turns them into decimal numbers.
You could also enter all (or some) of the numbers with decimal points and
get the same results. Without evalf or decimal points, Maple would try to express
the mean, standard deviation, etc. exactly (as fractions with square
roots, rather than in decimal form).
> | nops(AllHeights); |
The relative frequency histogram for the whole-class height data:
> | Hist(AllHeights,60,79,20); |
> | Hbar:=Mean(AllHeights); |
> | s:=StandardDeviation(AllHeights); |
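Python aside: for readers without Maple, the two summary statistics can be
sketched in plain Python. This is only an illustration, not the MSP package.
It assumes that StandardDeviation computes the sample standard deviation
(divisor n - 1); the handout does not say which divisor is used. The names
all_heights, hbar, and s are chosen here for illustration.

```python
import math

# Height data copied from the AllHeights list above (15 men, then 20 women).
all_heights = [68, 73, 69, 68, 72, 69, 71, 73, 68, 72, 78, 70, 70, 75, 71,
               67, 68, 62, 64, 62, 60, 66, 66, 68, 69, 68, 68, 68, 65, 64,
               72, 64, 64, 64, 66]

n = len(all_heights)            # plays the role of nops(AllHeights)
hbar = sum(all_heights) / n     # sample mean; "/" already gives a float

# Sample standard deviation with divisor n - 1.
# ASSUMPTION: the MSP procedure may use divisor n instead.
s = math.sqrt(sum((h - hbar) ** 2 for h in all_heights) / (n - 1))

print("n =", n)
print("mean =", hbar)
print("standard deviation =", s)
```

Python floats are already decimal approximations, so nothing like evalf is
needed here.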
Normal distributions
Many quantities, such as heights in a population, tend to be distributed "normally,"
with a distribution described by what is called a normal probability density
function,
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)},$$
which includes the parameters
$\mu$ = population mean, and
$\sigma$ = population standard deviation. We'll study the normal density functions
in great detail later this semester. For now, here is the normal probability density
with mean Hbar = 68.71794... and standard deviation s = 4.07139 ... plotted
together with the relative frequency histogram of the heights:
> | with(plots): |
Warning, the name changecoords has been redefined
> | HH:=Hist(AllHeights,60,79,20): |
> | DF:=plot(x->NormalPDF(Hbar,s,x),55..84,color=blue): |
> | display(HH,DF); |
Maple note: This is the standard method for combining two or more plots
into a single plot. The display procedure is contained in the plots
package, which is loaded with the with command. Execute the separate
plotting commands one at a time and assign the values to names. Then
display the plots together using display.
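Python aside: the curve being plotted can be evaluated directly from the
formula for the normal density. The sketch below is an illustration only.
The function name normal_pdf and its argument order (mean, standard
deviation, x) are chosen to mirror the call NormalPDF(Hbar,s,x) above, and
the parameter values are made up for the example.

```python
import math

def normal_pdf(mu, sigma, x):
    """Normal density with mean mu and standard deviation sigma, at x.

    Formula: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi)).
    """
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Illustrative parameter values only (NOT computed from the class data).
mu, sigma = 68.0, 4.0

# Tabulate the density at a few heights across the plotting range.
for x in range(56, 84, 4):
    print(x, normal_pdf(mu, sigma, x))

# Crude check that the total area under the curve is close to 1
# (Riemann sum over mu - 8 sigma .. mu + 8 sigma).
step = 0.01
steps = int(16 * sigma / step)
area = sum(normal_pdf(mu, sigma, mu - 8 * sigma + k * step) * step
           for k in range(steps))
print("approximate area:", area)
```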
The "empirical rule"
A "quick and dirty" approximation known as the "empirical rule" says that
for a quantity that is normally distributed with mean $\mu$ and standard
deviation $\sigma$, there is a 68% chance that a randomly selected sample
will be within 1 standard deviation of the mean (i.e. between
$\mu - \sigma$ and $\mu + \sigma$) and a 95% chance that a randomly selected sample
will be within 2 standard deviations of the mean (i.e. between
$\mu - 2\sigma$ and $\mu + 2\sigma$). How well does that fit our data?
> | OneSD:=op(Frequencies(AllHeights,Hbar-s,Hbar+s,1)); |
> | TwoSD:=op(Frequencies(AllHeights,Hbar-2*s,Hbar+2*s,1)); |
> | evalf(OneSD/nops(AllHeights)); |
> | evalf(TwoSD/nops(AllHeights)); |
Not too bad!! Even if a data set is not exactly normally distributed,
the empirical rule can often yield reasonable estimates.
Maple note: The Frequencies procedure counts how many entries of a list
fall in each of n "bins" on an interval
[a,b] . The general format is Frequencies(list,a,b,n);
The output of this is a list with n entries. Here, I made n = 1,
so to get that one number alone (not as a list with one entry), I
used the Maple op built-in function. "op" of a list is the sequence
of entries in the list, separated by commas if there is more than
one.
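Python aside: the counting step can be sketched as follows. This is an
illustration, not the MSP source. The function name frequencies, the
treatment of values exactly at the endpoints, and the use of the divisor
n - 1 for the standard deviation are all assumptions made here.

```python
import math

# Height data copied from the AllHeights list above.
all_heights = [68, 73, 69, 68, 72, 69, 71, 73, 68, 72, 78, 70, 70, 75, 71,
               67, 68, 62, 64, 62, 60, 66, 66, 68, 69, 68, 68, 68, 65, 64,
               72, 64, 64, 64, 66]

def frequencies(data, a, b, n):
    """Count the entries of data in each of n equal-width bins on [a, b]."""
    counts = [0] * n
    width = (b - a) / n
    for x in data:
        if a <= x <= b:
            # ASSUMPTION: a value equal to b is counted in the last bin.
            k = min(int((x - a) / width), n - 1)
            counts[k] += 1
    return counts

n_total = len(all_heights)
hbar = sum(all_heights) / n_total
s = math.sqrt(sum((h - hbar) ** 2 for h in all_heights) / (n_total - 1))

# With one bin the result is a one-entry list; unpacking it plays the
# role of Maple's op here.
(one_sd,) = frequencies(all_heights, hbar - s, hbar + s, 1)
(two_sd,) = frequencies(all_heights, hbar - 2 * s, hbar + 2 * s, 1)

print("fraction within 1 SD:", one_sd / n_total)
print("fraction within 2 SD:", two_sd / n_total)
```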
A "look ahead" to topics we will study later in this semester and in the spring
Often, we will consider problems where a random sample, like the height
data for the whole class, is selected from a whole population (like the whole
US population, or all 19-22 year-olds, etc.). The mean of the data (that
is, our Hbar ) would be called the sample mean. Now, of course, the actual
population mean height and standard deviation are almost certainly not exactly
Hbar and s . However, under some assumptions, it
is possible to make predictions about the population mean height based
on the sample mean. For instance, using a technique appropriate for
relatively large samples, we can determine an interval about the
sample mean that contains the population mean with probability .95 (i.e.
intuitively, we have a 95% chance that the population mean is contained in
the interval) -- this is called a (large sample) 95% confidence interval:
> | MeanLSCI(AllHeights,.05); |
Maple note: The MeanLSCI procedure takes two inputs -- the data list,
and a number $\alpha$ (the .05 in the example). It returns the
two endpoints of the $100(1-\alpha)$% confidence interval
for the population mean.
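Python aside: a minimal sketch of a large-sample confidence interval,
assuming the usual formula (sample mean) +/- z * s / sqrt(n), where z is the
standard normal quantile at 1 - alpha/2 and s is the sample standard
deviation. The handout does not show how MeanLSCI is implemented, so the
formula and the name mean_ls_ci are assumptions. NormalDist requires
Python 3.8 or later.

```python
import math
from statistics import NormalDist

# Height data copied from the AllHeights list above.
all_heights = [68, 73, 69, 68, 72, 69, 71, 73, 68, 72, 78, 70, 70, 75, 71,
               67, 68, 62, 64, 62, 60, 66, 66, 68, 69, 68, 68, 68, 65, 64,
               72, 64, 64, 64, 66]

def mean_ls_ci(data, alpha):
    """Large-sample 100(1 - alpha)% confidence interval for the mean.

    ASSUMPTION: uses xbar +/- z * s / sqrt(n) with divisor n - 1 in s;
    the MSP procedure MeanLSCI may differ in detail.
    """
    n = len(data)
    xbar = sum(data) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    half_width = z * s / math.sqrt(n)
    return (xbar - half_width, xbar + half_width)

low, high = mean_ls_ci(all_heights, 0.05)
print("95% interval:", low, high)
```

A smaller alpha gives a wider interval, since more confidence requires a
larger z.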
The same kind of thinking would also give a way to answer a question like:
How many heights would we need to measure (selected at random from the
population) in order to be 95% sure that the population mean and the sample
mean are within .5 inch, or .25 inch, etc., of each other?
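Python aside: under the same large-sample reasoning, requiring
z * s / sqrt(n) <= E for a margin E gives n >= (z * s / E)^2. The sketch
below assumes a guess for the standard deviation is available in advance;
the value 4.0 is illustrative only, and the name sample_size is
hypothetical.

```python
import math
from statistics import NormalDist

def sample_size(s, margin, alpha=0.05):
    """Smallest n with z * s / sqrt(n) <= margin, for a guessed s.

    ASSUMPTION: large-sample normal approximation, with s known in advance.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * s / margin) ** 2)

s_guess = 4.0   # illustrative guess for the standard deviation, in inches
for margin in (0.5, 0.25):
    print("margin", margin, "inch needs n >=", sample_size(s_guess, margin))
```

Halving the margin roughly quadruples the required sample size, because n
grows like 1 / E^2.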
Comparing subpopulations
Another very typical kind of question that statisticians deal with is this:
Within our sample, we have actually selected individuals from two
different subpopulations, the male and the female subpopulations.
From this data (and from experience), it seems "reasonable" that those
subpopulations have somewhat different mean heights. Does our data
support that hypothesis?
> | MHeights:=AllHeights[1..15]; |
> | WHeights:=AllHeights[16..35]; |
> | Hist(MHeights,60,79,10); |
> | Hist(WHeights,60,79,10); |
Question: Does this data support the hypothesis that the average male is taller than
the average female?
> | MM:=Mean(MHeights); |
> | MW:=Mean(WHeights); |
> | SM:=StandardDeviation(MHeights); |
> | SW:=StandardDeviation(WHeights); |
Using the same confidence interval method as above (which is at least somewhat
questionable now, since there are only 20 women and 15 men in the subsamples,
but we'll ignore that for the moment!), we have:
> | MCI:=MeanLSCI(MHeights,.05); |
> | WCI:=MeanLSCI(WHeights,.05); |
(There are other methods for deriving confidence intervals for the subpopulation
means that would be more reliable with relatively small samples. But the results are
almost the same in this case!)
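Python aside: the subsample computation can be sketched as follows, again
assuming the large-sample formula (sample mean) +/- z * s / sqrt(n) rather
than whatever MeanLSCI does internally. The names m_heights, w_heights, and
ls_ci are chosen here for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

# Height data copied from the AllHeights list above.
all_heights = [68, 73, 69, 68, 72, 69, 71, 73, 68, 72, 78, 70, 70, 75, 71,
               67, 68, 62, 64, 62, 60, 66, 66, 68, 69, 68, 68, 68, 65, 64,
               72, 64, 64, 64, 66]

# Maple's AllHeights[1..15] and AllHeights[16..35] are 1-based and
# inclusive; Python slices are 0-based and exclude the right endpoint.
m_heights = all_heights[:15]
w_heights = all_heights[15:]

def ls_ci(data, alpha):
    """ASSUMED large-sample interval: mean +/- z * stdev / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * stdev(data) / math.sqrt(len(data))
    return (mean(data) - half_width, mean(data) + half_width)

mci = ls_ci(m_heights, 0.05)
wci = ls_ci(w_heights, 0.05)
print("men:  ", mci)
print("women:", wci)

# Whether the two intervals overlap is one informal way to compare groups.
print("intervals overlap:", mci[0] <= wci[1] and wci[0] <= mci[1])
```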
What conclusion can we draw here?