Mathematics 375 -- Probability and Statistics I
Descriptive Statistics -- Maple probability and statistics package demo
September 2, 2005
> | read "/home/fac/little/public_html/ProbStat0506/MSP.map"; |
Warning, the name changecoords has been redefined
The whole class height data, entered as a Maple list data structure.
The first 6 entries are the 6 men in the class; the next 22 are the
22 women.
> | AllHeights:=evalf([65,68,68,71,74,74,58,62,62,62,63,63,63,64,64,65,65,65,65,65,66,66,66,66,66,68,68,69]); |
Maple note: AllHeights is a Maple list with real number entries. The evalf
takes the integers on the input line and turns them into decimal numbers.
You could also enter all (or some) of the numbers with decimal points and
get the same results. If we didn't do that, Maple would try to express
the mean, standard deviation, etc. exactly (as fractions with square
roots, rather than in decimal form).
> | nops(AllHeights); |
The relative frequency histogram for the whole-class height data:
> | Hist(AllHeights,58,74,8); |
> | Hbar:=Mean(AllHeights); |
> | s:=StandardDeviation(AllHeights); |
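Maple note: To see what goes into these two numbers, here is a sketch of
the same computations using only built-in Maple commands. The names N, xbar,
and sdev are made up for this illustration. The n - 1 divisor in the standard
deviation is an assumption; the StandardDeviation procedure in the course
package might divide by n instead.

```maple
# Sample mean and standard deviation by hand (built-in Maple commands only).
# Assumption: the n - 1 divisor; the package's StandardDeviation may differ.
N := nops(AllHeights):                                   # number of data values
xbar := add(h, h = AllHeights)/N;                        # sum of entries over N
sdev := sqrt(add((h - xbar)^2, h = AllHeights)/(N - 1)); # root of average squared deviation
```

If the package does use the n - 1 divisor, xbar and sdev should agree with
Hbar and s.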
Normal distributions
Many quantities, such as heights in a population, tend to be distributed "normally,"
with a distribution described by what is called a normal probability density
function, which includes two parameters: μ = population mean, and
σ = population standard deviation. We'll study the normal density functions
in great detail later this semester. For now, here is the normal probability density
with mean Hbar = 65.75 and standard deviation s (about 3.49 if the n - 1 divisor
is used), plotted together with the relative frequency histogram of the heights:
> | with(plots): |
> | HH:=NormHist(AllHeights,58,74,8): |
> | DF:=plot(x->NormalPDF(Hbar,s,x),55..84,color=blue): |
> | display(HH,DF); |
Maple note: This is the standard method for combining two or more plots
into a single plot. The display procedure is contained in the plots
package, which is loaded with the with command. Execute the separate
plotting commands one at a time and assign the values to names. Then
display the plots together using display.
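Maple note: The normal density with mean μ and standard deviation σ has the
explicit formula f(x) = exp(-(x - μ)^2/(2σ^2))/(σ sqrt(2π)), and a sketch of
it can be typed in directly. The name f below is made up for this
illustration, and it is only an assumption that NormalPDF(Hbar,s,x) from the
course package computes the same function.

```maple
# The normal density with mean mu and standard deviation sigma, written out.
# Assumption: NormalPDF(mu, sigma, x) in the course package matches this formula.
f := (mu, sigma, x) -> exp(-(x - mu)^2/(2*sigma^2))/(sigma*sqrt(2*Pi)):

# Plot it over the same range used for DF.
plot(x -> f(Hbar, s, x), 55..84, color = blue);

# Sanity check: the total area under a density curve should be 1 (up to rounding).
evalf(Int(f(Hbar, s, x), x = -infinity..infinity));
```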
The "empirical rule"
A "quick and dirty" approximation known as the "empirical rule" says that
for a quantity that is normally distributed (or perhaps just approximately
normal) with mean μ and standard deviation σ, there is a 68% chance
that a randomly selected sample will be within 1 standard deviation of
the mean (i.e., between μ - σ and μ + σ) and a 95% chance that a
randomly selected sample will be within 2 standard deviations of the
mean (i.e., between μ - 2σ and μ + 2σ). How well does that fit our data?
> | OneSD:=op(Frequencies(AllHeights,Hbar-s,Hbar+s,1)); |
> | TwoSD:=op(Frequencies(AllHeights,Hbar-2*s,Hbar+2*s,1)); |
> | evalf(OneSD/nops(AllHeights)); |
> | evalf(TwoSD/nops(AllHeights)); |
We'll see later that N = 28 is a relatively small sample size for these
purposes. Even so, .75 is relatively close to .68, and .89 isn't too far off from
.95.
Maple note: The Frequencies procedure counts how many entries of a
list fall into each of n "bins" on any interval
[a,b]. The general format is Frequencies(list,a,b,n);
The output of this is a list with n entries. Here, I made n = 1,
so to get that one number alone (not as a list with one entry), I
used Maple's built-in op function. "op" of a list is the sequence
of entries in the list, separated by commas if there is more than
one.
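Maple note: The same count can be sketched with built-in commands only,
which may help in seeing what Frequencies and op are doing. The names
inOneSD and L are made up for this illustration, and including both
endpoints of the interval is an assumption; Frequencies may treat the bin
edges differently.

```maple
# Keep only the heights within one standard deviation of the mean.
# Assumption: both endpoints are included in the interval.
inOneSD := select(h -> (h >= Hbar - s and h <= Hbar + s), AllHeights):
nops(inOneSD);                            # how many heights were kept
evalf(nops(inOneSD)/nops(AllHeights));    # the same count as a fraction of the class

# What op does to a list with a single entry:
L := [21]:
op(L);                                    # the entry itself, without the list brackets
```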
A "look ahead" to topics we will study later in this semester and in the spring
Often, we will consider problems where a random sample, like the height
data for the whole class, is selected from a whole population (like the whole
US population, or all 19-22 year-olds, etc.). The mean of the data (that
is, our Hbar ) would be called the sample mean. Now, of course, the actual
population mean height and standard deviation are almost certainly not exactly
Hbar and s . However, under some assumptions, it
is possible to make predictions about the population mean height based
on the sample mean. For instance, using a technique appropriate for
relatively large samples, we can determine an interval about the
sample mean that contains the population mean with probability .95 (i.e.
if we sampled 100 times, the computed interval would contain the
population mean about 95 times) -- this is called a (large sample) 95%
confidence interval:
> | MeanLSCI(AllHeights,.05); |
Maple note: The MeanLSCI procedure takes two inputs -- the data list,
and a number α (the .05 in the example). It returns the
two endpoints of the 100(1 - α)% confidence interval
for the population mean.
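Maple note: One common textbook form of a large-sample interval is the
sample mean plus or minus z times s/sqrt(n), where z = 1.96 for a 95%
interval. Here is a sketch of that computation by hand. The names z, N, and
halfwidth are made up for this illustration, and it is only an assumption
that MeanLSCI uses this formula.

```maple
# Large-sample 95% confidence interval for the mean, written out by hand.
# Assumptions: the formula Hbar +/- z*s/sqrt(N), with z = 1.96 for alpha = .05;
# MeanLSCI itself may be implemented differently.
z := 1.96:
N := nops(AllHeights):
halfwidth := evalf(z*s/sqrt(N)):          # evalf forces a decimal value for sqrt(N)
[Hbar - halfwidth, Hbar + halfwidth];     # lower and upper endpoints
```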
The same kind of thinking would also give a way to answer a question like:
How many heights would we need to measure (selected at random from the
population) in order to be 95% sure that the sample mean is within
.5 inch (or .25 inch, etc.) of the population mean?
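Maple note: If the interval has the form sample mean plus or minus
z*s/sqrt(n), then asking for the half-width z*s/sqrt(n) to be at most B and
solving for n gives n >= (z*s/B)^2. Here is a rough sketch of that
calculation. The names z and needed are made up for this illustration, and
using our class's s in place of the unknown population standard deviation is
an assumption.

```maple
# Rough sample size so that a large-sample 95% interval has half-width at most B.
# Assumptions: z = 1.96, and s from our class stands in for the population
# standard deviation.
z := 1.96:
needed := B -> ceil((z*s/B)^2):   # smallest whole number n with z*s/sqrt(n) <= B
needed(.5);                       # to be within .5 inch
needed(.25);                      # to be within .25 inch
```

Notice that cutting B in half multiplies the required sample size by about 4.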
Comparing subpopulations
Another very typical kind of question that statisticians deal with is this:
Within our sample, we have actually selected individuals from two
different subpopulations, the male and the female subpopulations.
From this data (and from experience), it seems "reasonable" that those
subpopulations have somewhat different mean heights. Does our data
support that hypothesis?
> | MHeights:=AllHeights[1..6]; |
> | WHeights:=AllHeights[7..28]; |
> | NormHist(MHeights,65,74,4); |
> | NormHist(WHeights,58,69,5); |
Question: Does this data support the hypothesis that the average male is taller than
the average female?
> | MM:=Mean(MHeights); |
> | MW:=Mean(WHeights); |
> | SM:=StandardDeviation(MHeights); |
> | SW:=StandardDeviation(WHeights); |
Using the same confidence interval method as above (which is at least somewhat
questionable now, since there are only 22 women in the subsample, but we'll
ignore that for the moment!), we have:
> | WCI:=MeanLSCI(WHeights,.05); |
(There are other methods for deriving confidence intervals for the subpopulation
means that would be more reliable with relatively small samples. )
What conclusion can we draw here?