MATH 376 -- Probability and Statistics 2
Multiple Regression and Hypothesis testing
April 12, 2006
The following comes from problem 11.79 in our text.
We have two strains of bacteria and want to compare
their growth rates. If we do simple regression with
each data set as follows:
> | restart; |
> | read "/home/fac/little/public_html/ProbStat0506/MSP.map"; |
Warning, the name changecoords has been redefined
> | times:=[-2,-1,0,1,2]; |
> | A:=[8.0,9.0,9.1,10.2,10.4]; |
> | B:=[10.0,10.3,12.2,12.6,13.9]; |
> | Tbar:=Mean(times); |
> | Abar:=Mean(A); |
> | Bbar:=Mean(B); |
> | SXYA:=add((times[i]-Tbar)*(A[i]-Abar),i=1..5); |
> | SXXA:=add((times[i]-Tbar)^2,i=1..5); |
> | Ahatbeta[1]:=SXYA/SXXA; |
> | Ahatbeta[0]:=Abar-Ahatbeta[1]*Tbar; |
> | SXYB:=add((times[i]-Tbar)*(B[i]-Bbar),i=1..5); |
> | SXXB:=SXXA; |
> | Bhatbeta[1]:=SXYB/SXXB; |
> | Bhatbeta[0]:=Bbar-Bhatbeta[1]*Tbar; |
> | APoints:=[seq([times[i],A[i]],i=1..5)]; |
> | BPoints:=[seq([times[i],B[i]],i=1..5)]; |
> | with(plots): |
> | AP:=plot(APoints,style=point,symbol=circle,color=blue): |
> | BP:=plot(BPoints,style=point,symbol=circle,color=red): |
> | AL:=plot(Ahatbeta[0]+Ahatbeta[1]*t,t=-3..3,color=blue): |
> | BL:=plot(Bhatbeta[0]+Bhatbeta[1]*t,t=-3..3,color=red): |
> | display(AP,BP,AL,BL); |
Note: From this plot, it seems that the lines have different slopes , indicating different growth rates .
How can we test whether this is statistically significant though ? Here is one method, where we
incorporate both data sets into a single set, but include a "category variable" indicating
which data set the point came from. Let
if the measurement is from type A, and
if the measurement is from type B. Then
will indicate the time, and Y will
be the number of bacteria. The following type of linear model will
allow us to fit the straight lines for both data types, and compare their slopes:
Y =
+
Why do we do it this way? Note: if
(type A), we get
On the other hand if
(type B), then we get
To compare the two slopes we just want to look at
=
. We know
how to set up a test of
:
versus
:
based on our general
discussions(!) Now we go to the matrix formulation for the multiple regression
model. We will set up X putting the data points for type A first, then type B,
but any order would give equivalent results:
> | with(linalg): |
Warning, the protected names norm and trace have been redefined and unprotected
> | X:=matrix([[1,0,-2,0],[1,0,-1,0],[1,0,0,0],[1,0,1,0],[1,0,2,0],[1,1,-2,-2],[1,1,-1,-1],[1,1,0,0],[1,1,1,1],[1,1,2,2]]): |
> | Y:=matrix([[8.0],[9.0],[9.1],[10.2],[10.4],[10.0],[10.3],[12.2],[12.6],[13.9]]): |
The normal equations:
> | XtX:=multiply(transpose(X),X); |
> | XtY:=multiply(transpose(X),Y); |
> | beta:=linsolve(XtX,XtY); |
Note that, contained here are the values for the two slopes we saw before(!)
The estimate for
Note this is (0,0,0,1)
, so we take
in our general formulas:
> | a:=matrix([[0],[0],[0],[1]]): |
> | S2:=1/(10-(3+1))*(multiply(transpose(Y),Y)[1,1] - multiply(multiply(transpose(beta),transpose(X)),Y)[1,1]); |
> | t:=beta[4,1]/sqrt(S2*multiply(transpose(a),multiply(inverse(XtX),a))[1,1]); |
With 10 - (4+1) = 5 d.f. this gives a p - value less than 0.05:
> | 2*(1 - TCDF(5,t)); |
> |
So at
= .05 level, for instance, this would be sufficient evidence to reject
and conclude
. As above, this indicates that the rates of growth
of the two strains of bacteria are different.