ST 352
Simple Linear Regression
The Animal Gestation problem
In this example, we’ll walk through a complete simple linear regression analysis, including redoing the analysis after a transformation.
The following data lists the average gestation period (in days) and longevity (average life expectancy in years) for a sample of animals, as reported in The 1993 World Almanac and Book of Facts.
animal        gestation  longevity    animal        gestation  longevity
baboon              187         20    guinea pig           68          4
black bear          219         18    hippopotamus        238         25
grizzly bear        225         25    horse               330         20
polar bear          240         20    kangaroo             42          7
beaver              122          5    leopard              98         12
buffalo             278         15    lion                100         15
camel               406         12    monkey              164         15
chimpanzee          231         12    moose               240         12
cat                  63         20    mouse                21          3
chipmunk             31          6    opossum              15          1
cow                 284         15    pig                 112         10
deer                201          8    puma                 90         12
dog                  61         12    rabbit               31          5
donkey              365         12    rhinoceros          450         15
elephant            645         40    sea lion            350         12
elk                 250         15    sheep               154         12
fox                  52          7    squirrel             44         10
giraffe             425         10    tiger               105         16
goat                151          8    wolf                 63          5
gorilla             257         20    zebra               365         15
1. What is the response and what is the explanatory variable?
Step 1: Determine if a linear relationship exists between longevity and gestation, and identify any possible outliers:

2. Is the relationship between longevity and gestation linear?
3. Are there any outliers? If so, identify the animal.
Step 2: Determine if the assumptions of the model are met.

4. Which assumption(s) seem to be violated? Why?
5. What should be done?
Steps 3 and 4: A natural log transformation of the response was done. After the transformation, we go back to Step 1:
Determine if a linear relationship exists between longevity and log(gestation). (Note: we do not have to go back and consider outliers again; we already did that!)
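The transformation and the refit can be sketched in a few lines of Python. This is a minimal illustration, not the STATGRAPHICS procedure: the data are copied from the table above, "log" is taken to be the natural log (an assumption, consistent with the use of e in the Notes), and the helper name least_squares is made up for this example. If the table matches the data behind the output, the printed estimates should be close to the intercept and slope reported there.

```python
import math

# (animal, gestation in days, longevity in years), copied from the table above
DATA = [
    ("baboon", 187, 20), ("black bear", 219, 18), ("grizzly bear", 225, 25),
    ("polar bear", 240, 20), ("beaver", 122, 5), ("buffalo", 278, 15),
    ("camel", 406, 12), ("chimpanzee", 231, 12), ("cat", 63, 20),
    ("chipmunk", 31, 6), ("cow", 284, 15), ("deer", 201, 8),
    ("dog", 61, 12), ("donkey", 365, 12), ("elephant", 645, 40),
    ("elk", 250, 15), ("fox", 52, 7), ("giraffe", 425, 10),
    ("goat", 151, 8), ("gorilla", 257, 20), ("guinea pig", 68, 4),
    ("hippopotamus", 238, 25), ("horse", 330, 20), ("kangaroo", 42, 7),
    ("leopard", 98, 12), ("lion", 100, 15), ("monkey", 164, 15),
    ("moose", 240, 12), ("mouse", 21, 3), ("opossum", 15, 1),
    ("pig", 112, 10), ("puma", 90, 12), ("rabbit", 31, 5),
    ("rhinoceros", 450, 15), ("sea lion", 350, 12), ("sheep", 154, 12),
    ("squirrel", 44, 10), ("tiger", 105, 16), ("wolf", 63, 5),
    ("zebra", 365, 15),
]


def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


longevity = [row[2] for row in DATA]
# Assumption: "log" means the natural log
log_gestation = [math.log(row[1]) for row in DATA]

intercept, slope = least_squares(longevity, log_gestation)
residuals = [y - (intercept + slope * x)
             for x, y in zip(longevity, log_gestation)]

print(f"n = {len(DATA)}")
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
```

Plotting log_gestation against longevity, and the residuals against the fitted values, gives the plots that Steps 1 and 2 ask about.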

6. Does the relationship between longevity and log(gestation) appear to be fairly linear? How is (are) the outlier(s) influencing the linearity of this relationship?
Step 2: Determine if the assumptions of the model are met:

7. Do the assumptions of the model appear to be met? Are they better met than on the original scale?
Steps 5 & 6: With the best-fitting model, determine if the explanatory variable helps to predict the response.
Here is the Plot of the Fitted Model (scatterplot with the least-squares regression line drawn on it)

Use the STATGRAPHICS output below to answer the following questions:
Simple Regression - log(gestation) vs. longevity
Regression Analysis - Linear model: Y = a + b*X
-----------------------------------------------------------------------------
Dependent variable: log(gestation)
Independent variable: longevity
-----------------------------------------------------------------------------
                                 Standard          T
Parameter        Estimate          Error       Statistic      P-Value
-----------------------------------------------------------------------------
Intercept          3.8096       0.227308         16.7596       0.0000
Slope           0.0855573      0.0151843
-----------------------------------------------------------------------------
Analysis of Variance
-----------------------------------------------------------------------------
Source          Sum of Squares     Df    Mean Square    F-Ratio    P-Value
-----------------------------------------------------------------------------
Model                  14.9849      1        14.9849      31.75     0.0000
Residual               17.9354     38       0.471985
-----------------------------------------------------------------------------
Total (Corr.)          32.9203     39
Correlation Coefficient = 0.674675
R-squared =
R-squared (adjusted for d.f.) = 44.0849 percent
Standard Error of Est. = 0.687012
Mean absolute error = 0.560136
Durbin-Watson statistic = 1.93505 (P=0.4144)
Lag 1 residual autocorrelation = 0.0119836
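Several lines of this output are tied together by simple arithmetic on the Analysis of Variance table. The sketch below, using only values copied from that table, shows how the residual mean square, the F-ratio, the standard error of the estimate, and the adjusted R-squared follow from the sums of squares and degrees of freedom. The variable names are chosen for this illustration, and the results should agree with the printed output up to rounding.

```python
import math

# Values copied from the Analysis of Variance table above
ss_model, df_model = 14.9849, 1
ss_resid, df_resid = 17.9354, 38
ss_total, df_total = 32.9203, 39

ms_model = ss_model / df_model       # mean square for the model
ms_resid = ss_resid / df_resid       # mean square for the residuals
f_ratio = ms_model / ms_resid        # F-ratio = MS(model) / MS(residual)

# Standard error of the estimate: square root of the residual mean square
std_error_est = math.sqrt(ms_resid)

# Adjusted R-squared compares the residual mean square with the total mean square
adj_r_squared = 1 - ms_resid / (ss_total / df_total)

print(f"Mean square (residual) = {ms_resid:.6f}")
print(f"F-ratio                = {f_ratio:.2f}")
print(f"Standard error of est. = {std_error_est:.6f}")
print(f"Adjusted R-squared     = {100 * adj_r_squared:.4f} percent")
```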
8. Does longevity (life expectancy) of animals help explain gestation period for these animals? Write the null and alternative hypotheses, calculate the appropriate test statistic (with degrees of freedom), find the p-value, and write a sentence answering the question.
9. Write the regression equation in the context of this problem.
10. Predict gestation for an animal with a life expectancy of 17 years.
11. What percent of the variation in log(gestation) is explained by the regression line?
Notes:
1) When doing a log transformation of the response variable, the interpretation of the slope (and y-intercept) becomes a bit more difficult. For this problem, the interpretation is as follows: a one-year increase in life expectancy of an animal is associated with a multiplicative change in the median gestation period of e^0.0856 (about 1.09). In other words, the median gestation period for a life expectancy of 17 years is about 1.09 times as long as for a life expectancy of 16 years.
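The multiplicative interpretation in note 1 can be checked numerically. In this sketch the intercept and slope are copied from the output, log is assumed to be the natural log, and median_gestation is a hypothetical helper name. It shows that the ratio of the back-transformed fitted values at 17 and 16 years is exactly e raised to the slope, whatever the intercept is.

```python
import math

intercept = 3.8096     # from the output (log scale)
slope = 0.0855573      # from the output (log scale)


def median_gestation(longevity_years):
    """Back-transform the fitted log-scale value to days (estimated median)."""
    return math.exp(intercept + slope * longevity_years)


factor = math.exp(slope)
ratio_17_vs_16 = median_gestation(17) / median_gestation(16)

print(f"e^slope                     = {factor:.2f}")
print(f"median at 17 / median at 16 = {ratio_17_vs_16:.2f}")
```

The same ratio results for any two life expectancies one year apart, which is why the slope is read as a constant multiplicative change.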
2) When doing a log transformation, a confidence interval for the slope is started in the same way as if no transformation had been done, but it is finished in a slightly different way:
95% confidence interval for the slope: 0.0856 ± (2.042)(0.0152) = (0.0546, 0.1166)
to finish: (e^0.0546, e^0.1166) = (1.06, 1.12)
The interpretation: we are 95% confident that the median gestation period is between 1.06 and 1.12 times as long for every one-year increase in life expectancy.
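The two-step interval in note 2 can be sketched as follows, using the rounded slope, standard error, and multiplier that appear in the note. The multiplier 2.042 is taken as given; it matches the tabled t value for 30 degrees of freedom, which suggests a table lookup, but that reading is an assumption. The back-transformed endpoints are printed to three decimals so the effect of rounding is visible.

```python
import math

slope = 0.0856       # rounded slope used in the note (log scale)
std_error = 0.0152   # rounded standard error used in the note
t_star = 2.042       # multiplier used in the note

# Step 1: ordinary interval for the slope on the log scale
margin = t_star * std_error
log_lower, log_upper = slope - margin, slope + margin

# Step 2: exponentiate both endpoints to get an interval for the ratio of medians
ratio_lower, ratio_upper = math.exp(log_lower), math.exp(log_upper)

print(f"log scale:   ({log_lower:.4f}, {log_upper:.4f})")
print(f"ratio scale: ({ratio_lower:.3f}, {ratio_upper:.3f})")
```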
3) The standard error of the estimate (the estimated standard deviation of the residuals) is also on the log scale. We won’t try to put it back on the original scale, since the assumption of constant variation was violated on the original scale (original data).