ST 352
ASSIGNMENT #4 – 40 points
Summer 2002
Due: beginning of lecture on August 15th
Botanists at the University of Toronto conducted a series of experiments to investigate the feeding habits of
baby snow geese (Journal of Applied Ecology, Vol. 32, 1995). Goslings were deprived of food until their guts
were empty, then were allowed to feed for 6 hours on a diet of plants or Purina Duck Chow. For each
feeding trial, the change in the weight of the gosling after 2.5 hours was recorded as a percentage of initial
weight. Two other variables recorded were digestion efficiency (measured as a percentage) and amount of
acid-detergent fiber in the digestive tract (also measured as a percentage). The data for 42 feeding trials are
in the data set SNOWGEESE on the g drive in Milne 201.
The botanists were interested in predicting weight change (y) as a function of digestion efficiency (x1) and
acid-detergent fiber (x2). Answer the following questions:
1) What is the response variable? What are the explanatory variables?
2) The first step in a multiple linear regression analysis is to make sure that a linear relationship exists between the response variable and each of the explanatory variables and that the explanatory variables are NOT highly correlated with each other. To do this, obtain a scatter plot matrix from STATGRAPHICS.
a) Does there appear to be a linear relationship between weight change and EACH explanatory variable? Are there any possible outliers? Explain.
b) Do the explanatory variables appear to be highly correlated? Explain. (Note: the actual correlation coefficient between the two explanatory variables may be helpful to answer this question. See the commands to learn how to do this.)
3) Assume the two explanatory variables are NOT highly correlated (which may or may not be what you found in number 2 above). The next step would be to check the assumptions of the multiple linear regression model. To do this, obtain the following from STATGRAPHICS:
- normal probability plot of the residuals
- residual plot of residuals versus predicted values
a) What are the assumptions of the multiple linear regression model?
b) Do these assumptions appear to be met? Explain.
4) Regardless of whether you believe the assumptions of the model are met or not, let’s continue under the assumption that they ARE met. From STATGRAPHICS, obtain the multiple regression analysis output.
a) Write the least-squares regression equation specific to this problem. Explain the terms in
the equation.
b) Interpret each of the coefficients and the constant term in terms of the problem.
c) Conduct a test to determine if ANY of the explanatory variables are useful in predicting weight change. Write the null and alternative hypotheses. Give the appropriate test statistic (with degrees of freedom). State the p-value and write a sentence answering the question, “Do any of the explanatory variables help predict weight change?”
d) If at least one of the explanatory variables predicts weight change, conduct a test on both of the explanatory variables to determine if one or both are predictors of weight change. State the null and alternative hypotheses for each test. Give the appropriate test-statistic (with degrees of freedom) and the p-value for both tests. Write a conclusion.
e) Give a 99% confidence interval for β2, the coefficient for acid-detergent fiber. Interpret the result.
5) If one of the explanatory variables is not a significant predictor of weight change, remove this variable from the model and rerun the analysis. Write the equation of the “new” least-squares regression line. Has the coefficient of the remaining variable changed? Why or why not? Interpret the coefficient of the remaining variable in this new equation.
The owner of an apartment building in Minneapolis believed that her property tax bill was too high because of
an overassessment of the property’s value by the city tax assessor. The owner hired an independent real
estate appraiser to investigate the appropriateness of the city’s assessment. The appraiser used regression
analysis to explore the relationship between the sale prices of apartment buildings sold in Minneapolis and
various characteristics of the properties. Twenty-five apartment buildings were randomly sampled from all
apartment buildings that were sold during a recent year. The data can be found in the MNSALES file. The
documentation is listed below.
Documentation for MNSALES file:
Code: identification code of the building (this is not important in the analysis)
Sale Price: the sale price (or market value) of the apartment building in dollars
No. Apartments: number of apartments in the building
Age: age of the apartment building in years
Lot Size: size of the lot on which the apartment building lies in square feet
Parking: number of on-site parking spaces
Building Area: gross building area in square feet
The real estate appraiser hypothesized that the sale price (that is, market value) of an apartment
building is related to the other variables in the data set. Carry out an analysis to determine which of the explanatory variables are significant predictors of sale price. Write a short report that includes:
- the final least-squares regression equation.
- a check and discussion of the model assumptions. Include only appropriate graphical displays, and
make sure you refer to them in your report. Also include the estimate of the standard deviation of
the residuals. If a transformation was done, this is where you would explain the transformation you
did and why.
- a discussion about any unusual data points.
- interpretations of the coefficients of the significant explanatory variables.
- R-squared (and what it means).
STATGRAPHICS COMMANDS
Multiple Linear Regression
Scatter plots and Correlations
From the main menu: Describe … Numeric Data … Multiple-Variable Analysis.
Select the response variable and all of the explanatory variables and place them in the “Data” box. Click OK.
The graph that appears on the right is the pairwise scatter plot of each variable versus each other variable.
One of the Tabular Options is Correlations. This will give you the correlation coefficients between variables.
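For readers who want to see what the Correlations option reports, the following is a minimal Python sketch of the sample (Pearson) correlation coefficient between two variables. The function name `pearson_r` and the data values are hypothetical illustrations; they are not taken from the SNOWGEESE file, and the sketch is not a description of how STATGRAPHICS computes its output.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for two explanatory variables (NOT the SNOWGEESE data)
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [5, 1, 7, 3, 2, 8, 4, 6]
print(round(pearson_r(x1, x2), 3))
```

A value near +1 or -1 between two explanatory variables would suggest they are highly correlated; a value near 0 would suggest they are not.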
From the main menu: Relate … Multiple Regression.
In the dialog box, select the response variable for Dependent Variable and all of the explanatory variables for Independent Variables. Click OK. The Analysis Summary will appear in the left window pane.
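The quantities that appear in a multiple regression summary (coefficients, standard errors, t statistics, the overall F statistic, and R-squared) can be sketched in a few lines of Python. This is only an illustrative least-squares fit using the normal equations and the standard library; the function names, dictionary keys, and data are hypothetical, the data are NOT from either assignment file, and p-values are omitted because they require t and F distribution tables or a statistics package.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; explanatory variables may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_ols(xs, y):
    """Least-squares fit of y on the columns of xs, with a constant term.

    xs: list of rows, one row of explanatory values per observation.
    y:  list of responses.
    """
    n, k = len(y), len(xs[0])
    X = [[1.0] + list(row) for row in xs]          # add the constant column
    p = k + 1
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    inv = invert(xtx)
    coef = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in X]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    y_bar = sum(y) / n
    sse = sum(e * e for e in resid)                 # error sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)        # total sum of squares
    df_model, df_error = k, n - k - 1
    mse = sse / df_error
    std_err = [math.sqrt(mse * inv[j][j]) for j in range(p)]
    return {
        "coef": coef,                               # [constant, b1, ..., bk]
        "std_err": std_err,
        "t_stats": [c / s for c, s in zip(coef, std_err)],
        "f_stat": ((sst - sse) / df_model) / mse,
        "df_model": df_model,
        "df_error": df_error,
        "r_squared": 1 - sse / sst,
        "residual_sd": math.sqrt(mse),
        "residuals": resid,
    }


# Hypothetical data: two explanatory variables, eight observations
xs = [[1, 5], [2, 1], [3, 7], [4, 3], [5, 2], [6, 8], [7, 4], [8, 6]]
y = [-11.9, 2.1, -14.1, -0.1, 5.1, -10.9, 2.9, -1.1]
fit = fit_ols(xs, y)
print([round(c, 3) for c in fit["coef"]], fit["df_model"], fit["df_error"])
```

Each t statistic is compared against a t distribution with `df_error` degrees of freedom, and the F statistic against an F distribution with (`df_model`, `df_error`) degrees of freedom.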
By clicking the Tabular Options button in the Analysis Tool Bar, other useful information can be obtained. Useful options include:
Confidence Intervals: This option will provide 95% confidence intervals for the coefficients and the constant term. You can change the level of confidence by using the Pane Options.
Unusual Residuals and Influential Points may be helpful in identifying influential observations. The points listed are based on certain formulas (not discussed in this class) and should be used as a guideline to identify possible outliers and influential observations.
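As a reminder of what the Confidence Intervals option computes, the usual textbook form of a confidence interval for a single regression coefficient is given below. The notation is an assumption added for illustration: b_j is the estimated coefficient, SE(b_j) its standard error, n the number of observations, k the number of explanatory variables, and t* the critical value from the t distribution for the chosen confidence level.

```latex
b_j \;\pm\; t^{*}_{\,n-k-1}\,\mathrm{SE}(b_j)
```

For a 99% interval, t* is the value that leaves 0.5% in each tail of the t distribution with n - k - 1 degrees of freedom.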
Note: The Correlation Matrix option DOES NOT give you the simple correlation coefficients between explanatory variables. To get the correlation coefficients between explanatory variables, see the Scatter plots and Correlations section above.
Note: To do a “backwards selection” process, right click with the cursor in the Analysis Summary window pane. Select Backwards Selection and click OK. The final model will be displayed. Note that all the options in the Tabular Options as well as the graphical displays from the Graphical Options will provide output based on the final model from the backwards selection process. Also note that using this option will give you the final model and which variable is being eliminated at each step, but it does NOT provide the p-value from the t-test at each step. You can also choose to do the backwards selection by removing one explanatory variable at a time and re-running the multiple regression analysis with the remaining variables so that you get the p-value.
Graphical Displays can be obtained by clicking the Graphical Options button in the Analysis Tool Bar. The useful graphical displays include Residuals versus Predicted and Residuals versus X (the Residual Plot used for the assessment of the constant variance assumption).
Note #1: In any of the Residual Plots, Studentized Residuals can be replaced by residuals by right clicking, selecting Pane Options, and then checking “residuals” in the dialog box.
Note #2: A Residual Plot of the residuals versus each of the explanatory variables can be obtained from the Residuals versus X. Simply right click and select Pane Options. Near the bottom of the dialog box is a box called “Plot versus:” with all of the explanatory variables listed. Select the desired explanatory variable and click OK.
To obtain a normal probability plot of the residuals, first go back to your Multiple Linear Regression analysis. Click the Save Results button (to the right of the Graphical Options button). Select “Residuals” and click OK. The residuals will now be saved as an additional column on your spreadsheet.
From the main menu, select Plot … Exploratory Plots … Normal Probability Plot.
Select RESIDUALS for the Data box and click OK.
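The idea behind a normal probability plot can be sketched in Python: sort the residuals and pair each with the standard normal quantile expected for its rank. The function name and residual values below are hypothetical, and the plotting positions (i - 0.375)/(n + 0.25) are one common convention chosen here as an assumption; STATGRAPHICS may use a different one.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Return (theoretical normal quantile, ordered residual) pairs."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.375) / (n + 0.25): an assumed convention
    quantiles = [std_normal.inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Hypothetical residuals (NOT from the assignment data)
for q, r in normal_plot_points([0.3, -0.1, 0.0, 0.1, -0.3]):
    print(f"{q:7.3f}  {r:6.2f}")
```

If the residuals are approximately normal, these pairs fall close to a straight line when plotted.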