ST 352

Steps in Multiple Regression Analysis

 

1.         Examine the correlation between the explanatory variables.  If two variables are highly correlated and both are included in the regression model, the regression analysis may not pick up the significant effect of either when in fact they are both significant predictors of the response variable.  Therefore, you may consider removing one of the variables.  To check the correlations between pairs of explanatory variables, look at:

 

            -  scatterplot matrix

            -  correlations

 

2.         Check for a non-linear trend between the response variable and EACH of the explanatory variables.  If a non-linear trend exists, consider transforming the response variable and/or explanatory variable(s).  Also, check for possible outliers.  If a possible outlier exists, try to find out more information about that point (data entry error? comes from a different population than the rest of the points?).  Useful plots:

 

            - scatterplot matrix

            - residual plot: residuals versus predicted values

            - residual plots: residuals versus each of the explanatory variables

 

3.         Check for violation of the assumptions of the multiple regression model.  Recall that the assumptions of the model are that the ERROR terms are normally distributed with a mean of zero and a constant standard deviation and are also independent.  Graphs to use:

 

            - residual plot: residuals versus predicted values (for mean of zero and constant standard deviation assumptions)

            - normal probability plot of the residuals (for normality assumption)

 

            If the assumptions are not met, consider a transformation of the response variable (usually) and go back to step 2.
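To see what a normal probability plot is built from, here is a minimal sketch that pairs each sorted residual with the standard normal value expected for its rank. The residuals are made up, and the plotting position (i - 0.5)/n is one common convention; statistical packages may use a slightly different one.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each sorted residual with the standard normal quantile for its rank."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting position (i - 0.5) / n is an assumed convention; software differs.
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))

# Made-up residuals from some fitted model.
residuals = [0.3, -1.2, 0.8, -0.1, 0.2]
points = normal_plot_points(residuals)
for z, r in points:
    print(f"normal quantile {z:6.3f}   residual {r:5.2f}")
```

If the points fall close to a straight line when plotted, the normality assumption is reasonable; strong curvature suggests a transformation.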

 

4.         With the best-fitting model, test to see if any of the explanatory variables are useful in predicting the response variable with the Analysis of Variance F-test

($H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$)

 

If the null hypothesis is not rejected, stop the analysis!!!  There is no evidence that any of the explanatory variables is useful in predicting the response, so there is no need to continue.
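The F statistic for this test is the model mean square divided by the error mean square. The sketch below computes it from the observed and fitted values. The data set and the fitted equation (y-hat = 4 + 1.75 x1 + 1.0 x2, which is the least-squares fit for these made-up numbers) are assumptions for illustration only.

```python
def anova_f(y, fitted, p):
    """Return (SSM, SSE, F) for a least-squares fit with p explanatory variables."""
    n = len(y)
    y_bar = sum(y) / n
    ssm = sum((f - y_bar) ** 2 for f in fitted)          # variation explained by the model
    sse = sum((a - f) ** 2 for a, f in zip(y, fitted))   # variation left in the residuals
    msm = ssm / p               # model mean square, p degrees of freedom
    mse = sse / (n - p - 1)     # error mean square, n - p - 1 degrees of freedom
    return ssm, sse, msm / mse

# Made-up data with two explanatory variables (n = 8, p = 2).
x1 = [-1, -1, 1, 1, -1, -1, 1, 1]
x2 = [-1, 1, -1, 1, -1, 1, -1, 1]
y = [1, 3, 4, 8, 2, 3, 5, 6]

# Fitted values from the least-squares equation for this data set.
fitted = [4 + 1.75 * a + 1.0 * b for a, b in zip(x1, x2)]

ssm, sse, f_stat = anova_f(y, fitted, p=2)
print(f"SSM = {ssm}, SSE = {sse}, F = {f_stat:.2f}")
```

Here F is about 23.2 with 2 and 5 degrees of freedom. The 0.05 critical value from an F table for those degrees of freedom is 5.79, so the null hypothesis would be rejected and the analysis would continue to step 5.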

 

5.         If the null hypothesis is rejected in step 4, perform a backwards selection process to determine which variables are significant predictors of the response in the presence of the other variables.  The backwards selection process eliminates those variables which are not significant predictors of the response variable, one at a time.  This procedure involves the following steps:

a) Perform individual t-tests on each of the explanatory variables to test the

    hypothesis $H_0: \beta_i = 0$ for i = 1, 2, …, p (where p = the number of explanatory

    variables).  Find the corresponding p-values for each test.   

b) If none of the p-values are greater than .05, then all of the explanatory

    variables are significant predictors of the response variable in the presence of

    the other variables.  STOP the backwards selection process.

c)  If there is at least one p-value that is greater than .05, drop the variable with

the largest p-value from the model, perform the multiple regression analysis with the remaining variables, and go back to step (a).

 

Continue until no p-value is greater than .05.  The variables left are the significant predictors of the response variable, and the regression equation should be formed from the model that includes only these explanatory variables.
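The loop in steps (a) through (c) can be sketched as follows. The function `fit_pvalues` is a hypothetical placeholder for whatever software refits the regression and reports the t-test p-values; in the demonstration it is replaced by a lookup table of made-up p-values so that the loop can be followed by hand.

```python
def backward_selection(variables, fit_pvalues, alpha=0.05):
    """Drop the least significant variable, one at a time, until no p-value exceeds alpha.

    variables   -- names of the explanatory variables in the full model
    fit_pvalues -- hypothetical callable: takes a list of names, refits the
                   regression, and returns {name: t-test p-value}
    """
    remaining = list(variables)
    while remaining:
        pvalues = fit_pvalues(remaining)                   # step (a): refit, test each coefficient
        worst = max(remaining, key=lambda v: pvalues[v])   # variable with the largest p-value
        if pvalues[worst] <= alpha:                        # step (b): none above alpha, so stop
            break
        remaining.remove(worst)                            # step (c): drop it and go back to (a)
    return remaining

# Stand-in for real regression output: made-up p-values for each model tried.
canned = {
    frozenset({"x1", "x2", "x3"}): {"x1": 0.001, "x2": 0.40, "x3": 0.12},
    frozenset({"x1", "x3"}): {"x1": 0.0005, "x3": 0.03},
}

def lookup_pvalues(names):
    return canned[frozenset(names)]

final = backward_selection(["x1", "x2", "x3"], lookup_pvalues)
print(final)
```

Note that x3 is not significant in the full model (p = 0.12) but becomes significant once x2 is removed, which is why only one variable is dropped at a time.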

 

6.         Interpretation of the final regression model:

 

            a.         Write the final regression equation

i.          Interpretation of each of the coefficients in the final model:  for every one-unit increase in $x_i$, the response variable is predicted to change by $b_i$ units, on average, holding all of the other explanatory variables constant.

            Note: if a transformation is done, the interpretation is a bit different.  See the notes at the end of the “Animal Gestation” handout.

b.         Prediction:  to predict the response variable for given values of each of the explanatory variables, substitute the values of x into the regression equation and solve for y.

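c.         Confidence intervals for the coefficients:  $b_i \pm t^* \, SE_{b_i}$, where $t^*$ is the critical value from the t distribution with n − p − 1 degrees of freedom.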

d.         $R^2$:  the percentage of the variation in the response variable that is explained by the regression on the explanatory variables.

           

            $R^2 = \mathrm{SSM}/\mathrm{SST}$

 

e.         Estimate of the standard deviation of the residuals ($s$):  recall that one of the assumptions of the multiple linear regression model is that the residuals (or error terms) have a constant standard deviation.  The estimate of that standard deviation can be found two ways (and, yes, you should know both ways!):

            i.  From the bottom of the STATGRAPHICS output for multiple linear regression,

    look for the words “Standard error of the estimate”.  This is the estimate of the

    standard deviation of the residuals.

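ii.  From the ANOVA table, take the square root of the mean square for error:  $s = \sqrt{\mathrm{MSE}} = \sqrt{\mathrm{SSE}/(n - p - 1)}$.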
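The quantities in step 6 can be worked out with a few lines of arithmetic. The sketch below uses a made-up final equation (y-hat = 4 + 1.75 x1 + 1.0 x2) and made-up ANOVA entries (SSM = 32.5, SSE = 3.5, n = 8); none of these numbers come from a real data set. The residual standard deviation is computed as the square root of SSE/(n - p - 1), which is the usual mean-square-error calculation.

```python
import math

def predict(intercept, slopes, x_values):
    """Predicted response: intercept plus each coefficient times its x value."""
    return intercept + sum(slopes[name] * x_values[name] for name in slopes)

def r_squared(ssm, sse):
    """Proportion of variation explained; SST = SSM + SSE."""
    return ssm / (ssm + sse)

def residual_sd(sse, n, p):
    """Estimated standard deviation of the residuals: sqrt(SSE / (n - p - 1))."""
    return math.sqrt(sse / (n - p - 1))

# Hypothetical final model and ANOVA entries.
intercept = 4.0
slopes = {"x1": 1.75, "x2": 1.0}

y_hat = predict(intercept, slopes, {"x1": 1, "x2": -1})
y_hat_next = predict(intercept, slopes, {"x1": 2, "x2": -1})   # x1 up by one unit, x2 held fixed
r2 = r_squared(ssm=32.5, sse=3.5)
s = residual_sd(sse=3.5, n=8, p=2)

print(f"prediction = {y_hat}, change for one more unit of x1 = {y_hat_next - y_hat}")
print(f"R^2 = {r2:.3f}, s = {s:.3f}")
```

The change in the prediction when x1 goes up by one unit with x2 held constant equals the coefficient of x1, which is exactly the interpretation given in part (a).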

 

 

One final note:  the word “linear” in both Simple Linear Regression and Multiple Linear Regression refers to the fact that each of the explanatory variables is linearly related to the response variable, and, hence, has a power of 1.  There are many situations where a non-linear regression might be performed.  For example, if you were to look at a scatterplot and saw a quadratic curve (like a bowl either facing right-side up or upside down), then a polynomial regression could be performed.  This involves adding a squared term to the model.  So, starting from the Simple Linear Regression case with one explanatory variable, the regression equation would look like this:

 

            $\hat{y} = b_0 + b_1 x + b_2 x^2$
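For the curious, here is a minimal sketch of fitting a model with a squared term, y-hat = b0 + b1 x + b2 x^2. The squared term is treated as one more explanatory column, and the coefficients come from solving the least-squares normal equations. The data are made up and lie exactly on a bowl-shaped curve so the answer is easy to check; real data would scatter around the curve.

```python
def solve(a, b):
    """Solve the square system a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]         # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):                         # back substitution
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef

def fit_quadratic(x, y):
    """Least-squares estimates [b0, b1, b2] for y = b0 + b1*x + b2*x**2."""
    rows = [[1.0, xi, xi ** 2] for xi in x]                # squared term is one more column
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)

# Made-up data generated from y = 1 + 2x + 0.5x^2 with no noise.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.0, -0.5, 1.0, 3.5, 7.0]

b0, b1, b2 = fit_quadratic(x, y)
print(f"y-hat = {b0:.2f} + {b1:.2f} x + {b2:.2f} x^2")
```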

 

Another situation to be aware of is where the response variable is binary.  That is, the only values the response variable can assume are 0 or 1.  In this case, a logistic regression would be performed. 

 

We will not go into details about these two types of regression, but you should be aware that they exist.  One situation we will cover is when we have categorical explanatory variables.  Stay tuned for that exciting episode of More Regression, Please!!