Example: Identifying Influential Observations
Use function polyfitstat to test the observations used to create a multivariate polynomial regression.
1. Define a matrix that contains samples taken from 20 healthy individuals between 25 and 34 years old.
The columns of the matrix represent the triceps skinfold thickness, the thigh circumference, and the body fat, respectively, for each of the 20 individuals.
2. Use functions rows and cols to calculate the number of rows and columns.
3. Use the augment function to define matrix BM as the first two columns of the matrix. Use these measurements to predict the amount of fat in individuals.
4. Call polyfitstat to model the experiment by a first-order regression and to calculate the regression statistics.
5. Display the regression coefficients matrix found in row 7 of output matrix P.
6. Calculate the regression coefficients using matrix calculation.
The regression coefficients, written here as b0, b1, and b2, fit a first-order equation of the form: predicted body fat = b0 + b1 × (triceps skinfold thickness) + b2 × (thigh circumference).
7. Use the regression equation to calculate the predicted amount of body fat. Compare these values to the amount of body fat measured.
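The matrix calculation of the coefficients and the predicted values can be sketched outside the worksheet in Python with NumPy. This is a minimal illustration, not the worksheet's own expressions: the data values are made up and are not the 20 observations of this example, and the variable names are hypothetical.

```python
import numpy as np

# Made-up illustrative data (NOT the 20 observations of this example).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 7.0, 5.0, 8.0, 6.0])
y = np.array([3.1, 2.9, 7.2, 6.8, 12.5, 10.9, 15.8, 14.1])

# Design matrix: a column of ones for the intercept, then the two predictors.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Normal equations: solve (X^T X) b = X^T y for the coefficient vector b.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Predicted values from the first-order equation y_hat = b0 + b1*x1 + b2*x2.
y_hat = X @ b

print("coefficients:", b)
print("predicted:", y_hat)
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse of XᵀX explicitly, which is usually the numerically safer choice.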
8. Use the submatrix function to display model statistics found in the first rows of output matrix P.
The first statistic is the standard deviation for the regression.
9. Display the model diagnostics found in the last nested matrix of output matrix P.
The observed and the predicted values correspond to the values displayed in step 7. The residuals are the differences between the observed and the predicted values.
10. Calculate the residuals, or the difference between the observed and the predicted values.
11. Use function diag to calculate the leverage values, or the diagonal values, of matrix H.
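As a rough sketch of the residual and leverage calculations, assuming H is the usual hat matrix X(XᵀX)⁻¹Xᵀ, the same quantities can be computed in Python with NumPy. The data values are made up and are not the observations of this example, and the variable names are hypothetical.

```python
import numpy as np

# Made-up illustrative data (NOT the 20 observations of this example).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 7.0, 5.0, 8.0, 6.0])
y = np.array([3.1, 2.9, 7.2, 6.8, 12.5, 10.9, 15.8, 14.1])

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Hat matrix H = X (X^T X)^-1 X^T maps observed values to predicted values.
H = X @ np.linalg.inv(X.T @ X) @ X.T

y_hat = H @ y              # predicted values
residuals = y - y_hat      # observed minus predicted
leverage = np.diag(H)      # diagonal elements h_ii of H

print("residuals:", residuals)
print("leverage:", leverage)
```

The leverage values sum to the number of regression coefficients, which gives a quick sanity check on H.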
12. Calculate the studentized residuals.
Next, calculate the externally studentized residuals, or R-student. Constant p is the number of coefficients calculated for the regression, and S² is an estimate of σ² based on the data set with the ith observation removed.
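The two kinds of studentized residuals can be sketched in Python with NumPy, assuming the standard textbook formulas: the internally studentized residual divides each residual by sqrt(MSE·(1 − h_ii)), and R-student replaces MSE with the variance estimate from the fit that omits the ith observation. The data values are made up, not the observations of this example, and the variable names are hypothetical.

```python
import numpy as np

# Made-up illustrative data (NOT the 20 observations of this example).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 7.0, 5.0, 8.0, 6.0])
y = np.array([3.1, 2.9, 7.2, 6.8, 12.5, 10.9, 15.8, 14.1])

X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape  # number of observations, number of coefficients

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y          # residuals
h = np.diag(H)         # leverage values

# Mean squared error of the full fit.
mse = (e @ e) / (n - p)

# Internally studentized residuals.
r_internal = e / np.sqrt(mse * (1.0 - h))

# Variance estimate with the ith observation removed, obtained without
# refitting: SSE_(i) = SSE - e_i^2 / (1 - h_ii).
s2_deleted = ((n - p) * mse - e**2 / (1.0 - h)) / (n - p - 1)

# Externally studentized residuals (R-student).
r_student = e / np.sqrt(s2_deleted * (1.0 - h))

print("studentized:", r_internal)
print("R-student:", r_student)
```

The deleted variance estimate is computed from the full fit, so no regression has to be rerun for each omitted observation.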
13. Use Cook's distance to measure the overall influence of a deleted point on a linear regression.
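A minimal sketch of Cook's distance in Python with NumPy, assuming the common closed form D_i = e_i²·h_ii / (p·MSE·(1 − h_ii)²). The data values are made up, not the observations of this example, and the variable names are hypothetical.

```python
import numpy as np

# Made-up illustrative data (NOT the 20 observations of this example).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 7.0, 5.0, 8.0, 6.0])
y = np.array([3.1, 2.9, 7.2, 6.8, 12.5, 10.9, 15.8, 14.1])

X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y          # residuals
h = np.diag(H)         # leverage values
mse = (e @ e) / (n - p)

# Cook's distance: combines the size of the residual with the leverage.
cooks_d = (e**2 * h) / (p * mse * (1.0 - h) ** 2)

# Index (0-based) of the observation with the largest overall influence.
most_influential = int(np.argmax(cooks_d))

print("Cook's distances:", cooks_d)
print("most influential index:", most_influential)
```

A point can have a large Cook's distance because of a large residual, a high leverage, or both, which is why the measure is useful as a single summary of influence.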
14. Calculate DFFITS, the scaled difference between the predicted value when all the observations are included in the fit and the predicted value when the ith observation is omitted.
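This statistic can be sketched in Python with NumPy, assuming the standard DFFITS form, in which the change in the ith predicted value is scaled by its estimated standard deviation and reduces to R-student times sqrt(h_ii / (1 − h_ii)). The data values are made up, not the observations of this example, and the variable names are hypothetical.

```python
import numpy as np

# Made-up illustrative data (NOT the 20 observations of this example).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 7.0, 5.0, 8.0, 6.0])
y = np.array([3.1, 2.9, 7.2, 6.8, 12.5, 10.9, 15.8, 14.1])

X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y          # residuals
h = np.diag(H)         # leverage values
mse = (e @ e) / (n - p)

# Variance estimate with the ith observation removed, then R-student.
s2_deleted = ((n - p) * mse - e**2 / (1.0 - h)) / (n - p - 1)
r_student = e / np.sqrt(s2_deleted * (1.0 - h))

# DFFITS: scaled change in the ith fitted value when observation i is omitted.
dffits = r_student * np.sqrt(h / (1.0 - h))

print("DFFITS:", dffits)
print("largest |DFFITS|:", np.max(np.abs(dffits)))
```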
15. Use functions augment and stack to display the above statistics.
16. Use functions max and qt to determine the largest R-student value in the data set. Use the Bonferroni test to decide if the corresponding observation is an outlier.
The largest R-student value is smaller than the Bonferroni critical value, which indicates that the corresponding observation is not an outlier.
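As a sketch of the decision rule behind this comparison, assuming the Bonferroni outlier test described in the cited reference: with n observations, p regression coefficients, and a family significance level α (the value used in the worksheet is not shown here), the observation with the largest absolute R-student value t_i is flagged as an outlier only when

```latex
\max_i \lvert t_i \rvert \;>\; t\!\left(1 - \frac{\alpha}{2n};\; n - p - 1\right)
```

where t(q; ν) denotes the q quantile of Student's t distribution with ν degrees of freedom. For this example, n = 20 and, assuming p = 3 (an intercept plus two predictors), the quantile has 16 degrees of freedom.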
17. Determine the maximum DFFITS value.
The value is larger than 1, but it is close enough to 1 that the corresponding observation is not necessarily influential.
18. Create an index influence plot by plotting the Cook's distances against the runs.
The observation for run2 is the most influential observation of the data set. The observation for run12 identified in step 10 is also influential, but much less so than the observation for run2.
Reference
Neter, J., Kutner, M.H., Nachtsheim, C.J., Wasserman, W., Applied Linear Statistical Models, 4th ed., McGraw-Hill/Irwin, Boston, 1996, p. 375.