Date of Award

1989

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Biometry

College

College of Graduate Studies

First Advisor

Chan F. Lam

Second Advisor

Hurshell H. Hunt

Third Advisor

Rebecca G. Knapp

Fourth Advisor

Richard A. Schmiedt

Fifth Advisor

J. David Whitaker

Abstract

Linear regression analysis often involves considering a large number of potential variables, of which only a subset are actually used in the regression equation. Many techniques for selecting subsets of variables for regression have been proposed and used, and the efficiency of such techniques has been studied, but no comprehensive comparison of the various procedures is available. The purpose of this study is to use simulated data to evaluate the predictive ability of the regression models generated by various variable selection techniques. The techniques considered are all possible regressions criteria (Akaike's information criterion, Sawa's Bayesian information criterion, Mallows' Cp statistic, the mean square error of prediction, Amemiya's prediction criterion, Schwarz's criterion, Hocking's Sp statistic, the F statistic, and the coefficient of multiple determination), stepwise regression based on the F statistic with alpha ranging from .01 to .15, and a recently proposed two-stage procedure based on repeated stepwise regression with various alpha levels. Seventy-six experimental conditions are considered, varying in the number of observations (12 and 25), the number of potential predictor variables (5 to 50), the vector of regression coefficients (2 and 6 coefficients), the variance of the error term (1 and .25), and the level of correlation among the predictor variables (0 to .8).

Four different measures are used to compare the “predictive ability” of the variable selection techniques. One of these measures, original to this study, involves predicting another sample drawn from the same underlying distribution as the original sample. Results based on this new and more realistic measure are often, but not always, consistent with those from the previously used measures of predictive ability. Repeated measures analyses of variance (with the variable selection criteria as the repeated measure) and Tukey's studentized range tests of all possible pairwise comparisons are used to assess differences in the mean values of the measures of predictive ability among the variable selection criteria within each of six combinations of the number of observations, n, and the number of potential predictor variables, p.

These tests lead to the following conclusions. When n is larger than p, all the criteria perform equally well, with the exception of the F statistic, the coefficient of multiple determination, and stepwise regression with alpha = .01, all of which perform significantly worse than the other criteria. When n and p are close in size, most of the all possible regressions criteria perform well; Sawa's Bayesian information criterion and Schwarz's criterion, in particular, perform well in the experimental conditions considered. When p is larger than n, the all possible regressions approach is not applicable; in this situation the two-stage criteria perform well, though never statistically better than at least one of the stepwise criteria. In all situations considered, but particularly when p is larger than n, the performance of the stepwise regression criteria is erratic and very sensitive to the alpha level employed. The same is true of the two-stage procedure, in which both stages are based on stepwise regression criteria. Exploration of a two-stage approach with the second stage based on all possible regressions criteria rather than stepwise regression criteria is suggested.
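The sketch below is not the dissertation's code (a 1989 study of this kind would most likely have used SAS); it is a small, self-contained illustration of the kind of comparison the abstract describes: simulate correlated predictors, fit all possible subsets, score each subset with common textbook forms of AIC, Schwarz's criterion, and Mallows' Cp, and then measure predictive ability on a fresh sample drawn from the same underlying distribution. The sample size, correlation level, coefficient vector, and criterion formulas are illustrative assumptions, not the study's actual experimental conditions.

```python
# Illustrative sketch only: all-possible-regressions selection criteria compared
# against prediction of a new sample from the same distribution.
import itertools
import numpy as np

rng = np.random.default_rng(0)

n, p = 25, 5                                  # assumed condition with n > p
rho, sigma2 = 0.4, 1.0                        # predictor correlation, error variance
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0])    # assumed true coefficient vector

def simulate(n_obs):
    """Draw equicorrelated predictors and a response from the same model."""
    cov = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n_obs)
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n_obs)
    return X, y

def fit(X, y):
    """Least-squares fit with intercept; returns coefficients and residual SS."""
    Xd = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = float(np.sum((y - Xd @ coef) ** 2))
    return coef, rss

X, y = simulate(n)
X_new, y_new = simulate(n)                    # fresh sample, same distribution
_, rss_full = fit(X, y)
s2_full = rss_full / (n - p - 1)              # full-model error variance (for Cp)

results = []
for k in range(1, p + 1):
    for subset in itertools.combinations(range(p), k):
        cols = list(subset)
        coef, rss = fit(X[:, cols], y)
        q = k + 1                                  # parameters incl. intercept
        aic = n * np.log(rss / n) + 2 * q          # Akaike's information criterion
        bic = n * np.log(rss / n) + q * np.log(n)  # Schwarz's criterion
        cp = rss / s2_full - n + 2 * q             # Mallows' Cp
        # "Predictive ability": mean squared error on the new sample.
        Xd_new = np.column_stack([np.ones(n), X_new[:, cols]])
        pred_mse = float(np.mean((y_new - Xd_new @ coef) ** 2))
        results.append((cols, aic, bic, cp, pred_mse))

for name, idx in [("AIC", 1), ("BIC", 2), ("Cp", 3)]:
    best = min(results, key=lambda r: r[idx])
    print(f"{name:>3} selects predictors {best[0]}, new-sample MSE = {best[4]:.3f}")
```

Running the loop over many replications of each experimental condition, and adding a stepwise search over the same simulated data, would reproduce the broad structure of the comparison described above, though not its specific measures or results.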

Rights

All rights reserved. Copyright is held by the author.
