Prior studies have shown that automated variable selection yields models with substantially inflated estimates of the model R², and that a large proportion of the selected variables are in fact noise variables. These earlier studies used simulated data sets with sample sizes of at most 100.
The authors used Monte Carlo simulations to examine the large-sample performance of backwards variable elimination. They found that, in large samples, backwards variable elimination produced estimates of R² that were at most marginally biased. However, even in large samples, backwards elimination tended to identify the correct regression model in only a minority of the simulated data sets.
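As an illustration only, a Monte Carlo experiment of this general kind could be sketched as follows. This is not the authors' simulation design: the number of true and noise predictors, the coefficient value 0.5, the 0.05 retention threshold, the normal approximation used for the p-values, and the function names `ols_fit`, `backward_eliminate`, and `simulate` are all assumptions made for the sketch.

```python
import math
import numpy as np


def ols_fit(design, y):
    """Fit OLS; return coefficients, two-sided p-values, and R^2.

    Assumption: p-values use a normal approximation to the t statistic,
    which is reasonable only when n is large.
    """
    n, p = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    rss = float(resid @ resid)
    se = np.sqrt(rss / (n - p) * np.diag(xtx_inv))
    z = np.abs(beta / se)
    pvals = np.array([math.erfc(v / math.sqrt(2.0)) for v in z])
    r2 = 1.0 - rss / float(np.sum((y - y.mean()) ** 2))
    return beta, pvals, r2


def backward_eliminate(X, y, alpha=0.05):
    """Repeatedly drop the least significant predictor until every
    remaining predictor has p < alpha. Returns (kept column indices, R^2)."""
    n = X.shape[0]
    kept = list(range(X.shape[1]))
    while kept:
        design = np.column_stack([np.ones(n), X[:, kept]])
        _, pvals, r2 = ols_fit(design, y)
        slope_p = pvals[1:]  # the intercept is never a removal candidate
        worst = int(np.argmax(slope_p))
        if slope_p[worst] < alpha:
            return kept, r2
        kept.pop(worst)
    return kept, 0.0  # intercept-only model explains no variance


def simulate(n=1000, n_true=3, n_noise=20, reps=200, alpha=0.05, seed=0):
    """Return (proportion of replicates in which exactly the true
    predictors were selected, mean R^2 of the selected models)."""
    rng = np.random.default_rng(seed)
    # Hypothetical data-generating model: true effects of 0.5, noise effects of 0.
    beta = np.concatenate([np.full(n_true, 0.5), np.zeros(n_noise)])
    true_set = list(range(n_true))
    n_correct = 0
    r2_values = []
    for _ in range(reps):
        X = rng.standard_normal((n, n_true + n_noise))
        y = X @ beta + rng.standard_normal(n)
        kept, r2 = backward_eliminate(X, y, alpha)
        n_correct += int(kept == true_set)
        r2_values.append(r2)
    return n_correct / reps, float(np.mean(r2_values))
```

Calling `simulate()` at several values of `n` and comparing the mean selected-model R² with the population value implied by the assumed coefficients (here 0.75 / 1.75) gives a rough sense of the bias, while the first returned value estimates how often the correct model is recovered. Both quantities depend heavily on the assumed settings.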
Research and statistical methods