Predictive performance of machine and statistical learning methods: impact of data-generating processes on external validity in the “large N, small p” setting

Machine learning approaches are increasingly suggested as tools to improve prediction of clinical outcomes. We aimed to identify when machine learning methods perform better than a classical learning method. To this end, we examined the impact of the data-generating process on the relative predictive accuracy of six machine and statistical learning methods: bagged classification trees, stochastic gradient boosting machines using trees as the base learners, random forests, the lasso, ridge regression, and unpenalized logistic regression. We performed simulations in two large cardiovascular datasets, each of which comprised independent derivation and validation samples collected in temporally distinct periods: patients hospitalized with acute myocardial infarction (AMI, n = 9484 vs. n = 7000) and patients hospitalized with congestive heart failure (CHF, n = 8240 vs. n = 7608). We used six data-generating processes, one based on each of the six learning methods, to simulate outcomes in the derivation and validation samples based on 33 and 28 predictors in the AMI and CHF datasets, respectively. We applied the six prediction methods in each of the simulated derivation samples and evaluated performance in the simulated validation samples according to c-statistic, generalized R², Brier score, and calibration. While no method had uniformly superior performance across all six data-generating processes and eight performance metrics, (un)penalized logistic regression and boosted trees tended to have superior performance to the other methods across a range of data-generating processes and performance metrics. This study confirms that classical statistical learning methods perform well in low-dimensional settings with large datasets.
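To make two of the performance measures named above concrete, the following is a minimal sketch of how a c-statistic and a Brier score could be computed from predicted probabilities in a validation sample. It is an illustration only, not the authors' code: the function names and the toy outcome and probability vectors are hypothetical, and the c-statistic is computed by brute-force pairwise comparison, which is practical only for small samples.

```python
from itertools import product


def c_statistic(y_true, y_prob):
    """Concordance: share of (event, non-event) pairs in which the event
    received the higher predicted probability; ties count as 0.5."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p_event, p_non_event in product(events, non_events):
        if p_event > p_non_event:
            concordant += 1.0
        elif p_event == p_non_event:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical toy validation sample (not data from the study).
y_valid = [1, 0, 1, 0, 0]
p_valid = [0.8, 0.3, 0.4, 0.4, 0.1]

print(round(c_statistic(y_valid, p_valid), 4))
print(round(brier_score(y_valid, p_valid), 4))
```

A higher c-statistic indicates better discrimination, while a lower Brier score indicates better overall accuracy of the predicted probabilities. In a comparison like the one described, each fitted method would supply its own vector of predicted probabilities for the same validation outcomes.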

Citation

Austin PC, Harrell FE Jr, Steyerberg EW. Stat Methods Med Res. 2021; 30(6):1465-83. Epub 2021 Apr 13.
