Comparison of methods for tuning machine learning model hyper-parameters: with application to predicting high-need high-cost health care users

Background — Supervised machine learning is increasingly being used to estimate clinical predictive models. Many supervised machine learning models have hyper-parameters whose values must be judiciously specified to ensure adequate predictive performance.

Objective — To compare nine hyper-parameter optimization (HPO) methods for tuning the hyper-parameters of an extreme gradient boosting model, with application to predicting high-need high-cost health care users.

Methods — Extreme gradient boosting models were estimated using a randomly sampled training dataset. Models were separately trained using nine different HPO methods: 1) random sampling, 2) simulated annealing, 3) quasi-Monte Carlo sampling, 4-5) two variations of Bayesian hyper-parameter optimization via tree-Parzen estimation, 6-7) two implementations of Bayesian hyper-parameter optimization via Gaussian processes, 8) Bayesian hyper-parameter optimization via random forests, and 9) the covariance matrix adaptation evolution strategy (CMA-ES). For each HPO method, we estimated 100 extreme gradient boosting models at different hyper-parameter configurations and evaluated model performance using the area under the ROC curve (AUC) on a randomly sampled validation dataset. Using the best model identified by each HPO method, we evaluated generalization performance in terms of discrimination and calibration metrics on a randomly sampled held-out test dataset (internal validation) and a temporally independent dataset (external validation).
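The per-method tuning loop described above can be sketched in a few lines. The example below is a minimal illustration only, assuming Optuna as the HPO toolkit and XGBoost's Python API; the synthetic data, hyper-parameter names, and search ranges are illustrative assumptions, not the authors' exact setup.

```python
import optuna
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the study data; the class imbalance loosely
# mimics the rarity of high-need high-cost users (purely illustrative).
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.95], random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid)

def objective(trial):
    # Illustrative search space; the paper's exact ranges are not stated here.
    params = {
        "objective": "binary:logistic",
        "eta": trial.suggest_float("eta", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    booster = xgb.train(params, dtrain, num_boost_round=200)
    # Score each configuration by AUC on the validation split.
    return roc_auc_score(y_valid, booster.predict(dvalid))

# One study per HPO method; TPE is shown here. 100 trials = 100 configurations.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=100)
print(study.best_value, study.best_params)
```

Repeating the study with a different sampler (e.g., Optuna's RandomSampler, QMCSampler, or CmaEsSampler) would cover several of the nine methods compared, with the best configuration from each study carried forward to the held-out test data.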

Results — The extreme gradient boosting model estimated using default hyper-parameter settings had reasonable discrimination (AUC=0.82) but was not well calibrated. Hyper-parameter tuning with any of the HPO algorithms/samplers improved model discrimination (AUC=0.84), yielded models with near-perfect calibration, and consistently identified features predictive of high-need high-cost health care users.
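The internal-validation step, checking both discrimination and calibration on held-out test data, can be sketched as follows. This is an illustrative continuation of the tuning sketch above (it reuses `study`, `dtrain`, `X_test`, and `y_test` from that sketch); the paper's exact calibration metrics are not specified in this abstract.

```python
import xgboost as xgb
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score

# Refit the best configuration found above, then score the held-out test set.
best_params = dict(study.best_params, objective="binary:logistic")
booster = xgb.train(best_params, dtrain, num_boost_round=200)
p_test = booster.predict(xgb.DMatrix(X_test))  # predicted probabilities

print("AUC (discrimination):", roc_auc_score(y_test, p_test))
print("Brier score (calibration):", brier_score_loss(y_test, p_test))

# Reliability curve: a well-calibrated model has observed event rates
# close to mean predicted probabilities in every bin.
obs_rate, mean_pred = calibration_curve(y_test, p_test, n_bins=10)
for pred, obs in zip(mean_pred, obs_rate):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```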

Conclusions — In our study, all HPO algorithms resulted in similar gains in model performance relative to baseline models. This finding likely relates to our study dataset having a large sample size, a relatively small number of features, and a strong signal-to-noise ratio, and would likely apply to other datasets with similar characteristics.

Citation

Meaney C, Guan J, Wang X, Stukel T. Comparison of methods for tuning machine learning model hyper-parameters: with application to predicting high-need high-cost health care users. BMC Med Res Methodol. 2025;25(1):134. Epub 2025 May 15.
