Development and validation of a machine learning model using administrative health data to predict onset of type 2 diabetes

Ravaut M, Harish V, Sadeghi H, Leung KK, Volkovs M, Kornas K, Watson T, Poutanen T, Rosella LC. JAMA Netw Open. 2021; 4(5):e2111315. Epub 2021 May 25. DOI:

Importance — Systems-level barriers to diabetes care could be ameliorated with population health planning tools that accurately discriminate between high and low-risk groups to guide investments and targeted interventions.

Objective — To develop and validate a population-level, machine learning model for predicting type-2 diabetes onset five years ahead using administrative health data.

Design, Setting, Participants — This study used linked administrative health data from the diverse, single-payer health system in Ontario, Canada, between 2006 and 2016. A gradient boosting decision tree model was trained on data from 1,657,395 patients, validated on 243,442 patients, and tested on 236,506 patients. Finally, costs associated with each patient were estimated using a validated costing algorithm.

Exposures — A random sample of 2,137,343 residents of Ontario without type-2 diabetes was obtained at study start time. Over 300 features from datasets capturing demographics, laboratory measurements, drug benefits, healthcare system interactions, the social determinants of health, and ambulatory care and hospitalization records were compiled over two-year patient medical histories to generate quarterly predictions.

Main Outcomes and Measures — Discrimination was assessed using the area under the receiver operating characteristic curve (AUC), and calibration was assessed visually using calibration plots. Feature contribution was assessed with Shapley values. Costs were estimated in 2020 USD.
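
As an illustration of the two evaluation measures named above, the following is a minimal sketch in plain Python, not the study's code. The function names `auc_score` and `calibration_bins` are hypothetical, and the equal-width binning is an assumption; the abstract states only that calibration was inspected visually.

```python
def auc_score(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count half).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the block of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied items occupy 1-based ranks i+1 .. j; each gets the average rank.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def calibration_bins(labels, scores, n_bins=10):
    """Group predictions into equal-width bins (an assumed binning scheme).

    Returns (mean predicted risk, observed event rate) per non-empty bin;
    plotting these pairs against the diagonal gives a calibration plot.
    """
    bins = [[] for _ in range(n_bins)]
    for label, score in zip(labels, scores):
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))
    return [
        (sum(s for s, _ in b) / len(b), sum(y for _, y in b) / len(b))
        for b in bins
        if b
    ]
```

A well-calibrated model yields pairs in which the mean predicted risk and the observed event rate are close in every bin.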

Results — The developed model achieved a test AUC of 80.26 (range 80.21 - 80.29), demonstrated good calibration, and was robust to sex, immigration status, area-level marginalization with regard to material deprivation and ethnicity, and low contact with the healthcare system. The top 5% of patients predicted as high-risk by the model represented 26% of the total annual diabetes cost in Ontario.
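
The cost-concentration figure can be read as the share of total cost carried by the highest-risk slice of the population. The sketch below shows one way such a share might be computed; the function name `top_fraction_cost_share` and its inputs are hypothetical, and the study's validated costing algorithm is not reproduced here.

```python
def top_fraction_cost_share(risks, costs, fraction=0.05):
    """Share of total cost attributable to the top `fraction` of patients
    when ranked by predicted risk (illustrative; inputs are hypothetical)."""
    if len(risks) != len(costs) or not risks:
        raise ValueError("risks and costs must be non-empty and equal length")
    total_cost = sum(costs)
    if total_cost <= 0:
        raise ValueError("total cost must be positive")

    # Number of patients in the top slice, keeping at least one.
    k = max(1, int(len(risks) * fraction))
    # Patient indices ordered from highest to lowest predicted risk.
    order = sorted(range(len(risks)), key=lambda i: risks[i], reverse=True)
    top_cost = sum(costs[i] for i in order[:k])
    return top_cost / total_cost
```

For example, `top_fraction_cost_share([0.9, 0.1, 0.2, 0.3], [50, 10, 20, 20], fraction=0.25)` selects the single highest-risk patient, who accounts for half of the total cost in this toy input.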

Conclusions and Relevance — In this study, a machine learning approach accurately predicted the incidence of diabetes in the population using routinely collected health administrative data and could be used to inform decision-making for population-health planning and diabetes prevention.
