Development and validation of a machine learning model using administrative health data to predict onset of type 2 diabetes

Importance — Systems-level barriers to diabetes care could be ameliorated with population health planning tools that accurately discriminate between high- and low-risk groups to guide investments and targeted interventions.

Objective — To develop and validate a population-level machine learning model for predicting type 2 diabetes onset five years ahead using administrative health data.

Design, Setting, Participants — This study used linked administrative health data from the diverse, single-payer health system in Ontario, Canada, between 2006 and 2016. A gradient boosting decision tree model was trained on data from 1,657,395 patients, validated on 243,442 patients, and tested on 236,506 patients. Finally, costs associated with each patient were estimated using a validated costing algorithm.

Exposures — A random sample of 2,137,343 residents of Ontario without type 2 diabetes was obtained at study start time. Over 300 features from datasets capturing demographics, laboratory measurements, drug benefits, healthcare system interactions, the social determinants of health, and ambulatory care and hospitalization records were compiled over two-year patient medical histories to generate quarterly predictions.

Main Outcome and Measures — Discrimination was assessed using the AUC statistic and calibration was assessed visually using calibration plots. Feature contribution was assessed with Shapley values. Costs were estimated in 2020 USD.
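As a rough illustration of the two performance measures named above, the following sketch computes an AUC and the binned values behind a calibration plot using only the Python standard library. It is not the study's code: the function names `auc` and `calibration_bins`, the equal-width binning, and the toy inputs are assumptions made for illustration. The function returns AUC on a 0 to 1 scale, whereas the abstract reports it multiplied by 100.

```python
def auc(labels, scores):
    """AUC: probability that a random positive case is scored above a
    random negative case, with ties counted as one half.
    Computed from rank sums (Mann-Whitney U). Labels must be 0 or 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the current group of tied scores.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def calibration_bins(labels, scores, n_bins=10):
    """Group predictions into equal-width risk bins (an assumed choice) and
    return (mean predicted risk, observed event rate, count) per non-empty
    bin. Plotting the first value against the second gives a calibration plot."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, scores):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    result = []
    for b in bins:
        if b:
            mean_pred = sum(p for _, p in b) / len(b)
            observed = sum(y for y, _ in b) / len(b)
            result.append((mean_pred, observed, len(b)))
    return result


# Toy data, invented for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)
toy_bins = calibration_bins(toy_labels, toy_scores, n_bins=2)
```

A well-calibrated model would show observed event rates close to the mean predicted risk in every bin.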

Results — The developed model achieved a test AUC of 80.26 (range 80.21–80.29), demonstrated good calibration, and was robust to sex, immigration status, area-level marginalization with regard to material deprivation and ethnicity, and low contact with the healthcare system. The top 5% of patients predicted as high-risk by the model represented 26% of the total annual diabetes cost in Ontario.
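The cost figure above is a share of total cost attributed to the highest-risk patients. A minimal sketch of that kind of calculation follows, assuming one predicted risk and one cost per patient. The function name `top_risk_cost_share` and the toy numbers are hypothetical, and the study's validated costing algorithm is not reproduced here.

```python
def top_risk_cost_share(risks, costs, top_fraction=0.05):
    """Share of total cost incurred by the patients with the highest
    predicted risk. `top_fraction` is the proportion of patients treated
    as high-risk (0.05 corresponds to the top 5%)."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    if len(risks) != len(costs) or not risks:
        raise ValueError("risks and costs must be non-empty and equal length")
    total = sum(costs)
    if total <= 0:
        raise ValueError("total cost must be positive")
    # Always include at least one patient in the high-risk group.
    n_top = max(1, round(len(risks) * top_fraction))
    # Rank patients from highest to lowest predicted risk.
    ranked = sorted(zip(risks, costs), key=lambda rc: rc[0], reverse=True)
    return sum(cost for _, cost in ranked[:n_top]) / total


# Toy data, invented for illustration only.
toy_share = top_risk_cost_share(
    risks=[0.9, 0.8, 0.1, 0.2],
    costs=[60.0, 20.0, 10.0, 10.0],
    top_fraction=0.5,
)
```

A share well above `top_fraction` indicates that cost is concentrated among the patients the model ranks as highest risk.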

Conclusions and Relevance — In this study, a machine learning model accurately predicted the incidence of diabetes in the population using routinely collected health administrative data and could be used to inform decision-making for population health planning and diabetes prevention.

Information

Citation

Ravaut M, Harish V, Sadeghi H, Leung KK, Volkovs M, Kornas K, Watson T, Poutanen T, Rosella LC. JAMA Netw Open. 2021; 4(5):e2111315. Epub 2021 May 25.

Contributing ICES Scientists

Laura Rosella

Associated Sites

ICES UofT

News Releases

News Release

25/05/2021

Researchers develop a machine learning model that accurately predicts diabetes using routinely collected and linked health data
