Publication

An Interpretable Machine Learning Framework for Rare Disease: A Case Study to Stratify Infection Risk in Pediatric Leukemia

Last modified
  • 01/14/2026
Authors
  • Irfan Al-Hussaini, Emory University
  • Brandon White, Emory University
  • Armon Varmeziar, Emory University
  • Nidhi Mehra, Emory University
  • Milagro Sanchez, Emory University
  • Judy Lee, Children’s Healthcare of Atlanta
  • Nicholas P. DeGroote, Children’s Healthcare of Atlanta
  • Tamara P. Miller, Children’s Healthcare of Atlanta
  • Cassie S. Mitchell, Emory University
Language
  • English
Date
  • 2024-03-20
Publisher
  • MDPI
Copyright Statement
  • © 2024 by the authors.
Volume
  • 13
Issue
  • 6
Start Page
  • 1788
Grant/Funding Agency
  • Georgia Institute of Technology
  • Chan Zuckerberg Initiative
  • National Science Foundation
  • Children’s Healthcare of Atlanta
  • Aflac Cancer and Blood Disorders Center
Grant/Funding Information
  • This research was funded by the Georgia Institute of Technology President’s Undergraduate Research Award to N.M. and B.W.; NIH grant R21CA232249 to C.M.; Aflac Pilot Grant from Aflac Cancer and Blood Disorders Center, Children’s Healthcare of Atlanta to T.P.M. and C.S.M.; National Science Foundation CAREER award 1944247 to C.S.M.; and the Chan Zuckerberg Initiative grant 253558 to C.S.M.
Abstract
  • Background: Datasets on rare diseases, such as pediatric acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), have small sample sizes that hinder machine learning (ML). The objective was to develop an interpretable ML framework to elucidate actionable insights from small tabular rare disease datasets. Methods: The comprehensive framework employed optimized data imputation and sampling, supervised and unsupervised learning, and literature-based discovery (LBD). The framework was deployed to assess treatment-related infection in pediatric AML and ALL. Results: An interpretable decision tree classified the risk of infection as either “high risk” or “low risk” in pediatric ALL (n = 580) and AML (n = 132) with an accuracy of ∼79%. Interpretable regression models predicted the discrete number of developed infections with a mean absolute error (MAE) of 2.26 for bacterial infections and an MAE of 1.29 for viral infections. Features that best explained the development of infection were the chemotherapy regimen, cancer cells in the central nervous system at initial diagnosis, chemotherapy course, leukemia type, Down syndrome, race, and National Cancer Institute risk classification. Finally, SemNet 2.0, an open-source LBD software that links relationships from 33+ million PubMed articles, identified additional features for the prediction of infection, such as glucose, iron, neutropenia-reducing growth factors, and systemic lupus erythematosus (SLE). Conclusions: The developed ML framework enabled state-of-the-art, interpretable predictions using rare disease tabular datasets. ML model performance baselines were successfully produced to predict infection in pediatric AML and ALL.
Author Notes
  • The authors have no conflicts of interest.
  • cassie.mitchell@bme.gatech.edu
  • Irfan Al-Hussaini: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization.
  • Brandon White: Conceptualization, Methodology, Software, Formal analysis, Writing – original draft.
  • Armon Varmeziar: Methodology, Validation, Formal analysis, Writing – original draft.
  • Nidhi Mehra: Conceptualization, Methodology, Software, Formal analysis.
  • Milagro Sanchez: Methodology, Formal analysis.
  • Judy Lee: Conceptualization, Validation, Resources, Data curation, Writing – review & editing.
  • Nicholas P. DeGroote: Conceptualization, Validation, Resources, Data curation, Writing – review & editing.
  • Tamara P. Miller: Conceptualization, Validation, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Supervision, Project administration, Funding acquisition.
  • Cassie S. Mitchell: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision, Project administration, Funding acquisition.
  • Håkon Reikvam: Academic Editor.
Research Categories
  • Developmental biology
  • Oncology
