About this item:

Author Notes:

Abeed Sarker, PhD, Department of Biomedical Informatics, School of Medicine, Emory University, 101 Woodruff Cir, Fourth Floor E, Atlanta, GA 30322. Email:

Y. Guo: conceptualization, methods, investigation, data curation, writing original draft and review and editing, and analysis; Dr Al‐Garadi: conceptualization, methods, investigation, and data curation; Dr Book: conceptualization, methods, investigation, resources, data curation, writing original draft and review and editing, supervision, and funding acquisition; L. C. Ivey: methods, data curation, editing draft, and supervision; Dr Rodriguez: methods, data curation, editing draft, and supervision; C. L. Raskind‐Hood: conceptualization, methods, and draft editing; C. Robichaux: data curation; Abeed Sarker: conceptualization, methods, investigation, data curation, writing original draft and review and editing, preparation, supervision, and project administration.

Disclosures: None.

Research Funding:

This work was supported by the Centers for Disease Control and Prevention Cooperative Agreement, Congenital Heart Defects Surveillance Across Time and Regions grant/award number CDC‐RFA‐DD19‐1902B.

Keywords:

  • Science & Technology
  • Life Sciences & Biomedicine
  • Cardiac & Cardiovascular Systems
  • Cardiovascular System & Cardiology
  • congenital heart disease
  • Fontan
  • natural language processing
  • single ventricle
  • clinical notes
  • identification

Supervised Text Classification System Detects Fontan Patients in Electronic Records With Higher Accuracy Than ICD Codes

Journal Title:

JOURNAL OF THE AMERICAN HEART ASSOCIATION

Volume:

Volume 12, Number 13

Pages:

e030046

Type of Work:

Article | Final Publisher PDF

Abstract:

BACKGROUND: The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing–based machine learning models to automatically detect Fontan cases from free texts in electronic health records, and compare their performances with ICD code–based classification.

METHODS AND RESULTS: We included free-text notes of 10 935 manually validated patients, 778 (7.1%) Fontan and 10 157 (92.9%) non-Fontan, from 2 health care systems. Using 80% of the patient data, we trained and optimized multiple machine learning models, support vector machines and 2 versions of RoBERTa (a robustly optimized transformer-based model for language understanding), for automatically identifying Fontan cases based on notes. For RoBERTa, we implemented a novel sliding window strategy to overcome its length limit. We evaluated the machine learning models and ICD code–based classification on 20% of the held-out patient data using the F1 score metric. The ICD classification model, support vector machine, and RoBERTa achieved F1 scores of 0.81 (95% CI, 0.79–0.83), 0.95 (95% CI, 0.92–0.97), and 0.89 (95% CI, 0.88–0.85) for the positive (Fontan) class, respectively. Support vector machines obtained the best performance (P<0.05), and both natural language processing models outperformed ICD code–based classification (P<0.05). The sliding window strategy improved performance over the base model (P<0.05) but did not outperform support vector machines. ICD code–based classification produced more false positives.

CONCLUSIONS: Natural language processing models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes, and the former demonstrated the possibility of further improvement.
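
As an illustration of the two technical ideas named in the abstract, the following is a minimal Python sketch of a sliding window over a long token sequence and of the F1 score for the positive (Fontan) class. It is not the authors' implementation. The window size, stride, decision threshold, the rule of taking the maximum window score, and the names `sliding_windows`, `classify_document`, `score_window`, and `f1_positive` are all assumptions made for this example; the abstract does not specify any of them.

```python
from typing import Callable, List, Sequence


def sliding_windows(tokens: Sequence[int], window: int = 512,
                    stride: int = 256) -> List[List[int]]:
    """Split a token sequence into overlapping windows.

    window and stride are illustrative defaults, not values from the paper.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    chunks: List[List[int]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the text
        start += stride
    return chunks


def classify_document(tokens: Sequence[int],
                      score_window: Callable[[List[int]], float],
                      window: int = 512, stride: int = 256,
                      threshold: float = 0.5) -> bool:
    """Label a document positive if any window scores at or above threshold.

    score_window is a hypothetical stand-in for a model that returns a
    positive-class probability; max-aggregation is an assumed rule.
    """
    scores = [score_window(chunk)
              for chunk in sliding_windows(tokens, window, stride)]
    return max(scores) >= threshold


def f1_positive(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Because the classes are imbalanced (7.1% Fontan), F1 on the positive class is more informative than raw accuracy: a classifier that labels every patient non-Fontan would reach 92.9% accuracy but an F1 of 0 for the Fontan class.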

Copyright information:

© 2023 The Authors.

This is an Open Access work distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (https://creativecommons.org/licenses/by-nc/4.0/).