AIMS: Various cardiovascular risk prediction models have been developed for patients with type 2 diabetes mellitus. Yet few models have been validated externally. We perform a comprehensive validation of existing risk models on a heterogeneous population of patients with type 2 diabetes using secondary analysis of electronic health record data. METHODS: Electronic health records of 47,988 patients with type 2 diabetes between 2013 and 2017 were used to validate 16 cardiovascular risk models, including 5 that had not been compared previously, to estimate the 1-year risk of various cardiovascular outcomes. Discrimination and calibration were assessed by the c-statistic and the Hosmer-Lemeshow goodness-of-fit statistic, respectively. Each model was also evaluated based on the missing measurement rate. Sub-analysis was performed to determine the impact of race on discrimination performance. RESULTS: There was limited discrimination (c-statistics ranged from 0.51 to 0.67) across the cardiovascular risk models. Discrimination generally improved when the model was tailored towards the individual outcome. After recalibration of the models, the Hosmer-Lemeshow statistic yielded p-values above 0.05. However, several of the models with the best discrimination relied on measurements that were often imputed (up to 39% missing). CONCLUSION: No single prediction model achieved the best performance on a full range of cardiovascular endpoints. Moreover, several of the highest-scoring models relied on variables with high missingness frequencies such as HbA1c and cholesterol that necessitated data imputation and may not be as useful in practice. An open-source version of our developed Python package, cvdm, is available for comparisons using other data sources.
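As a rough illustration of the two metrics named in the abstract above, the following sketch computes a c-statistic by pairwise comparison and a Hosmer-Lemeshow statistic over equal-sized risk groups. It is not taken from the cvdm package: the function names `c_statistic` and `hosmer_lemeshow` and the toy inputs are hypothetical, and the grouping rule (sort by predicted risk, split into `groups` chunks) is an assumption about one common way to form the groups.

```python
from typing import Sequence


def c_statistic(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random event is scored higher than a random non-event.

    Ties in score count as half a concordant pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


def hosmer_lemeshow(y_true: Sequence[int], y_prob: Sequence[float],
                    groups: int = 10) -> float:
    """Hosmer-Lemeshow chi-square statistic (the statistic only, no p-value)."""
    # Stable sort by predicted risk only, so tied risks keep their input order.
    pairs = sorted(zip(y_prob, y_true), key=lambda t: t[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        observed = sum(y for _, y in chunk)   # observed events in the group
        expected = sum(p for p, _ in chunk)   # sum of predicted risks
        variance = expected * (1 - expected / len(chunk))
        if variance > 0:
            stat += (observed - expected) ** 2 / variance
    return stat


# Toy data, invented for illustration only.
auc = c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
hl = hosmer_lemeshow([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5], groups=2)
```

A p-value would come from comparing the statistic with a chi-square distribution (conventionally with `groups - 2` degrees of freedom); that step is left out here to keep the sketch dependency-free.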
Background: Patients develop pressure injuries (PIs) in the hospital owing to low mobility, exposure to localized pressure, circulatory conditions, and other predisposing factors. Over 2.5 million Americans develop PIs annually. The Centers for Medicare & Medicaid Services considers hospital-acquired PIs (HAPIs) the most frequent preventable event, and they are the second most common claim in lawsuits. With the growing use of electronic health records (EHRs) in hospitals, an opportunity exists to build machine learning models to identify and predict HAPI rather than relying on occasional manual assessments by human experts. However, accurate computational models rely on high-quality HAPI data labels. Unfortunately, the different data sources within EHRs can provide conflicting information on HAPI occurrence in the same patient. Furthermore, the existing definitions of HAPI disagree with each other, even within the same patient population. The inconsistent criteria make it impossible to benchmark machine learning methods to predict HAPI. Objective: The objective of this project was threefold. We aimed to identify discrepancies in HAPI sources within EHRs, to develop a comprehensive definition for HAPI classification using data from all EHR sources, and to illustrate the importance of an improved HAPI definition. Methods: We assessed the congruence among HAPI occurrences documented in clinical notes, diagnosis codes, procedure codes, and chart events from the Medical Information Mart for Intensive Care III database. We analyzed the criteria used for the 3 existing HAPI definitions and their adherence to the regulatory guidelines. We proposed the Emory HAPI (EHAPI), which is an improved and more comprehensive HAPI definition. We then evaluated the importance of the labels in training a HAPI classification model using tree-based and sequential neural network classifiers.
Results: We illustrate the complexity of defining HAPI, with <13% of hospital stays having at least 3 PI indications documented across 4 data sources. Although chart events were the most common indicator, they were the only PI documentation for >49% of the stays. We demonstrate a lack of congruence across existing HAPI definitions and EHAPI, with only 219 stays having a consensus positive label. Our analysis highlights the importance of our improved HAPI definition, with classifiers trained using our labels outperforming others on a small manually labeled set from nurse annotators and a consensus set in which all definitions agreed on the label. Conclusions: Standardized HAPI definitions are important for accurately assessing the HAPI nursing quality metric and determining HAPI incidence for preventive measures. We demonstrate the complexity of defining an occurrence of HAPI, given the conflicting and incomplete EHR data. Our EHAPI definition has favorable properties, making it a suitable candidate for HAPI classification tasks.
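The source-congruence figures in the abstract above (stays with at least 3 PI indications, stays where chart events are the only documentation) can be tallied along the following lines. This is a minimal sketch under stated assumptions: the function `summarize_sources`, the source labels, and the toy stays are all hypothetical, and the denominator is assumed to be stays with at least one PI indication.

```python
from typing import Dict, Set, Tuple


def summarize_sources(stays: Dict[str, Set[str]],
                      sole_source: str = "chart_events",
                      min_sources: int = 3) -> Tuple[float, float]:
    """Return (share of stays with >= min_sources indications,
               share of stays documented only by sole_source)."""
    total = len(stays)
    if total == 0:
        raise ValueError("no stays provided")
    multi = sum(1 for sources in stays.values() if len(sources) >= min_sources)
    sole = sum(1 for sources in stays.values() if sources == {sole_source})
    return multi / total, sole / total


# Toy stays (invented): each maps a stay ID to the sources indicating a PI.
toy_stays = {
    "stay_1": {"clinical_notes", "diagnosis_codes", "chart_events"},
    "stay_2": {"chart_events"},
    "stay_3": {"chart_events"},
    "stay_4": {"clinical_notes", "chart_events"},
}
multi_share, sole_share = summarize_sources(toy_stays)
```

On the toy data, one of four stays has three or more indications and two of four are documented by chart events alone.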
This study aimed to determine the hepatitis C virus (HCV) care cascade among persons who were born during 1945 to 1965 and received outpatient care on or after January 2014 at a large academic healthcare system. Deidentified electronic health record data in an existing research database were analyzed for this study. Laboratory test results for HCV antibody and HCV ribonucleic acid (RNA) indicated seropositivity and confirmatory testing. HCV genotyping was used as a proxy for linkage to care. A direct-acting antiviral (DAA) prescription indicated treatment initiation, and an undetectable HCV RNA at least 20 weeks after initiation of antiviral treatment indicated a sustained virologic response. Of the 121,807 patients in the 1945 to 1965 birth cohort who received outpatient care between January 1, 2014 and June 30, 2017, 3399 (3%) were screened for HCV, of whom 540 (16%) were seropositive. Among the seropositive patients, 442 (82%) had detectable HCV RNA, 68 (13%) had undetectable HCV RNA, and 30 (6%) lacked HCV RNA testing. Of the 442 viremic patients, 237 (54%) were linked to care, 65 (15%) initiated DAA treatment, and 32 (7%) achieved sustained virologic response. While only 3% were screened for HCV, the seroprevalence was high in the screened sample. Despite the established safety and efficacy of DAAs, only 15% initiated treatment during the study period. To achieve HCV elimination, improved HCV screening and linkage to HCV care and DAA treatment are needed.
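The percentages in the care cascade above use different denominators at each stage (the birth cohort, the screened patients, the seropositive patients, and the viremic patients). The short sketch below recomputes them from the counts reported in the abstract; the helper `cascade_percentages` and the step labels are hypothetical names used only for illustration.

```python
from typing import Dict, List, Tuple


def cascade_percentages(steps: List[Tuple[str, int, int]]) -> Dict[str, int]:
    """Map each step label to its count as a whole-number percent of its denominator."""
    return {label: round(100 * count / denom) for label, count, denom in steps}


# Counts as reported in the abstract; each tuple is (label, count, denominator).
steps = [
    ("screened", 3399, 121807),
    ("seropositive", 540, 3399),
    ("detectable_rna", 442, 540),
    ("undetectable_rna", 68, 540),
    ("no_rna_test", 30, 540),
    ("linked_to_care", 237, 442),
    ("initiated_daa", 65, 442),
    ("sustained_response", 32, 442),
]
percentages = cascade_percentages(steps)
```

Keeping the denominator explicit for every step makes it clear that, for example, the 15% treatment initiation figure is relative to viremic patients rather than to everyone screened.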
Sequential pattern mining can be used to extract meaningful sequences from electronic health records. However, conventional sequential pattern mining algorithms that discover all frequent sequential patterns can incur a high computational cost and be susceptible to noise in the observations. Approximate sequential pattern mining techniques have been introduced to address these shortcomings, yet existing approximate methods fail to reflect the true frequent sequential patterns or only target single-item event sequences. Multi-item event sequences are prominent in healthcare, as a patient can have multiple interventions for a single visit. To alleviate these issues, we propose GASP, a graph-based approximate sequential pattern mining method that discovers frequent patterns for multi-item event sequences. Our approach compresses the sequential information into a concise graph structure, which has computational benefits. The empirical results on two healthcare datasets suggest that GASP outperforms existing approximate models by improving recoverability and extracting better predictive patterns.
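To make the notion of a multi-item event sequence concrete, the sketch below checks whether a pattern occurs in a patient's visit sequence and computes its support over a small database. It does not implement GASP or its graph structure; it only illustrates the exact support definition that approximate methods try to recover. The function names and the toy visit codes are hypothetical.

```python
from typing import List, Set

Sequence_ = List[Set[str]]  # one set of items (e.g. interventions) per visit


def contains(sequence: Sequence_, pattern: Sequence_) -> bool:
    """True if each pattern itemset is a subset of a distinct visit, in order."""
    idx = 0
    for event in sequence:
        # Greedy earliest match is sufficient for subsequence containment.
        if idx < len(pattern) and pattern[idx] <= event:
            idx += 1
    return idx == len(pattern)


def support(database: List[Sequence_], pattern: Sequence_) -> float:
    """Fraction of sequences in the database that contain the pattern."""
    if not database:
        raise ValueError("empty database")
    return sum(contains(seq, pattern) for seq in database) / len(database)


# Toy database (invented): three patients, each with two visits.
toy_db = [
    [{"a", "b"}, {"c"}],
    [{"a"}, {"b", "c"}],
    [{"c"}, {"a", "b"}],
]
s_ordered = support(toy_db, [{"a"}, {"c"}])   # "a" in a visit before "c"
s_joint = support(toy_db, [{"a", "b"}])       # "a" and "b" in the same visit
```

The two queries differ only in whether the items must co-occur within a visit or follow each other across visits, which is the distinction that single-item methods cannot express.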
From electronic health records (EHRs), the relationship between patients' conditions, treatments, and outcomes can be discovered and used in various healthcare research tasks such as risk prediction. In practice, EHRs can be stored in one or more data warehouses, and mining from distributed data sources becomes challenging. Another challenge arises from privacy laws because patient data cannot be used without some patient privacy guarantees. Thus, in this paper, we propose a privacy-preserving framework using sequential pattern mining in distributed data sources. Our framework extracts patterns from each source and shares patterns with other sources to discover discriminative and representative patterns that can be used for risk prediction while preserving privacy. We demonstrate our framework using a case study of predicting cardiovascular disease in patients with type 2 diabetes and show the effectiveness of our framework with several sources and by applying differential privacy mechanisms.
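The abstract above does not say which differential privacy mechanisms were applied. As one plausible illustration, the sketch below adds Laplace noise to per-pattern patient counts before a source shares them. Everything here is an assumption: the function names, the choice of the Laplace mechanism, a sensitivity of 1 (each patient changes any single pattern count by at most one), and `epsilon` being the budget spent per released count.

```python
import random
from typing import Dict, Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_counts(counts: Dict[str, int], epsilon: float,
                     sensitivity: float = 1.0,
                     rng: Optional[random.Random] = None) -> Dict[str, float]:
    """Return noisy counts; epsilon is the privacy budget per released count."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon  # Laplace mechanism: scale = sensitivity / epsilon
    return {pattern: count + laplace_noise(scale, rng)
            for pattern, count in counts.items()}


# Toy pattern counts (invented) from a single hypothetical data source.
local_counts = {"metformin->statin": 120, "insulin->ace_inhibitor": 45}
noisy = privatize_counts(local_counts, epsilon=0.5, rng=random.Random(0))
```

Because a patient may contribute to several patterns, releasing many counts this way consumes budget for each one; a real deployment would need to account for that composition.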