Hospital-acquired pressure ulcer injury (PUI) is a primary nursing quality metric, reflecting the caliber of nursing care within a hospital. Prior studies have used the Braden scale and structured data from the electronic health record to detect or predict PUI, while the informative unstructured clinical notes have gone unused. We propose automated PUI detection using a novel negation-detection algorithm applied to unstructured clinical notes. Our detection framework runs on demand at minimal cost. Applied to the MIMIC-III dataset, the text features produced by our algorithm yielded improved PUI detection with logistic regression, random forests, and neural networks compared to text features without negation detection. Exploratory analysis reveals substantial overlap between key classifier features and leading clinical attributes of PUI, adding interpretability to our solution. Our method could also considerably reduce nurses' evaluation workload by automatically detecting most cases, leaving only the most uncertain ones for nursing assessment.
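As an illustration of how negation-aware text features can be derived from clinical notes, the sketch below applies a NegEx-style rule: a PUI-related term is marked negated if a negation cue appears within a few preceding tokens. The term list, cue list, window size, and function names are toy assumptions; the paper's actual negation-detection algorithm is not reproduced here.

```python
import re

# Hypothetical negation-aware feature extraction for clinical notes.
# Term and cue lists are illustrative; cue matching is deliberately naive.
NEGATION_CUES = {"no", "denies", "without", "negative for"}
PUI_TERMS = {"pressure ulcer", "pressure injury", "decubitus"}

def negation_aware_features(note: str, window: int = 5) -> dict:
    """Return {term: 1 affirmed, -1 negated, 0 absent} for each PUI term."""
    text = " ".join(re.findall(r"[a-z]+", note.lower()))
    features = {}
    for term in PUI_TERMS:
        features[term] = 0
        idx = text.find(term)
        if idx == -1:
            continue
        # Check the few tokens preceding the mention for a negation cue.
        preceding = " ".join(text[:idx].split()[-window:])
        features[term] = -1 if any(cue in preceding for cue in NEGATION_CUES) else 1
    return features

print(negation_aware_features("Skin intact, no pressure ulcer noted on admission."))
```

Features of this form (affirmed vs. negated vs. absent) can then be fed to a standard classifier such as logistic regression.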
Representation learning on static graph-structured data has had a significant impact on many real-world applications. However, less attention has been paid to the evolving nature of temporal networks, in which edges change over time. The embeddings of such temporal networks should encode both graph-structured information and the temporally evolving pattern. Existing approaches to learning temporally evolving network representations fail to capture this temporal interdependence. In this paper, we propose Toffee, a novel approach for temporal network representation learning based on tensor decomposition. Our method exploits the tensor-tensor product operator to encode cross-time information, so that periodic changes in the evolving networks can be captured. Experimental results demonstrate that Toffee outperforms existing methods on multiple real-world temporal networks in generating effective embeddings for link prediction tasks.
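The tensor-tensor product (t-product) mentioned above is typically computed by taking a Fourier transform along the third (temporal) mode, multiplying matching frontal slices, and transforming back. The minimal NumPy sketch below shows that construction on random tensors; it illustrates the operator only and is not Toffee's full learning pipeline.

```python
import numpy as np

def t_product(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """t-product of A (n1 x n2 x n3) and B (n2 x n4 x n3): FFT along the
    third mode, frontal-slice matrix products, then an inverse FFT."""
    n3 = A.shape[2]
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    C_hat = np.empty((A.shape[0], B.shape[1], n3), dtype=complex)
    for k in range(n3):
        C_hat[:, :, k] = A_hat[:, :, k] @ B_hat[:, :, k]
    return np.real(np.fft.ifft(C_hat, axis=2))

# Two random third-order tensors sharing the temporal (third) mode.
A = np.random.rand(4, 3, 5)
B = np.random.rand(3, 2, 5)
print(t_product(A, B).shape)  # (4, 2, 5)
```

Because the transform mixes information across the temporal slices, factorizations built on this operator can capture cross-time structure that slice-by-slice matrix factorizations miss.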
Tensor factorization has proven to be an efficient unsupervised learning approach for health data analysis, especially for computational phenotyping, where high-dimensional Electronic Health Records (EHRs) containing patients' histories of medical procedures, medications, diagnoses, lab tests, etc., are converted into meaningful and interpretable medical concepts. Federated tensor factorization distributes the tensor computation to multiple workers under the coordination of a central server, enabling the phenotypes to be learned jointly across multiple hospitals while preserving the privacy of patient information. However, existing federated tensor factorization algorithms suffer from a single point of failure in the central server, which is not only easily exposed to external attacks but also limits the number of clients that can share information under restricted uplink bandwidth. In this paper, we propose CiderTF, a communication-efficient decentralized generalized tensor factorization method that reduces uplink communication cost through a four-level communication reduction strategy designed for generalized tensor factorization, which offers the flexibility to model different tensor distributions with multiple kinds of loss functions. Experiments on two real-world EHR datasets demonstrate that CiderTF achieves comparable convergence while reducing communication by up to 99.99%.
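To give a flavor of uplink communication reduction in this setting, the sketch below compresses a factor-matrix update with top-k sparsification followed by 8-bit quantization before transmission. The function names and the specific combination of techniques are illustrative assumptions, not CiderTF's actual four-level strategy.

```python
import numpy as np

def compress_update(delta: np.ndarray, k: int):
    """Keep the top-k entries of an update by magnitude and quantize them to
    signed 8-bit integers; returns what a worker would transmit uplink."""
    flat = delta.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]       # indices of top-k magnitudes
    vals = flat[idx]
    scale = max(np.abs(vals).max(), 1e-12) / 127.0     # avoid division by zero
    q = np.round(vals / scale).astype(np.int8)         # 8-bit payload
    return idx, q, scale, delta.shape

def decompress_update(idx, q, scale, shape):
    """Reconstruct the sparse, dequantized update on the receiving side."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = q.astype(np.float64) * scale
    return flat.reshape(shape)

delta = np.random.randn(100, 10)                        # e.g., update of a 100 x 10 factor
approx = decompress_update(*compress_update(delta, k=50))
print(np.linalg.norm(delta - approx) / np.linalg.norm(delta))
```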
Schema matching aims to identify the correspondences among attributes of database schemas. It is frequently considered the most challenging and decisive stage in many contemporary web semantics and database systems. Low-quality algorithmic matchers fail to provide improvement, while manual annotation consumes extensive human effort. Further complications arise from data privacy in certain domains such as healthcare, where only schema-level matching should be used to prevent data leakage. To address this problem, we propose SMAT, a new deep learning model based on state-of-the-art natural language processing techniques that obtains semantic mappings between source and target schemas using only the attribute name and description. SMAT avoids directly encoding domain knowledge about the source and target systems, which allows it to be more easily deployed across different sites. We also introduce a new benchmark dataset, OMAP, based on real-world schema-level mappings from the healthcare domain. Our extensive evaluation on various benchmark datasets demonstrates the potential of SMAT to help automate schema-level matching tasks.
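As a toy illustration of schema-level matching from attribute names and descriptions alone, the sketch below scores source-target pairs with TF-IDF cosine similarity. The schemas are invented and the scoring is a simple lexical baseline, not SMAT's deep model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented source and target attributes: "name: description" strings only,
# no instance data, mirroring the schema-level setting described above.
source = [
    "pat_dob: patient date of birth",
    "adm_ts: timestamp of hospital admission",
]
target = [
    "birth_date: the date on which the patient was born",
    "admit_time: admission date and time for the encounter",
]

vec = TfidfVectorizer().fit(source + target)
S = cosine_similarity(vec.transform(source), vec.transform(target))

for i, s in enumerate(source):
    j = S[i].argmax()
    print(f"{s.split(':')[0]} -> {target[j].split(':')[0]} (score={S[i, j]:.2f})")
```

A learned model replaces the lexical similarity with a semantic one, which matters when corresponding attributes share almost no surface tokens.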
Sequential pattern mining can be used to extract meaningful sequences from electronic health records. However, conventional sequential pattern mining algorithms that discover all frequent sequential patterns can incur a high computational cost and be susceptible to noise in the observations. Approximate sequential pattern mining techniques have been introduced to address these shortcomings; yet existing approximate methods either fail to reflect the true frequent sequential patterns or target only single-item event sequences. Multi-item event sequences are prominent in healthcare, as a patient can receive multiple interventions during a single visit. To alleviate these issues, we propose GASP, a graph-based approximate sequential pattern mining method that discovers frequent patterns for multi-item event sequences. Our approach compresses the sequential information into a concise graph structure, which has computational benefits. Empirical results on two healthcare datasets suggest that GASP outperforms existing approximate models by improving recoverability and extracting better predictive patterns.
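The sketch below illustrates what "frequent sequential patterns over multi-item event sequences" means: each visit is an itemset, and a pattern is supported if its itemsets occur in order as subsets of the visits. The toy data and brute-force support counting are illustrative of the baseline cost that approximate, graph-based methods aim to avoid; this is not GASP itself.

```python
from itertools import combinations
from collections import Counter

# Each patient record is a sequence of visits; each visit is a set of events.
sequences = [
    [{"metformin"}, {"insulin", "statin"}, {"insulin"}],
    [{"metformin", "statin"}, {"insulin"}],
    [{"statin"}, {"metformin"}, {"insulin"}],
]

def supports_pattern(seq, pattern):
    """True if the itemsets of `pattern` occur, in order, as subsets of visits."""
    pos = 0
    for itemset in pattern:
        while pos < len(seq) and not itemset <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

# Candidate length-2 patterns built from single items (a deliberately tiny space).
items = sorted({e for seq in sequences for visit in seq for e in visit})
counts = Counter()
for a, b in combinations(items, 2):
    for pattern in ([{a}, {b}], [{b}, {a}]):
        key = (tuple(pattern[0]), tuple(pattern[1]))
        counts[key] = sum(supports_pattern(seq, pattern) for seq in sequences)

for pattern, support in counts.most_common(3):
    print(pattern, support)
```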
From electronic health records (EHRs), the relationships between patients' conditions, treatments, and outcomes can be discovered and used in various healthcare research tasks such as risk prediction. In practice, EHRs can be stored in one or more data warehouses, and mining from distributed data sources becomes challenging. Another challenge arises from privacy laws, under which patient data cannot be used without privacy guarantees. Thus, in this paper, we propose a privacy-preserving framework using sequential pattern mining in distributed data sources. Our framework extracts patterns from each source and shares patterns with other sources to discover discriminative and representative patterns that can be used for risk prediction while preserving privacy. We demonstrate our framework using a case study of predicting cardiovascular disease in patients with type 2 diabetes and show its effectiveness across several sources and under differential privacy mechanisms.
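One standard way to privatize shared pattern statistics, in the spirit of the differential privacy mechanisms mentioned above, is the Laplace mechanism: each site adds calibrated noise to its local pattern supports before sharing them. The sketch below is a minimal illustration under the assumption of sensitivity 1 (one patient changes any count by at most 1); it is not the framework's full protocol.

```python
import numpy as np

def privatize_counts(counts: dict, epsilon: float, sensitivity: float = 1.0) -> dict:
    """Laplace mechanism: add Laplace(sensitivity / epsilon) noise to each count."""
    scale = sensitivity / epsilon
    return {p: c + np.random.laplace(0.0, scale) for p, c in counts.items()}

# Toy local pattern supports at one site, perturbed before sharing.
local_counts = {("metformin", "insulin"): 42, ("statin", "insulin"): 17}
print(privatize_counts(local_counts, epsilon=1.0))
```

Smaller epsilon gives stronger privacy but noisier shared counts, which is the trade-off any downstream risk-prediction model has to absorb.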
Systematic review (SR) is an essential process to identify, evaluate, and summarize the findings of all relevant individual studies concerning health-related questions. However, conducting an SR is labor-intensive, as identifying relevant studies is a daunting process that entails multiple researchers screening thousands of articles for relevance. In this paper, we propose MMiDaS-AE, a Multi-modal Missing Data aware Stacked Autoencoder, for semi-automating screening for SRs. We use a multi-modal view that exploits three representations: 1) documents, 2) topics, and 3) citation networks. Documents that contain similar words will be nearby in the document embedding space. Models can also exploit the relationship between documents and the associated SR MeSH terms to capture article relevancy. Finally, related works will likely share citations, and thus closely related articles should, intuitively, lie close to each other in the embedding space. However, using all three learned representations directly as features results in an unwieldy number of parameters. Thus, motivated by recent work on multi-modal autoencoders, we adopt a multi-modal stacked autoencoder that learns a shared representation encoding all three modalities in a compressed space. In practice, however, one or more of these modalities may be missing for an article (e.g., if we cannot recover citation information). Therefore, we propose to learn to impute the shared representation even when specific inputs are missing. We find this new model significantly improves performance compared to existing approaches on a dataset consisting of 15 SRs.
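A minimal sketch of a multi-modal autoencoder with missing-modality handling is shown below: each modality gets its own encoder, the per-modality codes are fused into a shared representation together with a presence mask, and a missing modality is zeroed out. Dimensions, layer sizes, and the masking scheme are illustrative assumptions rather than the exact MMiDaS-AE architecture.

```python
import torch
import torch.nn as nn

class MultiModalAE(nn.Module):
    """Three modality encoders -> shared code -> three decoders (toy sizes)."""
    def __init__(self, dims=(300, 50, 128), shared=64):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(d, shared) for d in dims])
        self.fuse = nn.Linear(shared * len(dims) + len(dims), shared)  # + presence mask
        self.decoders = nn.ModuleList([nn.Linear(shared, d) for d in dims])

    def forward(self, modalities, mask):
        # modalities: list of (batch, d_i) tensors; mask: (batch, 3) in {0, 1}
        codes = [torch.relu(enc(x)) * mask[:, i:i + 1]
                 for i, (enc, x) in enumerate(zip(self.encoders, modalities))]
        z = torch.relu(self.fuse(torch.cat(codes + [mask], dim=1)))   # shared representation
        return z, [dec(z) for dec in self.decoders]

model = MultiModalAE()
doc = torch.randn(8, 300)            # document embedding
topic = torch.randn(8, 50)           # topic proportions
cite = torch.zeros(8, 128)           # citation embedding, missing for this batch
mask = torch.tensor([[1., 1., 0.]]).repeat(8, 1)
z, recons = model([doc, topic, cite], mask)
print(z.shape, [r.shape for r in recons])
```

Training the decoders to reconstruct all modalities, including absent ones, is what lets the shared code act as an imputation of missing inputs.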
Background: Diabetes and hypertension disparities are pronounced among South Asians. There is regional variation in the prevalence of diabetes and hypertension in the US, but it is unknown whether there is variation among South Asians living in the US. The objective of this study was to compare the burden of diabetes and hypertension between South Asian patients receiving care in the health systems of two US cities. Methods: Cross-sectional analyses were performed using electronic health records (EHR) for 90,137 South Asians receiving care at New York University Langone in New York City (NYC) and 28,868 South Asians receiving care at Emory University (Atlanta). Diabetes was defined as having 2+ encounters with a diagnosis of diabetes, having a diabetes medication prescribed (excluding Acarbose/Metformin), or having 2+ abnormal A1C levels (≥ 6.5%) and 1+ encounter with a diagnosis of diabetes. Hypertension was defined as having 3+ BP readings of systolic BP ≥ 130 mmHg or diastolic BP ≥ 80 mmHg, 2+ encounters with a diagnosis of hypertension, or having an anti-hypertensive medication prescribed. Results: Among South Asian patients at these two large, private health systems, age-adjusted diabetes burden was 10.7% in NYC compared to 6.7% in Atlanta. Age-adjusted hypertension burden was 20.9% in NYC compared to 24.7% in Atlanta. In Atlanta, 75.6% of those with diabetes had comorbid hypertension compared to 46.2% in NYC. Conclusions: These findings suggest differences by region and sex in diabetes and hypertension risk. Additionally, these results call for better characterization of race/ethnicity in EHRs to identify ethnic subgroup variation, as well as intervention studies to reduce lifestyle exposures that underlie the elevated risk for type 2 diabetes and hypertension development in South Asians.
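The diabetes definition above is a rule over per-patient counts and flags and could be implemented roughly as follows. The column names and toy data are hypothetical, and a real implementation would also apply the medication exclusions and the corresponding hypertension criteria.

```python
import pandas as pd

# Toy per-patient summaries; columns are hypothetical stand-ins for values
# derived from diagnosis, medication, and lab tables in the EHR.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "dm_dx_encounters": [2, 0, 1],     # encounters with a diabetes diagnosis
    "dm_rx": [False, True, False],     # diabetes medication (excl. acarbose/metformin)
    "abnormal_a1c_count": [0, 0, 2],   # A1C results >= 6.5%
})

# Diabetes: 2+ diagnosis encounters, OR a qualifying prescription,
# OR 2+ abnormal A1C values together with 1+ diagnosis encounter.
patients["diabetes"] = (
    (patients["dm_dx_encounters"] >= 2)
    | patients["dm_rx"]
    | ((patients["abnormal_a1c_count"] >= 2) & (patients["dm_dx_encounters"] >= 1))
)
print(patients[["patient_id", "diabetes"]])
```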
Background: Unstructured data from clinical epidemiological studies can be valuable and easy to obtain. However, it requires further extraction and processing for data analysis. Doing this manually is labor-intensive, slow, and error-prone. In this study, we propose an automation framework for extracting and processing unstructured data. Methods: The proposed automation framework consisted of two natural language processing (NLP) based tools for unstructured text data on medications and reasons for medication use. We first checked spelling using a spell-check program trained on publicly available knowledge sources and then applied NLP techniques. We mapped medication names to generic names using vocabulary from publicly available knowledge sources. We used WHO's Anatomical Therapeutic Chemical (ATC) classification system to map generic medication names to medication classes. We processed the reasons for medication use with the Lancaster stemmer method and then grouped and mapped them to disease classes based on organ systems. Finally, we demonstrated this automation framework on two data sources for Myalgic Encephalomyelitis/Chronic Fatigue Syndrome (ME/CFS): tertiary-based (n = 378) and population-based (n = 664) samples. Results: A total of 8681 raw medication records were used for this demonstration. The 1266 distinct medication names (omitting supplements) were condensed to 89 ATC classification system categories. The 1432 distinct raw reasons for medication use were condensed to 65 categories via NLP. Compared to completing the entire process manually, our automation process reduced the number of terms requiring manual mapping by 84.4% for medications and 59.4% for reasons for medication use. Additionally, this process improved the precision of the mapped results. Conclusions: Our automation framework demonstrates the usefulness of NLP strategies even when there is no established mapping database. For a less established database (e.g., reasons for medication use), the method is easily modifiable as new knowledge sources for mapping are introduced. The capability to condense large feature sets into interpretable ones will be valuable for subsequent analytical studies involving techniques such as machine learning and data mining.
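A toy version of the mapping steps described above (spell correction against a known vocabulary, brand-to-generic mapping, generic-to-ATC mapping, and Lancaster stemming of free-text reasons) might look like the sketch below. The dictionaries are tiny illustrative stand-ins for the public knowledge sources used in the study, and the fuzzy matching stands in for the actual spell-check program.

```python
import difflib
from nltk.stem import LancasterStemmer

# Tiny illustrative vocabularies; real ones come from public knowledge sources.
VOCAB = ["ibuprofen", "metformin", "sertraline"]
BRAND_TO_GENERIC = {"advil": "ibuprofen", "zoloft": "sertraline"}
GENERIC_TO_ATC = {"ibuprofen": "M01A", "metformin": "A10B", "sertraline": "N06A"}

stemmer = LancasterStemmer()

def normalize_medication(raw: str) -> str:
    name = BRAND_TO_GENERIC.get(raw.strip().lower(), raw.strip().lower())
    # Spell-check stand-in: snap to the closest known name if similar enough.
    match = difflib.get_close_matches(name, VOCAB, n=1, cutoff=0.8)
    name = match[0] if match else name
    return GENERIC_TO_ATC.get(name, "unmapped")

def normalize_reason(raw: str) -> str:
    # Lancaster stemming so that variants of a reason collapse together.
    return " ".join(stemmer.stem(tok) for tok in raw.lower().split())

print(normalize_medication("metfromin"))    # misspelled input -> A10B
print(normalize_reason("chronic headaches"))
```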
Background: Researchers are developing methods to automatically extract clinically relevant and useful patient characteristics from raw healthcare datasets. These characteristics, often capturing essential properties of patients with common medical conditions, are called computational phenotypes. Because they are generated by automated or semiautomated data-driven methods, such potential phenotypes need to be validated as clinically meaningful (or not) before they are acceptable for use in decision making. Objective: The objective of this study was to present the Phenotype Instance Verification and Evaluation Tool (PIVET), a framework that uses co-occurrence analysis on an online corpus of publicly available medical journal articles to build clinical relevance evidence sets for user-supplied phenotypes. PIVET adopts a conceptual framework similar to the pioneering prototype tool PheKnow-Cloud, which was developed for the phenotype validation task. PIVET completely refactors each part of the PheKnow-Cloud pipeline to deliver vast improvements in speed without sacrificing the quality of the insights PheKnow-Cloud achieved. Methods: PIVET leverages indexing in NoSQL databases to efficiently generate evidence sets. Specifically, PIVET uses a succinct representation of the phenotypes that corresponds to the index on the corpus database and an optimized co-occurrence algorithm inspired by the Aho-Corasick algorithm. We compare PIVET's phenotype representation with PheKnow-Cloud's using PheKnow-Cloud's experimental setup. Within PIVET's framework, we also introduce a statistical model trained on domain expert-verified phenotypes to automatically classify phenotypes as clinically relevant or not. Additionally, we show how the classification model can be used to examine user-supplied phenotypes in an online, rather than batch, manner. Results: PIVET maintains the discriminative power of PheKnow-Cloud in identifying clinically relevant phenotypes for the same corpus with which PheKnow-Cloud was originally developed, but PIVET's analysis is an order of magnitude faster than PheKnow-Cloud's. Not only is PIVET much faster, it also scales to a larger corpus while retaining its speed. We evaluated multiple classification models on top of the PIVET framework and found ridge regression to perform best, achieving an average F1 score of 0.91 when predicting clinically relevant phenotypes. Conclusions: Our study shows that PIVET improves on the most notable existing computational tool for phenotype validation in terms of speed and automation and is comparable in terms of accuracy.
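The sketch below illustrates the kind of co-occurrence evidence described above: count how often pairs of phenotype items appear in the same article and compare the joint frequency with what independence would predict (a lift-style score). The corpus and phenotype are invented, and PIVET's indexed, Aho-Corasick-inspired matcher is not reproduced here.

```python
from itertools import combinations

# Invented three-document "corpus" and a toy candidate phenotype.
corpus = [
    "heart failure patients often receive furosemide and lisinopril",
    "furosemide prescribed for heart failure with reduced ejection fraction",
    "lisinopril indicated for hypertension management",
]
phenotype = ["heart failure", "furosemide", "lisinopril"]

n = len(corpus)
doc_hits = {item: {i for i, doc in enumerate(corpus) if item in doc} for item in phenotype}

for a, b in combinations(phenotype, 2):
    joint = len(doc_hits[a] & doc_hits[b]) / n
    expected = (len(doc_hits[a]) / n) * (len(doc_hits[b]) / n)
    lift = joint / expected if expected else 0.0
    print(f"{a} & {b}: lift={lift:.2f}")
```

Pairs of phenotype items that co-occur more often than chance across a large article corpus provide the kind of evidence a downstream classifier (such as the ridge regression model mentioned above) can aggregate into a clinical-relevance judgment.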