Patient similarity measurement is an important tool for cohort identification in clinical decision support applications. A reliable similarity metric can be used to derive diagnostic or prognostic information about a target patient from other patients with similar trajectories of health-care events. However, measuring the similarity of care trajectories is challenging because of the irregular measurement intervals inherent in health care. To address this challenge, we propose a novel temporal similarity measure for patients based on irregularly measured laboratory test data from the Multiparameter Intelligent Monitoring in Intensive Care database and the pediatric Intensive Care Unit (ICU) database of Children's Healthcare of Atlanta. This similarity measure, adapted from the Smith-Waterman algorithm, identifies patients who share sequentially similar laboratory results separated by time intervals of similar length. We demonstrate the predictive power of our method: patients with higher similarity in their earlier histories are likely to have higher similarity in their later histories. In addition, compared with non-temporal measures, our method is stronger at predicting mortality in ICU patients diagnosed with acute kidney injury and sepsis.
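Below is a minimal sketch, not the authors' implementation, of the Smith-Waterman-style idea described above: a local alignment in which each laboratory event is scored both on how close the values are and on how similar the time gaps between measurements are. The (value, time-gap) representation and the scoring constants are assumptions for illustration.

```python
# Minimal sketch (not the authors' implementation): a Smith-Waterman-style
# local alignment in which each event is a (lab_value, hours_since_previous)
# pair, so the score rewards similar results separated by similarly long gaps.
import numpy as np

def temporal_similarity(seq_a, seq_b, gap_penalty=-1.0, value_scale=1.0, time_scale=24.0):
    """seq_a, seq_b: lists of (value, delta_t_hours) tuples. Returns the best
    local-alignment score, as in Smith-Waterman (higher = more similar)."""
    n, m = len(seq_a), len(seq_b)
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            va, ta = seq_a[i - 1]
            vb, tb = seq_b[j - 1]
            # Match score decreases with differences in lab value and in time gap.
            match = 2.0 - abs(va - vb) / value_scale - abs(ta - tb) / time_scale
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + match,    # align the two events
                          H[i - 1, j] + gap_penalty,  # skip an event in seq_a
                          H[i, j - 1] + gap_penalty)  # skip an event in seq_b
    return H.max()

# Example: two creatinine trajectories, values in mg/dL, gaps in hours.
print(temporal_similarity([(1.0, 0), (1.4, 24), (2.1, 24)],
                          [(1.1, 0), (1.5, 20), (2.0, 26)]))
```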
The HIV-1 envelope (Env) spike, which consists of a compact, heterodimeric trimer of the glycoproteins gp120 and gp41, is the target of neutralizing antibodies. However, the high mutation rate of HIV-1 and the plasticity of Env facilitate viral evasion of neutralizing antibodies through various mechanisms. Mutations that are distant from the antibody binding site can lead to escape, probably by changing the conformation or dynamics of Env; however, these changes are difficult to identify and define mechanistically. Here we describe a network analysis-based approach to identify potential allosteric immune evasion mechanisms using three known HIV-1 Env gp120 protein structures from two different clades, B and C. First, correlation and principal component analyses of molecular dynamics (MD) simulations identified a high degree of long-distance coupled motions between functionally distant regions within the intrinsic dynamics of the gp120 core, supporting the presence of long-distance communication in the protein. Then, by integrating MD simulations with network theory, we identified the optimal and suboptimal communication pathways and modules within the gp120 core. The results unveil both strain-dependent and strain-independent characteristics of the communication pathways in gp120. We show that, within the context of three structurally homologous gp120 cores, the optimal pathway for communication is sequence sensitive, i.e., a suboptimal pathway in one strain becomes the optimal pathway in another strain. Yet the identification of conserved elements within these communication pathways, termed inter-modular hotspots, could present a new opportunity for immunogen design, as this communication could be an additional mechanism that HIV-1 uses to shield vulnerable antibody targets in Env that induce neutralizing antibody breadth.
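As a rough illustration of this workflow (not the authors' pipeline), the snippet below builds a residue correlation network from trajectory coordinates, extracts an optimal communication path with Dijkstra's algorithm, and partitions the network into modules. The placeholder trajectory, contact cutoff, and edge weighting are assumptions.

```python
# Illustrative sketch of a dynamic-network analysis of an MD trajectory:
# residue-residue correlations become edge weights, and "communication" is
# read off shortest (optimal) paths and community modules. The trajectory
# here is a synthetic placeholder, not a gp120 simulation.
import numpy as np
import networkx as nx
from networkx.algorithms import community

def correlation_network(coords, contact_cutoff=8.0):
    frames, n_res, _ = coords.shape
    disp = coords - coords.mean(axis=0)                    # per-residue fluctuations
    dots = np.einsum('fik,fjk->ij', disp, disp) / frames   # covariance of fluctuations
    norms = np.sqrt(np.diag(dots))
    C = dots / np.outer(norms, norms)                      # normalized cross-correlation
    mean_xyz = coords.mean(axis=0)
    dist = np.linalg.norm(mean_xyz[:, None, :] - mean_xyz[None, :, :], axis=-1)
    G = nx.Graph()
    for i in range(n_res):
        for j in range(i + 1, n_res):
            if dist[i, j] < contact_cutoff and abs(C[i, j]) > 1e-6:
                # Stronger correlation -> shorter edge, so Dijkstra follows coupled routes.
                G.add_edge(i, j, weight=-np.log(abs(C[i, j])))
    return G

rng = np.random.default_rng(0)
backbone = np.stack([np.arange(50) * 3.8, np.zeros(50), np.zeros(50)], axis=1)
coords = backbone[None, :, :] + rng.normal(scale=0.5, size=(100, 50, 3))

G = correlation_network(coords)
optimal_path = nx.dijkstra_path(G, 0, 49, weight='weight')              # optimal pathway
modules = community.greedy_modularity_communities(G, weight='weight')   # network modules
print(len(optimal_path), len(modules))
```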
Association rule mining has been used extensively in many areas because of its ability to discover relationships among variables in large databases. One main drawback, however, is that it generates a large number of rules and does not guarantee that the rules are meaningful in the real world. Many visualization techniques have been proposed for association rules. These techniques were designed to provide a global overview of all rules so that the most meaningful rules can be identified. However, using these visualization techniques to search for specific rules becomes challenging, especially when the volume of rules is extremely large.
In this study, we have developed an interactive association rule visualization technique, called InterVisAR, specifically designed for effective rule search. We conducted a user study with 24 participants, and the results demonstrated that InterVisAR provides an efficient and accurate visualization solution. We also verified that InterVisAR satisfies a non-factorial property that should be guaranteed when performing rule search. All participants also expressed a strong preference for InterVisAR, as it provides a more comfortable and pleasing visualization for association rule search compared with table-based rule search.
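As a self-contained illustration of the rule-volume problem these techniques address (not code from the paper), the sketch below enumerates association rules over a toy transaction set and then performs a simple attribute-specific rule search; item names and thresholds are placeholders.

```python
# Minimal, self-contained sketch of why rule search is hard: even a tiny
# transaction set yields many candidate rules, which is the volume problem
# that interactive rule search is designed to address.
from itertools import combinations

transactions = [
    {'diabetes', 'hypertension', 'statin'},
    {'diabetes', 'statin'},
    {'hypertension', 'statin', 'aspirin'},
    {'diabetes', 'hypertension', 'aspirin'},
    {'hypertension', 'aspirin'},
]
items = sorted(set.union(*transactions))

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate rules A -> B over all itemsets, keeping those above the thresholds.
rules = []
for size in range(2, len(items) + 1):
    for itemset in map(set, combinations(items, size)):
        if support(itemset) < 0.3:
            continue
        for k in range(1, size):
            for antecedent in map(set, combinations(sorted(itemset), k)):
                confidence = support(itemset) / support(antecedent)
                if confidence >= 0.6:
                    rules.append((antecedent, itemset - antecedent,
                                  support(itemset), confidence))

print(f'{len(rules)} rules from only {len(transactions)} transactions')
# "Rule search": show only rules whose antecedent mentions a specific item.
for a, b, s, c in rules:
    if 'diabetes' in a:
        print(sorted(a), '->', sorted(b), f'support={s:.2f} confidence={c:.2f}')
```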
In biomedical data analysis, inferring the cause of death is a challenging and important task; it is useful both for public health reporting and for improving patients' quality of care by identifying more severe conditions. Causal inference, however, is notoriously difficult. Traditional causal inference relies mainly on analyzing data collected from specifically designed experiments, which is expensive and limited to particular disease cohorts, making the approach less generalizable. In this paper, we adopt a novel data-driven perspective to analyze and improve the death reporting process and to assist physicians in identifying the single underlying cause of death. To achieve this, we build state-of-the-art deep learning models, convolutional neural networks (CNNs), and achieve around 75% accuracy in predicting the single underlying cause of death from a list of relevant medical conditions. We also provide interpretations of these black-box neural network models so that death-reporting physicians can apply them with a better understanding of how they work.
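A hedged sketch of the kind of model described, a 1-D convolutional network over a sequence of reported condition codes that outputs a single underlying cause of death, is shown below; the vocabulary sizes, sequence length, and hyperparameters are placeholders rather than the paper's configuration.

```python
# Sketch of a 1-D CNN over condition codes predicting the underlying cause of death.
# All sizes and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

class CauseOfDeathCNN(nn.Module):
    def __init__(self, n_codes=5000, n_causes=300, emb_dim=64, n_filters=128):
        super().__init__()
        self.embed = nn.Embedding(n_codes, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.out = nn.Linear(n_filters, n_causes)

    def forward(self, codes):                      # codes: (batch, max_len) int tensor
        x = self.embed(codes).transpose(1, 2)      # (batch, emb_dim, max_len)
        x = torch.relu(self.conv(x))               # (batch, n_filters, max_len)
        x = x.max(dim=2).values                    # global max-pool over the code sequence
        return self.out(x)                         # unnormalized cause-of-death scores

model = CauseOfDeathCNN()
batch = torch.randint(1, 5000, (8, 20))            # 8 records, 20 condition codes each
logits = model(batch)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 300, (8,)))
print(logits.shape, loss.item())
```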
by
Lucy B Spalluto;
Jennifer A Lewis;
Deonn Stolldorf;
Vivian M Yeh;
Carol Callaway-Lane;
Renda S Wiener;
Christopher G Slatore;
David F Yankelevitz;
Claudia Henschke;
Timothy J Vogus;
Pierre P Massion;
Drew Moghanaki;
Christianne L Roumie
Objectives: Lung cancer has the highest cancer-related mortality in the United States and among Veterans. Screening of high-risk individuals with low-dose CT (LDCT) can improve survival through detection of early-stage lung cancer. Organizational factors that aid or impede implementation of this evidence-based practice in diverse populations are not well described. We evaluated organizational readiness for change and change valence (belief that change is beneficial and valuable) for implementation of LDCT screening. Methods: We performed a cross-sectional survey of providers, staff, and administrators in radiology and primary care at a single Veterans Affairs Medical Center. Survey measures included Shea's validated Organizational Readiness for Implementing Change (ORIC) scale and Shea's 10 items to assess change valence. ORIC and change valence were scored on a scale from 1 to 7 (higher scores representing higher readiness for change or valence). Multivariable linear regressions were conducted to determine predictors of ORIC and change valence. Results: Of 523 employees contacted, 282 completed survey items (53.9% overall response rate). Higher ORIC scores were associated with radiology versus primary care (mean 5.48, SD 1.42 versus 5.07, SD 1.22, β = 0.37, P = .039). Self-identified leaders in lung cancer screening had both higher ORIC (5.56, SD 1.39 versus 5.11, SD 1.26, β = 0.43, P = .050) and change valence scores (5.89, SD 1.21 versus 5.36, SD 1.19, β = 0.51, P = .012). Discussion: Radiology health professionals have higher levels of readiness for change for implementation of LDCT screening than those in primary care. Understanding health professionals’ behavioral determinants for change can inform future lung cancer screening implementation strategies.
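For readers who want to reproduce this style of analysis, the sketch below fits a multivariable linear regression of ORIC scores on department and self-identified leadership using synthetic data; the variable names and data are illustrative assumptions, not the study dataset.

```python
# Illustrative sketch only: a multivariable linear regression of ORIC scores
# on department and self-identified leadership. The data frame is synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 282
df = pd.DataFrame({
    'oric': rng.normal(5.2, 1.3, n).clip(1, 7),                 # 1-7 ORIC score
    'department': rng.choice(['radiology', 'primary_care'], n),
    'leader': rng.choice([0, 1], n, p=[0.8, 0.2]),              # self-identified leader
})
model = smf.ols('oric ~ C(department) + leader', data=df).fit()
print(model.params)        # beta coefficients for each predictor
print(model.pvalues)       # p-values, as reported for ORIC and change valence
```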
Breast cancer is the most prevalent and among the deadliest cancers in females. Patients with breast cancer have highly variable survival rates, indicating a need to identify prognostic biomarkers. Integrating multi-omics data (e.g., gene expression, DNA methylation, miRNA expression, and copy number variations (CNVs)) is likely to improve the accuracy of patient survival predictions compared with predictions based on a single data modality. Therefore, we developed a machine learning pipeline that uses decision-level integration of multi-omics tumor data from The Cancer Genome Atlas (TCGA) to predict the overall survival of breast cancer patients.
With multi-omics data consisting of gene expression, methylation, miRNA expression, and CNVs, the top-performing model predicted survival with an accuracy of 85% and an area under the curve (AUC) of 87%. Furthermore, the model identified which modalities contributed most to prediction performance, with methylation, miRNA expression, and gene expression forming the best integrated classification combination. Our method not only recapitulated several breast cancer-specific prognostic biomarkers previously reported in the literature but also yielded several novel biomarkers. Further analysis of these biomarkers could provide insight into the molecular mechanisms that lead to poor survival.
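The following sketch illustrates decision-level (late) integration in the spirit of the pipeline described above, training one classifier per omics modality and averaging their predicted probabilities; the random placeholder matrices and model choices are assumptions, not the actual TCGA pipeline.

```python
# Minimal sketch of decision-level (late) integration: one classifier per omics
# modality, with predicted probabilities averaged for the final survival call.
# Feature matrices are random placeholders standing in for TCGA-derived data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_patients = 200
modalities = {                         # placeholder feature matrices per modality
    'expression':  rng.normal(size=(n_patients, 500)),
    'methylation': rng.normal(size=(n_patients, 400)),
    'mirna':       rng.normal(size=(n_patients, 100)),
}
y = rng.integers(0, 2, n_patients)     # 1 = survived past the chosen cutoff

idx_train, idx_test = train_test_split(np.arange(n_patients), test_size=0.3, random_state=0)

probas = []
for name, X in modalities.items():
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X[idx_train], y[idx_train])
    probas.append(clf.predict_proba(X[idx_test])[:, 1])   # per-modality decision

fused = np.mean(probas, axis=0)        # decision-level integration (soft vote)
print('fused AUC:', roc_auc_score(y[idx_test], fused))
```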
by
Denise Bos;
Sophronia Yu;
Jason Luong;
Philip Chu;
Yifei Wang;
Andrew J Einstein;
Jay Starkey;
Bradley N Delman;
Phuong-Anh T Duong;
Marco Das;
Sebastian Schindera;
Allen R Goode;
Fiona MacLeod;
Axel Wetter;
Rebecca Neill;
Ryan K Lee;
Jodi Roehm;
James A Seibert;
Luisa F Cervantes;
Nima Kasraie;
Pavlina Pike;
Anokh Pahwa;
Cécile RLPN Jeukens;
Rebecca Smith-Bindman
Objectives: The European Society of Radiology identified 10 common indications for computed tomography (CT) as part of the European Study on Clinical Diagnostic Reference Levels (DRLs, EUCLID) to help standardize radiation doses. The objective of this study is to generate DRLs and median doses for these indications using data from the UCSF CT International Dose Registry. Methods: Standardized data on 3.7 million CTs in adults were collected between 2016 and 2019 from 161 institutions across seven countries (United States of America (US), Switzerland, the Netherlands, Germany, the UK, Israel, Japan). DRLs (75th percentile) and median doses for the volumetric CT dose index (CTDIvol) and dose-length product (DLP) were assessed for each EUCLID category (chronic sinusitis, stroke, cervical spine trauma, coronary calcium scoring, lung cancer, pulmonary embolism, coronary CT angiography, hepatocellular carcinoma (HCC), colic/abdominal pain, appendicitis), and US radiation doses were compared with European doses. Results: The number of CT scans within EUCLID categories ranged from 8,933 (HCC) to over 1.2 million (stroke). There was greater variation in dose between categories than within categories (p < .001), and doses were significantly different between categories within anatomic areas. DRLs and median doses were assessed for all categories. DRLs were higher in the US than in Europe for 9 of the 10 indications (all except chronic sinusitis), albeit with a significantly larger sample size in the US. Conclusions: DRLs for CTDIvol and DLP for the EUCLID clinical indications were established from diverse organizations and can contribute to dose optimization. These values were usually significantly higher in the US than in Europe. Key Points: • Registry data were used to create benchmarks for 10 common indications for CT identified by the European Society of Radiology. • Observed US radiation doses were higher than European doses for 9 of 10 indications (except chronic sinusitis). • The presented diagnostic reference levels and median doses highlight potentially unnecessary variation in radiation dose.
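Because a DRL is the 75th percentile of the dose distribution for an indication, the calculation itself can be sketched in a few lines; the data frame below is synthetic and its column names are assumptions, not registry fields.

```python
# Sketch of how DRLs and median doses are derived: group exams by clinical
# indication and take the median and 75th percentile of CTDIvol and DLP.
# The data are synthetic placeholders, not registry data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
exams = pd.DataFrame({
    'indication': rng.choice(['stroke', 'pulmonary_embolism', 'appendicitis'], 5000),
    'ctdi_vol_mgy': rng.lognormal(mean=2.5, sigma=0.5, size=5000),   # CTDIvol in mGy
    'dlp_mgy_cm': rng.lognormal(mean=6.0, sigma=0.5, size=5000),     # DLP in mGy*cm
})

grouped = exams.groupby('indication')[['ctdi_vol_mgy', 'dlp_mgy_cm']]
medians = grouped.median()
drls = grouped.quantile(0.75)          # DRL = 75th percentile of the dose distribution
print('Median doses:\n', medians.round(1))
print('DRLs (75th percentile):\n', drls.round(1))
```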
The increasing accumulation of healthcare data provides researchers with ample opportunities to build machine learning approaches for clinical decision support and to improve the quality of health care. Several studies have developed conventional machine learning approaches that rely heavily on manual feature engineering and result in task-specific models for health care. In contrast, healthcare researchers have begun to use deep learning, which has emerged as a revolutionary machine learning technique that obviates manual feature engineering while still achieving impressive results in research fields such as image classification. However, few of these studies have addressed the lack of interpretability of deep learning models, although interpretability is essential for the successful adoption of machine learning approaches by healthcare communities.
In addition, the unique characteristics of healthcare data, such as high dimensionality and temporal dependencies, pose challenges for model building. To address these challenges, we develop a gated recurrent unit-based recurrent neural network with hierarchical attention for mortality prediction and evaluate it using diagnostic codes from the Medical Information Mart for Intensive Care. We find that the model outperforms baseline models in prediction accuracy, and we demonstrate its interpretability through visualizations.
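A hedged sketch of such a model, a GRU over visit embeddings with attention at both the code level and the visit level, is given below; dimensions, vocabulary size, and the attention form are placeholders rather than the authors' architecture, but the returned attention weights show how interpretability visualizations can be produced.

```python
# Sketch of a GRU-based RNN with hierarchical attention over diagnosis codes
# and visits for mortality prediction. All sizes are illustrative placeholders.
import torch
import torch.nn as nn

class HierAttnGRU(nn.Module):
    def __init__(self, n_codes=2000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_codes, emb, padding_idx=0)
        self.code_attn = nn.Linear(emb, 1)      # attention over codes within a visit
        self.gru = nn.GRU(emb, hidden, batch_first=True)
        self.visit_attn = nn.Linear(hidden, 1)  # attention over visits in the record
        self.out = nn.Linear(hidden, 1)

    def forward(self, codes):                   # codes: (batch, n_visits, n_codes_per_visit)
        e = self.embed(codes)                   # (batch, visits, codes, emb)
        a_code = torch.softmax(self.code_attn(e), dim=2)
        visits = (a_code * e).sum(dim=2)        # visit embeddings: (batch, visits, emb)
        h, _ = self.gru(visits)                 # (batch, visits, hidden)
        a_visit = torch.softmax(self.visit_attn(h), dim=1)
        patient = (a_visit * h).sum(dim=1)      # patient representation
        return torch.sigmoid(self.out(patient)).squeeze(-1), a_code, a_visit

model = HierAttnGRU()
batch = torch.randint(1, 2000, (4, 5, 10))      # 4 patients, 5 visits, 10 codes each
risk, code_weights, visit_weights = model(batch)
print(risk.shape)                               # attention weights support visualization
```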
Background
Population inference is an important problem in genetics used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual’s genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters, P and Q, of this binomial likelihood model can be inferred using slow sampling methods such as Markov chain Monte Carlo, or faster gradient-based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model motivated by a Euclidean interpretation of the genotype feature space. This results in a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning.
Results
We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The least-squares algorithm performs nearly as well as Admixture in these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, a small number of samples, or a greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than least-squares; the least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. Significantly, the least-squares approach nearly always converges 1.5 to 6 times faster.
Conclusions
The computational advantage of the least-squares approach along with its good estimation performance warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance between all algorithms decreases. In addition, when prior information is known, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate.
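To make the least-squares formulation concrete, the sketch below approximates a 0/1/2 genotype matrix G by 2·Q·P using an alternating, projected least-squares loop; this illustrative loop is not the paper's algorithm, and the simulation parameters are placeholders.

```python
# Hedged sketch of the least-squares idea: approximate the genotype matrix
# G (individuals x SNPs, entries 0/1/2) by 2*Q@P, where rows of Q are ancestry
# proportions and P holds per-population allele frequencies. This alternating,
# projected least-squares loop is illustrative, not the paper's algorithm.
import numpy as np

def ls_admixture(G, K, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n, m = G.shape
    F = G / 2.0                                   # observed allele dosage fractions
    Q = rng.dirichlet(np.ones(K), size=n)         # ancestry proportions (rows sum to 1)
    P = rng.uniform(0.05, 0.95, size=(K, m))      # allele frequencies per population
    for _ in range(iters):
        # Least-squares update of P given Q, then project into [0, 1].
        P = np.clip(np.linalg.lstsq(Q, F, rcond=None)[0], 0.0, 1.0)
        # Least-squares update of Q given P, then project (crudely) onto the simplex.
        Q = np.linalg.lstsq(P.T, F.T, rcond=None)[0].T
        Q = np.clip(Q, 1e-6, None)
        Q /= Q.sum(axis=1, keepdims=True)
    return Q, P

# Simulate a small admixed dataset from the binomial genotype model and recover Q.
rng = np.random.default_rng(1)
true_Q = rng.dirichlet([0.5, 0.5], size=100)
true_P = rng.uniform(0.05, 0.95, size=(2, 500))
G = rng.binomial(2, true_Q @ true_P)
Q_hat, P_hat = ls_admixture(G, K=2)
print(Q_hat[:3].round(2))
```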
Background. A phase II design with an option for direct assignment (stop randomization and assign all patients to the experimental treatment based on an interim analysis, IA) for a predefined subgroup was previously proposed. Here, we illustrate the modularity of the direct assignment option by applying it to the setting of two predefined subgroups and testing for separate subgroup main effects. Methods. We power the 2-subgroup direct assignment option design with 1 IA (DAD-1) to test for separate subgroup main effects, with assessment of power to detect an interaction in a post-hoc test. Simulations assessed the statistical properties of this design compared with the 2-subgroup balanced randomized design with 1 IA (BRD-1). Different response rates for treatment/control in subgroup 1 (0.4/0.2) and in subgroup 2 (0.1/0.2, 0.4/0.2) were considered. Results. The 2-subgroup DAD-1 preserves power and type I error rate compared with the 2-subgroup BRD-1, while exhibiting reasonable power in a post-hoc test for interaction. Conclusion. The direct assignment option is a flexible design component that can be incorporated into broader design frameworks, while maintaining desirable statistical properties, clinical appeal, and logistical simplicity.
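To illustrate the direct-assignment mechanic by simulation, the sketch below evaluates a single subgroup (the 2-subgroup DAD-1 applies this machinery per subgroup): if the interim z-statistic crosses a "go" threshold, randomization stops and all remaining patients receive the experimental treatment, and power is estimated from the final test. The sample sizes, thresholds, and final test are placeholders, not the DAD-1 specification.

```python
# Simplified, hedged simulation of a direct-assignment option with one interim
# analysis for a single subgroup; not the paper's design parameters.
import numpy as np
from scipy import stats

def simulate_power(p_trt, p_ctl, n_stage=40, z_go=1.0, n_sim=5000, alpha=0.10, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        # Stage 1: 1:1 randomization.
        x_t1 = rng.binomial(n_stage, p_trt); x_c1 = rng.binomial(n_stage, p_ctl)
        p_hat = (x_t1 + x_c1) / (2 * n_stage)
        se = np.sqrt(max(p_hat * (1 - p_hat) * (2 / n_stage), 1e-12))
        z_interim = (x_t1 / n_stage - x_c1 / n_stage) / se
        if z_interim > z_go:
            # Direct assignment: all stage-2 patients receive the experimental arm.
            x_t2, n_t2, x_c2, n_c2 = rng.binomial(2 * n_stage, p_trt), 2 * n_stage, 0, 0
        else:
            # Continue 1:1 randomization in stage 2.
            x_t2, n_t2 = rng.binomial(n_stage, p_trt), n_stage
            x_c2, n_c2 = rng.binomial(n_stage, p_ctl), n_stage
        # Final two-proportion z-test on all accumulated data.
        n_t, n_c = n_stage + n_t2, n_stage + n_c2
        pt, pc = (x_t1 + x_t2) / n_t, (x_c1 + x_c2) / n_c
        pp = (x_t1 + x_t2 + x_c1 + x_c2) / (n_t + n_c)
        z = (pt - pc) / np.sqrt(max(pp * (1 - pp) * (1 / n_t + 1 / n_c), 1e-12))
        rejections += z > stats.norm.ppf(1 - alpha)
    return rejections / n_sim

print('power (0.4 vs 0.2):', simulate_power(0.4, 0.2))
print('type I error (0.2 vs 0.2):', simulate_power(0.2, 0.2))
```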