Human mobility data are useful for various applications in urban planning, transportation, and public health, but collecting and sharing real-world trajectories can be challenging due to privacy and data quality issues. To address these problems, recent research focuses on generating synthetic trajectories, mainly using generative adversarial networks (GANs) trained on real-world trajectories. In this paper, we hypothesize that by explicitly capturing the modality of transportation (e.g., walking, biking, driving), we can generate not only more diverse and representative trajectories for different modalities, but also more realistic trajectories that preserve geographical density, trajectory-level, and transition-level properties by capturing both cross-modality and modality-specific patterns. Towards this end, we propose a Clustering-based Sequence Generative Adversarial Network (CSGAN) that simultaneously clusters trajectories by modality and learns the essential properties of real-world trajectories in order to generate realistic and representative synthetic trajectories. To measure the quality of the generated trajectories, in addition to typical density-level and trajectory-level statistics, we define several new metrics for a comprehensive evaluation, including modality distributions and transition probabilities both globally and within each modality. Our extensive experiments with real-world datasets show the superiority of our model over state-of-the-art models across these metrics.
Federated learning, as a popular paradigm for collaborative training, is vulnerable to privacy attacks. Different privacy levels reflecting users' attitudes need to be satisfied locally, while a strict privacy guarantee for the global model is also required centrally. Personalized Local Differential Privacy (PLDP) is suitable for preserving users' varying local privacy, yet provides only a central privacy guarantee equivalent to the worst-case local privacy level. Thus, achieving strong central privacy as well as personalized local privacy with a utility-promising model is a challenging problem. In this work, we build a general framework (APES) to strengthen model privacy under personalized local privacy by leveraging the privacy amplification effect of the shuffle model. To tighten the privacy bound, we quantify the heterogeneous contributions to the central privacy user by user. The contributions are characterized by each user's ability to generate "echoes" from its perturbation, which is carefully measured by the proposed Neighbor Divergence and Clip-Laplace Mechanism methods. Furthermore, we propose a refined framework (S-APES) with a post-sparsification technique to reduce privacy loss in high-dimensional scenarios. To the best of our knowledge, this is the first work to consider the impact of shuffling on personalized local privacy. We provide a strong privacy amplification effect, with a bound tighter than the baseline derived from existing methods for uniform local privacy. Experiments demonstrate that our frameworks ensure comparable or higher accuracy for the global model.
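The APES privacy analysis (Neighbor Divergence, Clip-Laplace Mechanism) is specific to the paper and not reproduced here. As a minimal sketch of the underlying shuffle-model pipeline with per-user privacy budgets, the following uses a plain Laplace mechanism as a stand-in local randomizer; all names and parameters are illustrative assumptions, not the paper's mechanism.

```python
import numpy as np

def personalized_shuffle_round(values, epsilons, sensitivity=1.0, seed=0):
    """Each user perturbs its value locally with its own budget epsilon
    (Laplace mechanism, stand-in randomizer), then a trusted shuffler
    breaks the user-to-report linkage by random permutation."""
    rng = np.random.default_rng(seed)
    reports = [v + rng.laplace(0.0, sensitivity / eps)
               for v, eps in zip(values, epsilons)]
    return rng.permutation(np.array(reports))  # the shuffler's only job

# Users hold updates in [0, 1] with heterogeneous budgets
# (smaller epsilon = stronger local privacy, more noise).
values = [0.2, 0.9, 0.5, 0.7]
epsilons = [0.5, 1.0, 2.0, 8.0]
noisy = personalized_shuffle_round(values, epsilons)
```

The analytical question the paper addresses is how much extra central privacy this shuffling step buys when the epsilons are heterogeneous, rather than assuming the worst-case budget for everyone.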
Atrial fibrillation (AF) is the most common cardiac arrhythmia as well as a significant risk factor in heart failure and coronary artery disease. AF can be detected by using a short ECG recording. However, discriminating atrial fibrillation from normal sinus rhythm, other arrhythmias, and strong noise in a short ECG recording is challenging. Towards this end, we propose MultiFusionNet, a deep learning network that uses a multiplicative fusion method to combine two deep neural networks trained on different sources of knowledge, i.e., extracted features and raw data. MultiFusionNet can thus exploit the relevant extracted features to improve the deep learning model's use of the raw data. Our experiments show that this approach offers the most accurate AF classification and outperforms recently published algorithms that use either extracted features or raw data alone. Finally, we show that our multiplicative fusion method for combining the two sub-networks outperforms several other combining methods.
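The exact MultiFusionNet architecture is not given in this summary. One common form of multiplicative fusion, sketched below with NumPy, projects each sub-network's embedding into a shared space and takes an element-wise product; the shapes, names, and linear projections are illustrative assumptions, and the paper's variant may differ.

```python
import numpy as np

def multiplicative_fusion(h_raw, h_feat, w_raw, w_feat):
    """Project each sub-network embedding into a shared space and
    fuse them by element-wise multiplication, so each branch gates
    the other's representation."""
    return (h_raw @ w_raw) * (h_feat @ w_feat)

rng = np.random.default_rng(0)
h_raw = rng.standard_normal((4, 64))    # batch of 4 from a raw-ECG sub-network
h_feat = rng.standard_normal((4, 16))   # same batch from a feature sub-network
w_raw = rng.standard_normal((64, 32))   # projection to a shared 32-dim space
w_feat = rng.standard_normal((16, 32))
fused = multiplicative_fusion(h_raw, h_feat, w_raw, w_feat)
print(fused.shape)  # (4, 32)
```

In contrast to additive fusion (concatenation or summation), the multiplicative form lets one branch suppress or amplify the other's activations, which is one intuition for why it can combine heterogeneous knowledge sources more effectively.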
Background
Genomic data have been collected by different institutions and companies and need to be shared for broader use. In a cross-site genomic data sharing system, a secure and transparent access control audit module plays an essential role in ensuring accountability. A centralized access log audit system is vulnerable to a single point of attack and also lacks transparency, since the log could be tampered with by a malicious system administrator or internal adversaries. Several studies have proposed blockchain-based access audits to solve this problem, but without considering the efficiency of the audit queries. The first track of the 2018 iDASH competition provided us with an opportunity to design an efficient logging and querying system for cross-site genomic dataset access audits. We designed a blockchain-based log system that provides a lightweight and widely compatible module for existing blockchain platforms. The submitted solution won third place in the competition. In this paper, we report the technical details of our system.
Methods
We present two methods: a baseline method and an enhanced method. We started with the baseline method and then adjusted our implementation based on the competition evaluation criteria and the characteristics of the log system. To overcome the obstacles of indexing on an immutable blockchain, we designed a hierarchical timestamp structure that supports efficient range queries on the timestamp field.
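The abstract does not specify the structure's internals. As a minimal sketch of the general idea, assuming a simple bucketed hierarchy, the index below groups log entries by coarse time units so a range query only scans buckets that intersect the requested interval rather than the whole append-only log; the class name and bucket granularity are illustrative.

```python
from collections import defaultdict

class HierarchicalTimestampIndex:
    """Bucket log entries by coarse time units so a range query
    scans only the buckets intersecting the query interval."""

    def __init__(self, bucket_size=3600):  # one-hour buckets (assumed)
        self.bucket_size = bucket_size
        self.buckets = defaultdict(list)   # bucket id -> list of (ts, record)

    def insert(self, ts, record):
        self.buckets[ts // self.bucket_size].append((ts, record))

    def range_query(self, start, end):
        hits = []
        # Visit only buckets overlapping [start, end], then filter exactly.
        for b in range(start // self.bucket_size, end // self.bucket_size + 1):
            for ts, rec in self.buckets.get(b, []):
                if start <= ts <= end:
                    hits.append(rec)
        return hits

idx = HierarchicalTimestampIndex()
idx.insert(10, "access-A")
idx.insert(5000, "access-B")
idx.insert(9000, "access-C")
print(idx.range_query(0, 6000))  # ['access-A', 'access-B']
```

On a blockchain the bucket-to-block mapping would itself be stored on-chain, so the index can be rebuilt deterministically by any node without trusting a central administrator.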
Results
We implemented our methods in Python 3, tested scalability, and compared performance using the test data supplied by the competition organizers. We successfully boosted the log retrieval speed for complex AND queries that contain multiple predicates. For range queries, we improved the speed by at least one order of magnitude, and storage usage was reduced by 25%.
Conclusion
We demonstrate that blockchain can be used to build a time- and space-efficient system for logging and querying genomic dataset audit trails. It therefore provides a promising solution for sharing genomic data with accountability requirements across multiple sites.
Differential privacy has recently emerged as one of the strongest privacy guarantees for private statistical data release. Releasing synthetic data that mimic original data with differential privacy provides a promising way for privacy-preserving data sharing and analytics while offering a rigorous privacy guarantee. However, to date there are no open-source tools that allow users to generate differentially private synthetic data, in particular for high-dimensional and large-domain data. Most existing techniques that generate differentially private histograms or synthetic data work well only for single-dimensional or low-dimensional histograms; they become problematic for high-dimensional and large-domain data due to increased perturbation error and computational complexity. We propose DPSynthesizer, a toolkit for differentially private data synthesis. The core of DPSynthesizer is DPCopula, designed for high-dimensional and large-domain data. DPCopula computes a differentially private copula function from which synthetic data can be sampled. Copula functions describe the dependence between multivariate random vectors and allow us to build the multivariate joint distribution from one-dimensional marginal distributions. DPSynthesizer also implements a set of state-of-the-art methods for building differentially private histograms, suitable for low-dimensional data, from which synthetic data can be generated. We demonstrate the system using DPCopula as well as the other methods on various datasets, showing the feasibility, utility, and efficiency of each method.
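As a non-private illustration of the copula idea behind DPCopula (the differentially private estimation of the copula parameters is the paper's contribution and is omitted here), the sketch below samples synthetic records whose dependence follows a Gaussian copula and whose marginals follow user-supplied inverse CDFs; all names and the example marginals are illustrative.

```python
import math
import numpy as np

def gaussian_copula_sample(corr, inverse_cdfs, n, seed=0):
    """Draw n samples whose marginals follow the given inverse CDFs
    and whose dependence follows a Gaussian copula with matrix corr."""
    rng = np.random.default_rng(seed)
    # Correlated standard normals encode the dependence structure.
    z = rng.multivariate_normal(np.zeros(len(inverse_cdfs)), corr, size=n)
    # Standard normal CDF maps each column to uniform [0, 1].
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    # Each marginal's inverse CDF restores its one-dimensional distribution.
    cols = [np.vectorize(f)(u[:, j]) for j, f in enumerate(inverse_cdfs)]
    return np.column_stack(cols)

# Example: two correlated columns with exponential and uniform[0, 10] marginals.
corr = np.array([[1.0, 0.8], [0.8, 1.0]])
inv_cdfs = [lambda p: -math.log(1.0 - p), lambda p: 10.0 * p]
data = gaussian_copula_sample(corr, inv_cdfs, 5000)
```

The appeal for high-dimensional data is that only the one-dimensional marginals and a correlation matrix need to be estimated (and, in DPCopula, perturbed for privacy), rather than a full joint histogram whose size grows exponentially with the number of attributes.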
Head and neck squamous cell carcinoma (HNSCC) is one of the most common types of human cancer and frequently metastasizes to lymph nodes (LNs). Identifying metastasis-promoting factors is of immense clinical interest, as the prognosis for patients with even a single unilateral LN metastasis is extremely poor. Here, we report that p90 ribosomal S6 kinase 2 (RSK2) promotes human HNSCC cell invasion and metastasis. We determined that RSK2 was overexpressed and activated in highly invasive HNSCC cell lines compared with poorly invasive cell lines. Expression of RSK2 also correlated with metastatic progression in patients with HNSCC. Ectopic expression of RSK2 substantially enhanced the invasive capacity of HNSCC cells, while inhibition of RSK2 activity led to marked attenuation of invasion in vitro. Additionally, shRNA knockdown of RSK2 substantially reduced the invasive potential of HNSCC cells in vitro and their metastatic potential in vivo in a xenograft mouse model. Mechanistically, we determined that cAMP-responsive element-binding protein (CREB) and Hsp27 are phosphorylated and activated by RSK2 and are important for the RSK2-mediated invasive ability of HNSCC cells. Our findings suggest that RSK2 is involved in the prometastatic programming of HNSCC cells through phosphorylation of proteins in a putative signaling network. Moreover, targeting RSK2 markedly attenuates in vitro invasion and in vivo metastasis of HNSCC cells, suggesting that RSK2 may represent a therapeutic target in the treatment of metastatic HNSCC.
The outsourcing of genomic data into public cloud computing settings raises concerns over privacy and security. Significant advancements in secure computation methods have emerged over the past several years, but such techniques need to be rigorously evaluated for their ability to support the analysis of human genomic data in an efficient and cost-effective manner. With respect to public cloud environments, there are concerns about the inadvertent exposure of human genomic data to unauthorized users. In analyses involving multiple institutions, there is additional concern about data being used beyond the agreed research scope and being processed in untrusted computational environments, which may not satisfy institutional policies. To systematically investigate these issues, the NIH-funded National Center for Biomedical Computing iDASH (integrating Data for Analysis, 'anonymization' and SHaring) hosted the second Critical Assessment of Data Privacy and Protection competition to assess the capacity of cryptographic technologies for protecting computation over human genomes in the cloud and promoting cross-institutional collaboration. Data scientists were challenged to design and engineer practical algorithms for secure outsourcing of genome computation tasks in working software, whereby analyses are performed only on encrypted data. They were also challenged to develop approaches to enable secure collaboration on data from genomic studies generated by multiple organizations (e.g., medical centers) to jointly compute aggregate statistics without sharing individual-level records. The results of the competition indicated that secure computation techniques can enable comparative analysis of human genomes, but greater efficiency (in terms of compute time and memory utilization) is needed before they are sufficiently practical for real-world environments.
by Tsung-Ting Kuo, Tyler Bath, Shuaicheng Ma, Nicholas Pattengale, Meng Yang, Yao Cao, Corey M Hudson, Jihoon Kim, Kai Post, Li Xiong, and Lucila Ohno-Machado
Background: Blockchain distributed ledger technology is just starting to be adopted in genomics and healthcare applications. Despite its increased prevalence in biomedical research applications, skepticism regarding the practicality of blockchain technology for real-world problems is still strong, and there are few implementations beyond proof of concept. We focus on benchmarking blockchain strategies applied to distributed methods for sharing records of gene-drug interactions. We expect this type of sharing will expedite personalized medicine. Basic Procedures: We generated gene-drug interaction test datasets using the Clinical Pharmacogenetics Implementation Consortium (CPIC) resource. We developed three blockchain-based methods to share patient records on gene-drug interactions: Query Index, Index Everything, and Dual-Scenario Indexing. Main Findings: We achieved a runtime of about 60 s for importing 4,000 gene-drug interaction records from four sites, and about 0.5 s for a data retrieval query. Our results demonstrated that it is feasible to leverage blockchain as a new platform to share data among institutions. Principal Conclusions: We show the benchmarking results of novel blockchain-based methods for institutions to share patient outcomes related to gene-drug interactions. Our findings support blockchain utilization in healthcare, genomics, and broader biomedical applications. The source code is publicly available at https://github.com/tsungtingkuo/genedrug.
Representation learning on static graph-structured data has shown a significant impact on many real-world applications. However, less attention has been paid to the evolving nature of temporal networks, whose edges change over time. The embeddings of such temporal networks should encode both graph-structured information and the temporally evolving pattern. Existing approaches to learning temporally evolving network representations fail to capture this temporal interdependence. In this paper, we propose Toffee, a novel approach for temporal network representation learning based on tensor decomposition. Our method exploits the tensor-tensor product operator to encode cross-time information, so that periodic changes in the evolving networks can be captured. Experimental results demonstrate that Toffee outperforms existing methods on multiple real-world temporal networks in generating effective embeddings for link prediction tasks.
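Toffee's full decomposition is not described here, but the tensor-tensor product (t-product) it builds on has a standard definition: take an FFT along the third (temporal) mode, multiply corresponding frontal slices in the Fourier domain, then apply the inverse FFT. A minimal sketch of that operator, with illustrative shapes:

```python
import numpy as np

def t_product(A, B):
    """Tensor-tensor product of A (m x p x T) and B (p x q x T):
    FFT along the temporal mode, slice-wise matrix products in the
    Fourier domain, inverse FFT back. Each frequency mixes every
    time step, which is how cross-time information gets encoded."""
    T = A.shape[2]
    Af = np.fft.fft(A, axis=2)
    Bf = np.fft.fft(B, axis=2)
    Cf = np.empty((A.shape[0], B.shape[1], T), dtype=complex)
    for k in range(T):
        Cf[:, :, k] = Af[:, :, k] @ Bf[:, :, k]
    return np.real(np.fft.ifft(Cf, axis=2))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3, 4))  # 5x3 factor slices over 4 time steps
B = rng.standard_normal((3, 2, 4))
C = t_product(A, B)
print(C.shape)  # (5, 2, 4)
```

Because the FFT diagonalizes circular convolution along the time axis, the t-product is equivalent to a circulant mixing of time steps, which gives tensor factorizations built on it a natural way to model periodic temporal patterns.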
Tensor factorization has proven to be an efficient unsupervised learning approach for health data analysis, especially for computational phenotyping, where high-dimensional Electronic Health Records (EHRs), containing patients' histories of medical procedures, medications, diagnoses, lab tests, etc., are converted into meaningful and interpretable medical concepts. Federated tensor factorization distributes the tensor computation to multiple workers under the coordination of a central server, which enables jointly learning phenotypes across multiple hospitals while preserving the privacy of patient information. However, existing federated tensor factorization algorithms suffer from a single point of failure at the central server, which is not only easily exposed to external attacks, but also limits the number of clients that can share information with the server under restricted uplink bandwidth. In this paper, we propose CiderTF, a communication-efficient decentralized generalized tensor factorization, which reduces the uplink communication cost by leveraging a four-level communication reduction strategy designed for generalized tensor factorization, offering the flexibility to model different tensor distributions with multiple kinds of loss functions. Experiments on two real-world EHR datasets demonstrate that CiderTF achieves comparable convergence with communication reduction of up to 99.99%.