About this item:

Author Notes:

Corresponding author: Mark P. Styczynski. Address: 311 Ferst Drive NW, Atlanta, GA 30332-0100, USA. Email: Mark.Styczynski@chbe.gatech.edu.

WY and MPS conceived of the computational pipeline, designed the experiments, and wrote the manuscript.

WY implemented the pipeline and performed the experiments.

MRG and AM designed and supervised the animal experiments that provided the samples for the transcriptional dataset.

JCK designed and supervised the curation and storage of the transcriptional dataset.

The authors thank all of the members of MaHPIC for their contributions to the project that helped enable the generation of the dataset used in this work. They also thank Zachary Johnson and the Yerkes Genomics Core for performing the sequencing for the transcriptional data, and Aleksey Zimin and Rob Norgren for providing annotated M. mulatta genome sequence for the transcriptional data.


Research Funding:

This project has been funded in whole or in part with federal funds from the National Institute of Allergy and Infectious Diseases; National Institutes of Health, Department of Health and Human Services [contract no. HHSN272201200031C].


  • Science & Technology
  • Life Sciences & Biomedicine
  • Biology
  • Mathematical & Computational Biology
  • Bayesian network inference
  • Large-scale data analysis
  • Model development
  • Infectious diseases
  • Malaria

From genome-scale data to models of infectious disease: A Bayesian network-based strategy to drive model development


Journal Title:

Mathematical Biosciences


Volume 270, Number Pt B, Pages 156-168

Type of Work:

Article | Post-print: After Peer Review


High-throughput, genome-scale data present a unique opportunity to link host to pathogen on a molecular level. Forging such connections will help drive the development of mathematical models to better understand and predict both pathogen behavior and the epidemiology of infectious diseases, including malaria. However, the datasets that can aid in identifying these links and models are vast and not amenable to simple, reductionist, and univariate analyses. These datasets require data mining in order to identify the truly important measurements that best describe clinical and molecular observations. Moreover, these datasets typically have relatively few samples due to experimental limitations (particularly for human studies or in vivo animal experiments), making data mining extremely difficult. Here, after first providing a brief overview of common strategies for data reduction and identification of relationships between variables for inclusion in mathematical models, we present a new generalized strategy for performing these data reduction and relationship inference tasks. Our approach emphasizes the importance of robustness when using data to drive model development, particularly when using genome-scale, small-sample in vivo data. We identify the use of appropriate feature reduction combined with data permutations and subsampling strategies as being critical to enable increasingly robust results from network inference using high-dimensional, low-observation data.

Copyright information:

© 2015 Published by Elsevier B.V.

This is an Open Access work distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nc-nd/4.0/).
