About this item:


Author Notes:

Correspondence addresses: Zhaohui Qin, 1518 Clifton Rd NE, Atlanta, GA 30322. Email: zhaohui.qin@emory.edu; and Fusheng Wang, CS2314D, Stony Brook, NY 11794. Email: fusheng.wang@stonybrook.edu

J.G. introduced the problem.

X.S. and F.W. initiated the project.

X.S. designed and implemented the CloudMerge project.

X.S. drafted the manuscript.

X.S., J.P., F.W., and Z.Q. revised the manuscript.

K.C.B. conceived the initial consortium design, acquired biospecimens for Next Generation Sequencing (NGS), and facilitated generation of NGS data.

K.C.B., R.A.M., I.R., and T.H.B. conceived initial experiments and interpreted NGS data.

E.G.B. and C.E. acquired biospecimens for NGS and facilitated generation of NGS data.

We thank the three referees for their constructive critiques and detailed comments.

We are grateful to Mary Taylor Mann and Alyssa Leann Duck for their editorial help during writing and revising of the manuscript.

E.G.B. acknowledges the following GALA II and SAGE co-investigators for subject recruitment, sample processing, and quality control: Sandra Salazar, Scott Huntsman, MSc, Donglei Hu, PhD, Lisa Caine, Shannon Thyne, MD, Harold J. Farber, MD, MSPH, Pedro C. Avila, MD, Denise Serebrisky, MD, William Rodriguez-Cintron, MD, Jose R. Rodriguez-Santana, MD, Rajesh Kumar, MD, Luisa N. Borrell, DDS, PhD, Emerita Brigino-Buenaventura, MD, Adam Davis, MA, MPH, Michael A. LeNoir, MD, Kelley Meade, MD, Saunak Sen, PhD, and Fred Lurmann, MS.

Ethics approval for the CAAPA program was provided by the Johns Hopkins University Institutional Review Board (IRB) following commencement of the study in 2011 (IRB00045892, CAAPA) and included study team members from each CAAPA site, including Emory University (site PI, Zhaohui Qin).

Access to the raw data is granted to CAAPA team members according to the guidelines of the IRB-approved study.

Informed consent has been obtained from all study participants of CAAPA.

The authors declare that they have no competing interests.


Research Funding:

This study was supported by grants from the National Heart, Lung, and Blood Institute (R01HL104608, R01HL117004, R01HL128439, R01HL135156, X01HL134589); National Institute of Environmental Health Sciences (R01ES015794, R21ES24844); National Institute on Minority Health and Health Disparities (P60MD006902, R01MD010443, RL5GM118984); National Institute of Neurological Disorders and Stroke (R01NS051630, P01NS097206, U54NS091859); National Science Foundation (ACI 1443054, IIS 1350885); Tobacco-Related Disease Research Program (24RT-0025).

The Genes-Environments and Admixture in Latino Americans (GALA II) Study; the Study of African Americans, Asthma, Genes and Environments (SAGE) Study; and E.G.B. are supported by the Sandler Family Foundation, the American Asthma Foundation, the RWJF Amos Medical Faculty Development Program, and the Harry Wm. and Diana V. Hind Distinguished Professor in Pharmaceutical Sciences II.


  • Science & Technology
  • Multidisciplinary Sciences
  • Science & Technology - Other Topics
  • sorted merging
  • whole-genome sequencing
  • MapReduce
  • Hadoop
  • HBase
  • Spark
  • TOOL

Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files



Journal Title:



Volume 7, Number 6


Type of Work:

Article | Final Publisher PDF


Background: Sorted merging of genomic data is a common operation in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by their genomic locations. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine-based methods become increasingly inefficient when processing large numbers of files because of excessive computation time and input/output bottlenecks. Distributed systems, and more recently cloud-based systems, offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance.

Findings: In this study, we custom-design optimized schemas for three Apache big-data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. All three schemas adopt a divide-and-conquer strategy that splits the merging job into sequential phases/stages of subtasks, which are conquered in an ordered, parallel, and bottleneck-free way. In two illustrative examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, benchmarked against traditional single/parallel multiway-merge methods, a message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFtools.

Conclusions: Our experiments suggest that all three schemas either deliver a significant improvement in efficiency or show much better strong and weak scalability than traditional methods. Our findings provide generalized, scalable schemas for performing sorted merging of genetics and genomics data with these Apache distributed systems.
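As a minimal, single-machine illustration of the sorted-merge operation the abstract describes (the paper's schemas parallelize this idea across Hadoop, HBase, or Spark workers rather than doing it on one node), the sketch below k-way-merges pre-sorted variant streams by genomic location using Python's `heapq.merge`. The record layout, chromosome names, and sample IDs are invented for illustration and are not the paper's data format.

```python
import heapq

# Hypothetical minimal VCF-like records: (chromosome, position, sample_id, genotype).
# Each per-subject stream is assumed to be pre-sorted by genomic location,
# as per-sample VCF files typically are.
file_a = [("chr1", 100, "S1", "0/1"), ("chr1", 250, "S1", "1/1")]
file_b = [("chr1", 120, "S2", "0/0"), ("chr2", 50, "S2", "0/1")]
file_c = [("chr1", 100, "S3", "0/1"), ("chr2", 75, "S3", "1/1")]

def merge_sorted_variants(*streams):
    """k-way merge of pre-sorted variant streams by (chromosome, position).

    heapq.merge never materializes the inputs in full, so memory use stays
    proportional to the number of streams, not the number of records.
    """
    return heapq.merge(*streams, key=lambda rec: (rec[0], rec[1]))

merged = list(merge_sorted_variants(file_a, file_b, file_c))
# All six records come out ordered by (chromosome, position); ties keep
# the order of the input streams because heapq.merge is stable.
```

The distributed versions in the paper replace this single heap with a partition-then-merge plan: genomic coordinate ranges are split across workers so that each subtask merges one range independently, avoiding a single-node I/O bottleneck.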

Copyright information:

© The Authors 2018. Published by Oxford University Press.

This is an Open Access work distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).