
Search Results for all work with filters:

  • Sun, Xiaobo
  • data
  • scienc
  • Hum Gen: Admin

Work 1-2 of 2

Sorted by relevance

Article

Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files

by Xiaobo Sun; Jingjing Gao; Peng Jin; Celeste Eng; Esteban G. Burchard; Terri H. Beaty; Ingo Ruczinski; Rasika A. Mathias; Kathleen Barnes; Fusheng Wang; Zhaohui Qin

2018

Subjects
  • Biology, Biostatistics
  • Computer Science
  • Health Sciences, Epidemiology

Abstract:

Background: Sorted merging of genomic data is a common operation in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by genomic location. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome sequencing or whole-exome sequencing projects. Traditional single-machine methods become increasingly inefficient when processing large numbers of files because of excessive computation time and the input/output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution. However, carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance.

Findings: In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrative examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, benchmarked against the traditional single/parallel multiway-merge methods, a message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools.

Conclusions: Our experiments suggest that all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalability than traditional methods. Our findings provide generalized scalable schemas for performing sorted merging on genetics and genomics data using these Apache distributed systems.
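
As a rough illustration of what "sorted merging" means here, the following is a minimal single-machine sketch of a multiway merge, the kind of traditional baseline the abstract mentions. It is not the authors' Hadoop, HBase, or Spark schema. The record layout `(chromosome index, position, genotype)`, the function name `merge_sorted_records`, and the `MISSING` placeholder are all hypothetical choices made for this example, and each input is assumed to be already sorted by genomic location.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical simplified record: (chromosome index, position, genotype string).
Record = Tuple[int, int, str]

MISSING = "./."  # assumed placeholder for a subject with no call at a site


def _tag(idx: int, src: Iterable[Record]) -> Iterator[Tuple[int, int, int, str]]:
    """Attach the subject index so each genotype lands in the right column."""
    for chrom, pos, genotype in src:
        yield (chrom, pos, idx, genotype)


def merge_sorted_records(
    sources: List[Iterable[Record]],
) -> Iterator[Tuple[int, int, List[str]]]:
    """K-way merge of per-subject streams, each pre-sorted by (chrom, pos).

    Yields one row per distinct genomic location, with one genotype
    column per input stream.
    """
    n = len(sources)
    streams = [_tag(i, s) for i, s in enumerate(sources)]
    current_key = None
    row: List[str] = []
    # heapq.merge lazily interleaves the sorted streams in global order.
    for chrom, pos, idx, genotype in heapq.merge(*streams):
        key = (chrom, pos)
        if key != current_key:
            if current_key is not None:
                yield (*current_key, row)
            current_key = key
            row = [MISSING] * n
        row[idx] = genotype
    if current_key is not None:
        yield (*current_key, row)


subject_a = [(1, 100, "0/1"), (1, 300, "1/1")]
subject_b = [(1, 100, "0/0"), (1, 200, "0/1"), (2, 50, "1/1")]
merged = list(merge_sorted_records([subject_a, subject_b]))
```

Because a single process reads every input stream, this approach runs into the computation-time and input/output limits that the abstract attributes to single-machine methods as the number of files grows.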

Article

Omicseq: a web-based search engine for exploring omics datasets

by Xiaobo Sun; William S Pittard; Tianlei Xu; Li Chen; Michael Zwick; Xiaoqian Jiang; Fusheng Wang; Zhaohui Qin

2017

Subjects
  • Biology, Genetics
  • Engineering, Biomedical
  • Health Sciences, Public Health

Abstract:

The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long-standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets in which the vast majority of the content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve the 'findability' of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable and elastic NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant.