Background:
The most widely used technologies for profiling microbial communities are 16S marker-gene sequencing and shotgun metagenomic sequencing. Interestingly, many microbiome studies have performed both sequencing experiments on the same cohort of samples. The two sequencing datasets often reveal consistent patterns of microbial signatures, highlighting the potential for an integrative analysis to improve the power to test these signatures. However, differential experimental biases, partially overlapping samples, and differential library sizes pose tremendous challenges when combining the two datasets. Currently, researchers either discard one dataset entirely or use different datasets for different objectives.
Methods:
In this article, we introduce the first method of this kind, named Com-2seq, that combines the two sequencing datasets for testing differential abundance at the genus and community levels while overcoming these difficulties. The new method is based on our LOCOM model (Hu et al., 2022), which employs logistic regression for testing taxon differential abundance while remaining robust to experimental bias. To benchmark the performance of Com-2seq, we introduce two ad hoc approaches: applying LOCOM to pooled taxa count data and combining LOCOM p-values from analyzing each dataset separately.
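The second ad hoc approach, combining per-dataset p-values, can be illustrated with a minimal sketch. The abstract does not say which combination rule was used, so Fisher's method, the function name `fisher_combine`, and the example p-values below are all assumptions. Fisher's method also presumes independent p-values, which may not hold when the two datasets share samples.

```python
import math

def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method (an assumed rule).

    The statistic X = -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(pvalues).
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(chi2_{2k} > X) = exp(-X/2) * sum_{i<k} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical p-values for one genus from the 16S and shotgun analyses.
combined = fisher_combine([0.04, 0.03])
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the closed-form tail.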
Results:
Our simulation studies indicate that Com-2seq substantially improves statistical efficiency over analysis of either dataset alone and works better than the two ad hoc approaches. An application of Com-2seq to two real microbiome studies uncovered scientifically plausible findings that would have been missed by analyzing individual datasets.
Conclusions:
Com-2seq performs integrative analysis of 16S and metagenomic sequencing data, which improves statistical efficiency and has the potential to accelerate the search for microbial communities and taxa that are involved in human health and diseases.
Microbiome data are subject to experimental bias that is caused by DNA extraction and PCR amplification, among other sources, but this important feature is often ignored when developing statistical methods for analyzing microbiome data. McLaren, Willis, and Callahan (2019) proposed a model for how such biases affect the observed taxonomic profiles; this model assumes the main effects of bias without taxon–taxon interactions. Our newly developed method for testing the differential abundance of taxa, LOCOM, is the first method to account for experimental bias and is robust to the main effect biases. However, there is also evidence for taxon–taxon interactions. In this report, we formulated a model for interaction biases and used simulations based on this model to evaluate the impact of interaction biases on the performance of LOCOM as well as other available compositional analysis methods. Our simulation results indicate that LOCOM remained robust to a reasonable range of interaction biases. The other methods tend to have an inflated FDR even when there were only main effect biases. LOCOM maintained the highest sensitivity even when the other methods could not control the FDR. We thus conclude that LOCOM outperforms the other methods for compositional analysis of microbiome data considered here.
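The main-effect bias model can be illustrated with a small numerical sketch: each taxon's true proportion is multiplied by a taxon-specific efficiency and the result is renormalized. The compositions, efficiencies, and function names below are hypothetical, and the sketch assumes the same efficiencies apply to both samples. It is a toy demonstration of the multiplicative model only, not an implementation of LOCOM.

```python
def apply_main_effect_bias(true_props, efficiencies):
    """Multiply each taxon's true proportion by its efficiency and renormalize."""
    biased = [p * e for p, e in zip(true_props, efficiencies)]
    total = sum(biased)
    return [b / total for b in biased]

def ratio_of_ratios(sample_a, sample_b, j, k):
    """Ratio of taxon j to taxon k in sample A, divided by the same ratio in B."""
    return (sample_a[j] / sample_a[k]) / (sample_b[j] / sample_b[k])

# Hypothetical true compositions of three taxa in two samples.
true_a = [0.5, 0.3, 0.2]
true_b = [0.2, 0.3, 0.5]
# Hypothetical taxon-specific measurement efficiencies shared by both samples.
eff = [4.0, 1.0, 0.25]

obs_a = apply_main_effect_bias(true_a, eff)
obs_b = apply_main_effect_bias(true_b, eff)

# The individual proportions are distorted, but the cross-sample ratio of
# taxon ratios is unchanged because the efficiencies cancel.
before = ratio_of_ratios(true_a, true_b, 0, 2)
after = ratio_of_ratios(obs_a, obs_b, 0, 2)
```

The cancellation holds only for main effects. Once an efficiency depends on which other taxa are present, the ratios no longer match exactly, which is the interaction case the simulations examine.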
Background: Gut microbiome dysbiosis has been demonstrated in subjects with newly diagnosed and chronic inflammatory bowel disease (IBD). In this study we sought to explore longitudinal changes in dysbiosis and ascertain associations between dysbiosis and markers of disease activity and treatment outcome. Methods: We performed a prospective cohort study of 19 treatment-naïve pediatric IBD subjects and 10 healthy controls, measuring fecal calprotectin and assessing the gut microbiome via repeated stool samples. Associations between clinical characteristics and the microbiome were tested using generalized estimating equations. Random forest classification was used to predict ultimate treatment response (presence of mucosal healing at follow-up colonoscopy) or non-response using patients' pretreatment samples. Results: Patients with Crohn's disease had increased markers of inflammation and dysbiosis compared to controls. Patients with ulcerative colitis had even higher inflammation and dysbiosis compared to those with Crohn's disease. For all cases, the gut microbial dysbiosis index associated significantly with clinical and biological measures of disease severity, but did not associate with treatment response. We found differences in specific gut microbiome genera between cases/controls and responders/non-responders including Akkermansia, Coprococcus, Fusobacterium, Veillonella, Faecalibacterium, and Adlercreutzia. Using pretreatment microbiome data in a weighted random forest classifier, we were able to obtain 76.5% accuracy for prediction of responder status. Conclusions: Patient dysbiosis improved over time but persisted even among those who responded to treatment and achieved mucosal healing. Although dysbiosis index was not significantly different between responders and non-responders, we found specific genus-level differences. We found that pretreatment microbiome signatures are a promising avenue for prediction of remission and response to treatment.
Proper control of confounding due to population stratification is crucial for valid analysis of case-control association studies. Fine matching of cases and controls based on genetic ancestry is an increasingly popular strategy to correct for such confounding, both in genome-wide association studies (GWAS) and in studies that employ next-generation sequencing, where matching can be used when selecting a subset of participants from a GWAS for rare-variant analysis. Existing matching methods match on measures of genetic ancestry that combine multiple components of ancestry into a scalar quantity. However, we show that including non-confounding ancestry components in a matching criterion can lead to inaccurate matches, and hence to an improper control of confounding. To resolve this issue, we propose a novel method that assigns cases and controls to matched strata based on the stratification score (Epstein et al., 2007, AJHG: 80: 921–930), which is the probability of disease given genomic variables. Matching on the stratification score leads to more accurate matches because case participants are matched to control participants who have a similar risk of disease given ancestry information. We illustrate our matching method using the African-American arm of the GAIN GWAS of schizophrenia. In this study, we observe confounding due to stratification that can be resolved by our matching approach but not by other existing matching procedures. We also use simulated data to show that our novel matching approach can provide a more appropriate correction for population stratification than existing matching approaches.
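The idea of grouping participants by stratification score can be sketched as follows. The abstract does not describe the actual matching algorithm, so the equal-size quantile binning, the function names, and the scores below are assumptions made only for illustration. The scores are taken as already estimated, for example from a regression of disease status on ancestry variables.

```python
def assign_strata(scores, n_strata):
    """Assign each participant to one of n_strata bins of roughly equal size,
    ordered by stratification score. Returns one stratum index per participant."""
    if n_strata < 1:
        raise ValueError("n_strata must be at least 1")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, idx in enumerate(order):
        strata[idx] = rank * n_strata // len(scores)
    return strata

def strata_counts(strata, is_case, n_strata):
    """Count [controls, cases] within each stratum."""
    counts = [[0, 0] for _ in range(n_strata)]
    for s, c in zip(strata, is_case):
        counts[s][1 if c else 0] += 1
    return counts

# Hypothetical stratification scores (estimated disease probabilities).
scores = [0.10, 0.80, 0.15, 0.75, 0.45, 0.50]
is_case = [False, True, False, True, True, False]

strata = assign_strata(scores, 3)
counts = strata_counts(strata, is_case, 3)
```

In this toy example only the middle stratum contains both a case and a control. Strata with only cases or only controls carry no information in a stratified case-control analysis, which is one reason the quality of the matching matters.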
by
Joan E. Bailey-Wilson;
Laura Almasy;
Mariza de Andrade;
Julia Bailey;
Heike Bickeboller;
Heather J. Cordell;
E. Warwick Daw;
Lynn Goldin;
Ellen L. Goode;
Courtney Gray-McGuire;
Wayne Hening;
Gail Jarvik;
Brion S. Maher;
Nancy Mendell;
Andrew D. Paterson;
John Rice;
Glen Satten;
Brian Suarez;
Veronica Vieland;
Marsha Wilcox;
Heping Zhang;
Andreas Ziegler;
Jean W. MacCluer
Genetic Analysis Workshop 14 (GAW14) was held September 7-10, 2005, in Noordwijkerhout, The Netherlands. The overarching theme was the comparison of microsatellite and single-nucleotide polymorphism (SNP) markers for genome-wide scans, and the statistical methods that can best exploit the information provided in such scans for linkage and association studies. The 183 contributions submitted to GAW14 were organized into 18 presentation groups of 7-15 papers each. GAW14 participants had the choice of two data sets to analyze. COGA generously donated extensive family data on alcoholism, related phenotypes, pertinent covariates, and a set of previously analyzed genomescan microsatellite marker genotypes. The simulated data were designed to have similarities to the real data set. Extensive new SNP genotyping was performed on DNA provided by COGA for the previously genotyped families. Illumina, Affymetrix, and the Center for Inherited Disease Research performed this work and donated these data to GAW14 and to COGA. The group summary papers collected in this issue present the major findings from GAW14.
by
Carol S. Rubin;
Adrianne K. Holmes;
Martin G. Belson;
Robert L. Jones;
W Dana Flanders;
Stephanie M. Kieszak;
John Osterloh;
George Luber;
Benjamin C. Blount;
Dana Boyd Barr;
Karen K. Steinberg;
Glen Satten;
Michael A. McGeehin;
Randall L. Todd
Background. Sixteen children diagnosed with acute leukemia between 1997 and 2002 lived in Churchill County, Nevada, at the time of or before their illness. Considering the county population and statewide cancer rate, fewer than two cases would be expected. Objectives. In March 2001, the Centers for Disease Control and Prevention led federal, state, and local agencies in a cross-sectional, case-comparison study to determine if ongoing environmental exposures posed a health risk to residents and to compare levels of contaminants in environmental and biologic samples collected from participating families. Methods. Surveys with more than 500 variables were administered to 205 people in 69 families. Blood, urine, and cheek cell samples were collected and analyzed for 139 chemicals, eight viral markers, and several genetic polymorphisms. Air, water, soil, and dust samples were collected from almost 80 homes to measure more than 200 chemicals. Results. The scope of this cancer cluster investigation exceeded any previous study of pediatric leukemia. Nonetheless, no exposure consistent with leukemia risk was identified. Overall, tungsten and arsenic levels in urine and water samples were significantly higher than national comparison values; however, levels were similar among case and comparison groups. Conclusions. Although the cases in this cancer cluster may in fact have a common etiology, their small number and the length of time between diagnosis and our exposure assessment lessen the ability to find an association between leukemia and environmental exposures. Given the limitations of individual cancer cluster investigations, it may prove more efficient to pool laboratory and questionnaire data from similar leukemia clusters.
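To give a sense of how unusual 16 cases are when fewer than two are expected, the sketch below computes a Poisson upper-tail probability. The Poisson model and the use of exactly 2 as the expected count are assumptions, not figures from the study. Because the true expectation is below 2, the result is an upper bound under that model.

```python
import math

def poisson_upper_tail(observed, expected, n_terms=200):
    """P(X >= observed) for X ~ Poisson(expected), summed term by term.

    Summing the tail directly avoids the precision loss of computing
    1 - P(X < observed) when the tail is very small.
    """
    term = math.exp(-expected) * expected ** observed / math.factorial(observed)
    total = 0.0
    k = observed
    for _ in range(n_terms):
        total += term
        k += 1
        term *= expected / k  # recurrence: pmf(k) = pmf(k - 1) * expected / k
    return total

# 16 observed cases against an assumed expected count of 2.
tail = poisson_upper_tail(16, 2.0)
```

Under these assumptions the tail probability is below one in a billion, consistent with the description of the cases as a cluster.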
Appreciation of the importance of the microbiome is increasing, as sequencing technology has made it possible to ascertain the microbial content of a variety of samples. Studies that sequence the 16S rRNA gene, ubiquitous in and nearly exclusive to bacteria, have proliferated in the medical literature. After sequences are binned into operational taxonomic units (OTUs) or species, data from these studies are summarized in a data matrix with the observed counts from each OTU for each sample. Analysis often reduces these data further to a matrix of pairwise distances or dissimilarities; plotting the first two or three principal components (PCs) of this distance matrix often reveals meaningful groupings in the data. However, once the distance matrix is calculated, it is no longer clear which OTUs or species are important to the observed clustering; further, the PCs are hard to interpret and cannot be calculated for subsequent observations. We show how to construct approximate decompositions of the data matrix that pair PCs with linear combinations of OTU or species frequencies, and show how these decompositions can be used to construct biplots, select important OTUs and partition the variability in the data matrix into contributions corresponding to PCs of an arbitrary distance or dissimilarity matrix. To illustrate our approach, we conduct an analysis of the bacteria found in 45 smokeless tobacco samples.
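The step of extracting principal components from a distance matrix can be sketched as classical principal coordinates analysis: double-center the squared distances, then take the leading eigenvector. This is the standard construction rather than the approximate decomposition proposed in the abstract. The function names, the power-iteration shortcut, and the toy distances are assumptions.

```python
import math

def gower_center(dist):
    """Return G = -1/2 * J * D^2 * J, the doubly centered squared-distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def first_principal_coordinate(dist, n_iter=500):
    """Sample scores on the leading PC of a distance matrix, by power iteration.

    Power iteration converges to the eigenvalue of largest magnitude, so this
    sketch assumes that eigenvalue is positive, as it is for Euclidean distances.
    """
    g = gower_center(dist)
    n = len(g)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(n_iter):
        w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector lies in the null space of G")
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(g[i][j] * v[j] for j in range(n)) for i in range(n))
    return [math.sqrt(max(eigenvalue, 0.0)) * x for x in v]

# Toy example: three samples at positions 0, 1, and 3 on a line.
positions = [0.0, 1.0, 3.0]
dist = [[abs(a - b) for b in positions] for a in positions]
scores = first_principal_coordinate(dist)
```

For this one-dimensional example the scores reproduce the original distances exactly. The sketch also shows the limitation noted above: nothing in the scores indicates which OTUs drive the separation.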
A novel mass spectral fingerprinting and proteomics approach using MALDI-TOF MS was applied to detect and identify protein biomarkers of group A Streptococcus (GAS) strains. Streptococcus pyogenes ATCC 700294 genome strain was compared with eight GAS clinical isolates to explore the ability of MALDI-TOF MS to differentiate isolates. Reference strains of other bacterial species were also analyzed and compared with the GAS isolates. MALDI preparations were optimized by varying solvents, matrices, plating techniques, and mass ranges for S. pyogenes ATCC 700294. Spectral variability was tested. A subset of common, characteristic, and reproducible biomarkers in the range of 2000-14 000 Da were detected, and they appeared to be independent of the culture media. Statistical analysis confirmed method reproducibility. Random Forest analysis of all selected GAS isolates revealed differences among most of them, and summed spectra were used for hierarchical cluster analysis. Specific biomarkers were found for each strain, and invasive GAS isolates could be differentiated. GAS isolates from cases of necrotizing fasciitis were clustered together and were distinct from isolates associated with noninvasive infections, despite their sharing the same emm type. Almost 30% of the biomarkers detected were tentatively identified as ribosomal proteins.
Currently there is a great deal of interest in developing methods for testing the role that rare variation plays in disease development. Here we propose a weighted association test that accumulates genetic variation across a signaling pathway. We evaluate our approach by analyzing simulated phenotype data from an exome sequencing study of 697 unrelated individuals from the Genetic Analysis Workshop 17 (GAW17) data set. Although our weighted approach identifies several interesting pathways associated with phenotype Q1, so does an alternative unweighted accumulation approach. Such a result is not unexpected because there is no systematic relationship between the allele frequency of a variant and its effect on phenotype in the GAW17 simulation model.
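The contrast between weighted and unweighted accumulation across a pathway can be sketched as a per-person burden score. The abstract does not give the weights, so the frequency-based weight 1/sqrt(q(1 - q)), the pseudocount, the function name, and the genotypes below are assumptions for illustration only.

```python
import math

def pathway_scores(genotypes, weighted=True):
    """Accumulate variant counts across a pathway into one score per person.

    genotypes[i][v] is the minor-allele count (0, 1, or 2) of person i at
    variant v. With weighted=True, rarer variants receive larger weights.
    """
    n = len(genotypes)
    n_var = len(genotypes[0])
    weights = []
    for v in range(n_var):
        # Pseudocount-smoothed allele frequency keeps q strictly inside (0, 1).
        q = (sum(row[v] for row in genotypes) + 1) / (2 * n + 2)
        weights.append(1.0 / math.sqrt(q * (1.0 - q)) if weighted else 1.0)
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

# Hypothetical genotypes: four people, two variants in one pathway.
genotypes = [[1, 0], [0, 1], [0, 1], [0, 0]]
weighted = pathway_scores(genotypes, weighted=True)
unweighted = pathway_scores(genotypes, weighted=False)
```

The two versions rank carriers differently only when allele frequency varies across variants. If frequency is unrelated to effect size, as in the GAW17 simulation model, the extra weight on rarer variants is not expected to help.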
We present computationally simple association tests based on haplotype sharing that can be easily applied to genome-wide association studies, while allowing use of fast (but not likelihood-based) haplotyping algorithms, and properly accounting for the uncertainty introduced by using inferred haplotypes. We also give haplotype sharing analyses that adjust for population stratification. We apply our methods to a genome-wide association study of rheumatoid arthritis available as Problem 1 of Genetic Analysis Workshop 16. In addition to the HLA region on chromosome 6, we find genome-wide significant signals at 7q33 and 13q31.3. These regions contain genes with interesting potential connections with rheumatoid arthritis and are not identified using single-SNP (single-nucleotide polymorphism) methods.