Compositional analysis is based on the premise that a relatively small proportion of taxa are differentially abundant, while the ratios of the relative abundances of the remaining taxa remain unchanged. Most existing methods use log-transformed data, but log-transformation of data with pervasive zero counts is problematic, and these methods cannot always control the false discovery rate (FDR). Further, high-throughput microbiome data such as 16S amplicon or metagenomic sequencing are subject to experimental biases that are introduced in every step of the experimental workflow. McLaren et al. [eLife 8, e46923 (2019)] have recently proposed a model for how these biases affect relative abundance data. Motivated by this model, we show that the odds ratios in a logistic regression comparing counts in two taxa are invariant to experimental biases. With this motivation, we propose logistic compositional analysis (LOCOM), a robust logistic regression approach to compositional analysis, that does not require pseudocounts. Inference is based on permutation to account for overdispersion and small sample sizes. Traits can be either binary or continuous, and adjustment for confounders is supported. Our simulations indicate that LOCOM always preserved FDR and had much improved sensitivity over existing methods. In contrast, analysis of composition of microbiomes (ANCOM) and ANCOM with bias correction (ANCOM-BC)/ANOVA-Like Differential Expression tool (ALDEx2) had inflated FDR when the effect sizes were small and large, respectively. Only LOCOM was robust to experimental biases in every situation. The flexibility of our method for a variety of microbiome studies is illustrated by the analysis of data from two microbiome studies. Our R package LOCOM is publicly available.
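The invariance claim in the abstract above can be checked numerically. The following is a minimal sketch, assuming that experimental bias acts as a taxon-specific multiplicative factor on the true relative abundances followed by renormalization; the abundance values, bias factors, and function names are hypothetical and are not taken from the LOCOM package.

```python
import numpy as np

# Hypothetical true relative abundances of 4 taxa in two groups of samples.
true_a = np.array([0.40, 0.30, 0.20, 0.10])
true_b = np.array([0.20, 0.30, 0.40, 0.10])

# Hypothetical taxon-specific multiplicative bias factors (assumption).
bias = np.array([5.0, 0.5, 2.0, 1.0])


def observe(true_props, bias):
    """Apply the multiplicative bias, then renormalize to relative abundances."""
    distorted = true_props * bias
    return distorted / distorted.sum()


def log_odds_ratios(p_a, p_b, ref):
    """Log odds ratio (group b vs. group a) of each taxon against a reference taxon."""
    return np.log(p_b / p_b[ref]) - np.log(p_a / p_a[ref])


ref = 3  # index of the reference taxon (arbitrary choice)
obs_a = observe(true_a, bias)
obs_b = observe(true_b, bias)

lor_true = log_odds_ratios(true_a, true_b, ref)
lor_obs = log_odds_ratios(obs_a, obs_b, ref)

# The observed proportions are distorted, but the log odds ratios are not.
print(np.round(obs_a, 3), np.round(lor_true - lor_obs, 12))
```

Under this assumed bias model, each taxon's bias factor appears in both groups and cancels in the odds ratio, which is why the two sets of log odds ratios agree even though the observed proportions differ from the true ones.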
Modern statistical analyses often involve testing large numbers of hypotheses. In many situations, these hypotheses may have an underlying tree structure that both helps determine the order in which tests should be conducted and imposes a dependency between tests that must be accounted for. Our motivating example comes from testing the association between a trait of interest and groups of microbes that have been organized into operational taxonomic units (OTUs) or amplicon sequence variants (ASVs). Given p-values from association tests for each individual OTU or ASV, we would like to know if we can declare a certain species, genus, or higher taxonomic group to be associated with the trait. For this problem, a bottom-up testing algorithm that starts at the lowest level of the tree (OTUs or ASVs) and proceeds upward through successively higher taxonomic groupings (species, genus, family, etc.) is required. We develop such a bottom-up testing algorithm that controls a novel error rate that we call the false selection rate. By simulation, we also show that our approach is better at finding driver taxa, the highest level taxa below which there are dense association signals. We illustrate our approach using data from a study of the microbiome among patients with ulcerative colitis and healthy controls. Supplementary materials for this article are available online.
It is known that data from both 16S and shotgun metagenomics studies are subject to biases that cause the observed relative abundances of taxa to differ from their true values. Model community analyses, in which the relative abundances of all taxa in the sample are known by construction, seem to offer the hope that these biases can be measured. However, it is unclear whether the bias we measure in a mock community analysis is the same as the bias we measure in a sample in which taxa are spiked in at known relative abundance, or whether the bias we measure in spike-in samples is the same as the bias we would measure in a real (e.g., biological) sample. Here, we consider these questions in the context of 16S rRNA measurements on three sets of samples: the commercially available Zymo cells model community; the Zymo model community mixed with Swedish Snus, a smokeless tobacco product that is virtually bacteria-free; and a set of commercially available smokeless tobacco products. Each set of samples was subject to four different extraction protocols. The goal of our analysis is to determine whether the patterns of bias observed in each set of samples are the same, i.e., can we learn about the bias in the commercially available smokeless tobacco products by studying the Zymo cells model community?
Background: Finding microbiome associations with possibly censored survival times is an important problem, especially as specific taxa could serve as biomarkers for disease prognosis or as targets for therapeutic interventions. The two existing methods for survival outcomes, MiRKAT-S and OMiSA, are restricted to testing associations at the community level and do not provide results at the individual taxon level. An ad hoc approach testing each taxon with a survival outcome using the Cox proportional hazards model may not perform well in the microbiome setting with sparse count data and small sample sizes. Methods: We have previously developed the linear decomposition model (LDM) for testing continuous or discrete outcomes that unifies community-level and taxon-level tests into one framework. Here we extend the LDM to test survival outcomes. We propose to use the martingale residuals or the deviance residuals obtained from the Cox model as continuous covariates in the LDM. We further construct tests that combine the results of analyzing each set of residuals separately. Finally, we extend PERMANOVA, the most commonly used distance-based method for testing community-level hypotheses, to handle survival outcomes in a similar manner. Results: Using simulated data, we showed that the LDM-based tests preserved the false discovery rate for testing individual taxa and had good sensitivity. The LDM-based community-level tests and PERMANOVA-based tests had comparable or better power than MiRKAT-S and OMiSA. An analysis of data on the association of the gut microbiome and the time to acute graft-versus-host disease revealed several dozen associated taxa that could not have been detected by any community-level test, as well as improved community-level tests by the LDM and PERMANOVA over those obtained using MiRKAT-S and OMiSA.
Conclusions: Unlike existing methods, our new methods are capable of discovering individual taxa that are associated with survival times, which could be of important use in clinical settings.
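To make the two residual types concrete, here is a minimal sketch that computes martingale and deviance residuals for a Cox model with no covariates, in which case the Breslow baseline cumulative hazard reduces to the Nelson-Aalen estimator. The function name and the toy data are hypothetical, and the covariate-free model is a simplifying assumption; the abstract does not specify which covariates enter the Cox model.

```python
import numpy as np


def null_cox_residuals(time, event):
    """Martingale and deviance residuals from a Cox model with no covariates.

    Assumption: with no covariates, the cumulative baseline hazard is the
    Nelson-Aalen estimator, so no iterative model fitting is needed.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)

    # Nelson-Aalen increments: (# events at t) / (# still at risk at t).
    event_times = np.unique(time[event == 1])
    increments = np.array(
        [event[time == t].sum() / (time >= t).sum() for t in event_times]
    )

    # Cumulative hazard evaluated at each subject's own follow-up time.
    cum_hazard = np.array([increments[event_times <= t].sum() for t in time])

    # Martingale residual: observed event indicator minus expected events.
    martingale = event - cum_hazard

    # Deviance residual: sign(M) * sqrt(-2 * [M + d * log(d - M)]),
    # where d - M equals the cumulative hazard and d * log(.) is 0 when d = 0.
    safe_hazard = np.where(event == 1, cum_hazard, 1.0)
    inner = martingale + event * np.log(safe_hazard)
    deviance = np.sign(martingale) * np.sqrt(np.maximum(-2.0 * inner, 0.0))
    return martingale, deviance


# Hypothetical follow-up times and event indicators (1 = event, 0 = censored).
time = [2, 3, 3, 5, 8]
event = [1, 0, 1, 1, 0]
martingale, deviance = null_cox_residuals(time, event)
print(np.round(martingale, 3), np.round(deviance, 3))
```

Either vector could then be supplied as a continuous covariate to a downstream association test. Deviance residuals are a more symmetric transformation of the martingale residuals, which is one plausible reason to analyze both and combine the results.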
Mediation models are a set of statistical techniques that investigate the mechanisms that produce an observed relationship between an exposure variable and an outcome variable in order to deduce the extent to which the relationship is influenced by intermediate mediator variables. For a case-control study, the most common mediation analysis strategy employs a counterfactual framework that permits estimation of indirect and direct effects on the odds ratio scale for dichotomous outcomes, assuming either binary or continuous mediators. While this framework has become an important tool for mediation analysis, we demonstrate that we can embed this approach in a unified likelihood framework for mediation analysis in case-control studies that leverages more features of the data (in particular, the relationship between exposure and mediator) to improve efficiency of indirect effect estimates. One important feature of our likelihood approach is that it naturally incorporates cases within the exposure-mediator model to improve efficiency. Our approach does not require knowledge of disease prevalence and can model confounders and exposure-mediator interactions, and is straightforward to implement in standard statistical software. We illustrate our approach using both simulated data and real data from a case-control genetic study of lung cancer.
DNA methylation (DNAm) plays diverse roles in human biology, but this dynamic epigenetic mark remains far from fully characterized. Although earlier studies uncovered loci that undergo age-associated DNAm changes in adults, little is known about such changes during childhood. Despite profound DNAm plasticity during embryogenesis, monozygotic twins show indistinguishable childhood methylation, suggesting that DNAm is highly coordinated throughout early development. Here we examine the methylation of 27,578 CpG dinucleotides in peripheral blood DNA from a cross-sectional study of 398 boys, aged 3–17 yr, and find significant age-associated changes in DNAm at 2078 loci. These findings correspond well with pyrosequencing data and replicate in a second pediatric population (N = 78). Moreover, we report a deficit of age-related loci on the X chromosome, a preference for specific nucleotides immediately surrounding the interrogated CpG dinucleotide, and a primary association with developmental and immune ontological functions. Meta-analysis (N = 1158) with two adult populations reveals that despite a significant overlap of age-associated loci, most methylation changes do not follow a lifelong linear pattern due to a threefold to fourfold higher rate of change in children compared with adults; consequently, the vast majority of changes are more accurately modeled as a function of logarithmic age. We therefore conclude that age-related DNAm changes in peripheral blood occur more rapidly during childhood and are imperfectly accounted for by statistical corrections that are linear in age, further suggesting that future DNAm studies should be matched closely for age.
Background: Matched-set data arise frequently in microbiome studies. For example, we may collect pre- and post-treatment samples from a set of individuals, or use important confounding variables to match data from case participants to one or more control participants. Thus, there is a need for statistical methods for data comprised of matched sets, to test hypotheses against traits of interest (e.g., clinical outcomes or environmental factors) at the community level and/or the operational taxonomic unit (OTU) level. Optimally, these methods should accommodate complex data such as those with unequal sample sizes across sets, confounders varying within sets, and continuous traits of interest. Methods: PERMANOVA is a commonly used distance-based method for testing hypotheses at the community level. We have also developed the linear decomposition model (LDM) that unifies the community-level and OTU-level tests into one framework. Here we present a new strategy that can be used with both PERMANOVA and the LDM for analyzing matched-set data. We propose to include an indicator variable for each set as covariates, so as to constrain comparisons between samples within a set, and also permute traits within each set, which can account for exchangeable sample correlations. The flexible nature of PERMANOVA and the LDM allows discrete or continuous traits or interactions to be tested, within-set confounders to be adjusted, and unbalanced data to be fully exploited. Results: Our simulations indicate that our proposed strategy outperformed alternative strategies, including the commonly used one that utilizes restricted permutation only, in a wide range of scenarios. Using simulation, we also explored optimal designs for matched-set studies. The flexibility of PERMANOVA and the LDM for a variety of matched-set microbiome data is illustrated by the analysis of data from two real studies. 
Conclusions: Including set indicator variables and permuting within sets when analyzing matched-set data with PERMANOVA or the LDM is a strategy that performs well and is capable of handling the complex data structures that frequently occur in microbiome studies.
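The two ingredients of the strategy described above, set indicator covariates and within-set permutation, can be sketched as follows. This is an illustration under stated assumptions, not the PERMANOVA or LDM implementation; the function names and the toy data are hypothetical.

```python
import numpy as np


def permute_within_sets(trait, set_ids, rng):
    """Shuffle trait values only among samples that share a set label."""
    trait = np.asarray(trait)
    set_ids = np.asarray(set_ids)
    permuted = trait.copy()
    for s in np.unique(set_ids):
        idx = np.flatnonzero(set_ids == s)
        permuted[idx] = trait[rng.permutation(idx)]
    return permuted


def set_indicator_matrix(set_ids):
    """Build one 0/1 indicator column per matched set."""
    set_ids = np.asarray(set_ids)
    return (set_ids[:, None] == np.unique(set_ids)[None, :]).astype(float)


rng = np.random.default_rng(1)
set_ids = np.array([1, 1, 2, 2, 2, 3, 3])  # hypothetical sets of unequal size
trait = np.array([0, 1, 0, 0, 1, 1, 0])    # hypothetical binary trait

permuted = permute_within_sets(trait, set_ids, rng)
indicators = set_indicator_matrix(set_ids)
print(permuted, indicators.shape)
```

Because each sample belongs to exactly one set, the indicator columns sum to a column of ones, so a model that includes all of them should not also include a separate intercept. Permuting only within sets keeps each set's collection of trait values fixed, which is what restricts the comparison to samples in the same set.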
by Robert E Tyxobert; Angel J Rivera; Glen Satten; Lisa M Keong; Peter Kuklenyik; Grace E Lee; Tameka S Lawler; Jacob B Kimbrell; Stephen B Stanfill; Liza Valentin-Blasini; Clifford H Watson
Background: Smokeless tobacco (ST) products are widely used throughout the world and contribute to morbidity and mortality in users through an increased risk of cancers and oral diseases. Bacterial populations in ST contribute to taste, but their presence can also create carcinogenic tobacco-specific N-nitrosamines (TSNAs). Previous studies of microbial communities in tobacco products lacked chemistry data (e.g., nicotine, TSNAs) to characterize the products and identify associations between carcinogen levels and taxonomic groups. This study uses statistical analysis to identify potential associations between microbial and chemical constituents in moist snuff products. Methods: We quantitatively analyzed 38 smokeless tobacco products for TSNAs using liquid chromatography with tandem mass spectrometry (LC-MS/MS), and nicotine using gas chromatography with mass spectrometry (GC-MS). Moisture content determinations (by weight loss on drying) and pH measurements were also performed. We used 16S rRNA gene sequencing to characterize the microbial composition, and additionally measured total 16S bacterial counts using a quantitative PCR assay. Results: Our findings link chemical constituents to their associated bacterial populations. We found core taxonomic groups often varied between manufacturers. When manufacturer and flavor were controlled for as confounding variables, the genus Lactobacillus was found to be positively associated with TSNAs, while the genera Enteractinococcus and Brevibacterium were negatively associated. Three genera (Corynebacterium, Brachybacterium, and Xanthomonas) were found to be negatively associated with nicotine concentrations. Associations were also investigated separately for products from each manufacturer. Products from one manufacturer had a positive association between TSNAs and bacteria in the genus Marinilactibacillus.
Additionally, we found that TSNA levels in many products were lower compared with previously published chemical surveys. Finally, we observed consistent results when either relative or absolute abundance data were analyzed, while results from analyses of log-ratio-transformed abundances were divergent.
Objective: To evaluate the association between the early pregnancy vaginal microbiome and spontaneous preterm birth (sPTB) and early term birth (sETB) among African American women. Methods: Vaginal samples collected in early pregnancy (8-14 weeks’ gestation) from 436 women enrolled in the Emory University African American Vaginal, Oral, and Gut Microbiome in Pregnancy Study underwent 16S rRNA gene sequencing of the V3-V4 region, taxonomic classification, and community state type (CST) assignment. We compared vaginal CST and abundance of taxa for women whose pregnancy ended in sPTB (N = 44) or sETB (N= 84) to those who delivered full term (N = 231). Results: Nearly half of the women had a vaginal microbiome classified as CST IV (Diverse CST), while one-third had CST III (L. iners dominated) and just 16% had CST I, II, or V (non-iners Lactobacillus dominated). Compared to vaginal CST I, II, or V (non-iners Lactobacillus dominated), both CST III (L. iners dominated) and CST IV (Diverse) were associated with sPTB with an adjusted odds ratio (95% confidence interval) of 4.1 (1.1, infinity) and 7.7 (2.2, infinity), respectively, in multivariate logistic regression. In contrast, no vaginal CST was associated with sETB. The linear decomposition model (LDM) based on amplicon sequence variant (ASV) relative abundance found a significant overall effect of the vaginal microbiome on sPTB (p=0.034) but not sETB (p=0.320), whereas the LDM based on presence/absence of ASV found no overall effect on sPTB (p=0.328) but a significant effect on sETB (p=0.030). 
In testing for ASV-specific effects, the LDM found that no ASV was significantly associated with sPTB considering either relative abundance or presence/absence data after controlling for multiple comparisons (FDR 10%), although in marginal analysis the relative abundance of Gardnerella vaginalis (p=0.011), non-iners Lactobacillus (p=0.016), and Mobiluncus curtisii (p=0.035) and the presence of Atopobium vaginae (p=0.049), BVAB2 (p=0.024), Dialister microaerophilis (p=0.011), and Prevotella amnii (p=0.044) were associated with sPTB. The LDM identified the higher abundance of 7 ASVs and the presence of 13 ASVs, all common residents of the gut, as associated with sETB at FDR < 10%. Conclusions: In this cohort of African American women, an early pregnancy vaginal CST III or IV was associated with an increased risk of sPTB but not sETB. The relative abundance and presence of distinct taxa within the early pregnancy vaginal microbiome were associated with either sPTB or sETB.
Schizophrenia (SZ) is a severe psychiatric illness that affects ∼1% of the population and has a strong genetic underpinning. Recently, genome-wide analysis of copy-number variation (CNV) has implicated rare and de novo events as important in SZ. Here, we report a genome-wide analysis of 245 SZ cases and 490 controls, all of Ashkenazi Jewish descent. Because many studies have found an excess burden of large, rare deletions in cases, we limited our analysis to deletions over 500 kb in size. We observed seven large, rare deletions in cases, with 57% of these being de novo. We focused on one 836 kb de novo deletion at chromosome 3q29 that falls within a 1.3–1.6 Mb deletion previously identified in children with intellectual disability (ID) and autism, because increasing evidence suggests an overlap of specific rare copy-number variants (CNVs) between autism and SZ. By combining our data with prior CNV studies of SZ and analysis of the data of the Genetic Association Information Network (GAIN), we identified six 3q29 deletions among 7545 schizophrenic subjects and one among 39,748 controls, resulting in a statistically significant association with SZ (p = 0.02) and an odds ratio estimate of 17 (95% confidence interval: 1.36–1198.4). Moreover, this 3q29 deletion region contains two linkage peaks from prior SZ family studies, and the minimal deletion interval implicates 20 annotated genes, including PAK2 and DLG1, both paralogous to X-linked ID genes and now strong candidates for SZ susceptibility.