Genome-wide association studies (GWAS) are a popular approach for identifying common genetic variants and epistatic effects associated with a disease phenotype. The traditional statistical analysis of such GWAS attempts to assess the association between each individual single-nucleotide polymorphism (SNP) and the observed phenotype. Recently, kernel machine-based tests for association between a SNP set (e.g., SNPs in a gene) and the disease phenotype have been proposed as a useful alternative to the traditional individual-SNP approach, and allow for flexible modeling of the potentially complicated joint SNP effects in a SNP set while adjusting for covariates. We extend the kernel machine framework to accommodate related subjects from multiple independent families, and provide a score-based variance component test for assessing the association of a given SNP set with a continuous phenotype, while adjusting for additional covariates and accounting for within-family correlation. We illustrate the proposed method using simulation studies and an application to genetic data from the Genetic Epidemiology Network of Arteriopathy (GENOA) study.
Genome-Wide association studies (GWAS) routinely apply principal component analysis (PCA) to infer population structure within a sample to correct for confounding due to ancestry. GWAS implementation of PCA uses tens of thousands of single-nucleotide polymorphisms (SNPs) to infer structure, despite the fact that only a small fraction of such SNPs provides useful information on ancestry. The identification of this reduced set of Ancestry-Informative markers (AIMs) from a GWAS has practical value; for example, researchers can genotype the AIM set to correct for potential confounding due to ancestry in follow-up studies that utilize custom SNP or sequencing technology. We propose a novel technique to identify AIMs from Genome-Wide SNP data using sparse PCA. The procedure uses penalized regression methods to identify those SNPs in a Genome-Wide panel that significantly contribute to the principal components while encouraging SNPs that provide negligible loadings to vanish from the analysis. We found that sparse PCA leads to negligible loss of ancestry information compared to traditional PCA analysis of Genome-Wide SNP data. We further demonstrate the value of sparse PCA for AIM selection using real data from the International HapMap Project and a Genome-Wide study of inflammatory bowel disease. We have implemented our approach in open-source R software for public use.