Publication

Sparse Principal Component Analysis for Identifying Ancestry-Informative Markers in Genome-Wide Association Studies

Last modified
  • 2025-05-15
Authors
  • Seokho Lee, Hankuk University of Foreign Studies
  • Michael Epstein, Emory University
  • Richard Duncan, Emory University
  • Xihong Lin, Harvard University
Language
  • English
Date
  • 2012-05-01
Publisher
  • Wiley: 12 months
Copyright Statement
  • © 2012 Wiley Periodicals, Inc.
ISSN
  • 0741-0395
Volume
  • 36
Issue
  • 4
Start Page
  • 293
End Page
  • 302
Grant/Funding Information
  • This work was supported by National Institutes of Health grants R37CA076404 and P01CA134294 (to X.L. and S.L.) and HG003618 (to M.P.E. and R.D.).
  • S.L. was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (2011-0011608).
Abstract
  • Genome-wide association studies (GWAS) routinely apply principal component analysis (PCA) to infer population structure within a sample to correct for confounding due to ancestry. GWAS implementation of PCA uses tens of thousands of single-nucleotide polymorphisms (SNPs) to infer structure, despite the fact that only a small fraction of such SNPs provides useful information on ancestry. The identification of this reduced set of ancestry-informative markers (AIMs) from a GWAS has practical value; for example, researchers can genotype the AIM set to correct for potential confounding due to ancestry in follow-up studies that utilize custom SNP or sequencing technology. We propose a novel technique to identify AIMs from genome-wide SNP data using sparse PCA. The procedure uses penalized regression methods to identify those SNPs in a genome-wide panel that significantly contribute to the principal components while encouraging SNPs that provide negligible loadings to vanish from the analysis. We found that sparse PCA leads to negligible loss of ancestry information compared to traditional PCA of genome-wide SNP data. We further demonstrate the value of sparse PCA for AIM selection using real data from the International HapMap Project and a genome-wide study of inflammatory bowel disease. We have implemented our approach in open-source R software for public use.
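The idea in the abstract, that a penalty drives negligible SNP loadings to exactly zero, can be illustrated with a minimal sketch. This is not the authors' R software. It assumes one common formulation of sparse PCA (a rank-one fit whose loadings are updated by soft-thresholded regression on the component scores); the abstract does not state which penalty the paper uses. The function names, the penalty level `lam`, and the six-sample, five-SNP toy panel are all hypothetical.

```python
import math
import random


def soft_threshold(x, lam):
    """Lasso-style shrinkage: move x toward zero by lam, clipping at exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def center_columns(rows):
    """Subtract each SNP's mean genotype so every column has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[g - m for g, m in zip(row, means)] for row in rows]


def sparse_pc(X, lam, warmup=50, n_iter=100, seed=0):
    """First sparse principal component of a centered samples-by-SNPs matrix.

    Alternates two steps: regress each SNP column on the current scores u and
    soft-threshold the coefficient (loadings v), then recompute unit-length
    scores from X v. The first `warmup` passes are unpenalized, so the
    penalized passes start from the ordinary first principal component.
    lam is an assumed tuning parameter; lam=0 gives ordinary PCA.
    """
    n, p = len(X), len(X[0])
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = math.sqrt(sum(a * a for a in u))
    u = [a / scale for a in u]
    v = [0.0] * p
    for it in range(warmup + n_iter):
        penalty = 0.0 if it < warmup else lam
        v = [soft_threshold(sum(X[i][j] * u[i] for i in range(n)), penalty)
             for j in range(p)]
        Xv = [sum(X[i][j] * v[j] for j in range(p)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in Xv))
        if norm == 0.0:  # penalty removed every SNP; keep the last scores
            break
        u = [a / norm for a in Xv]
    # Resolve the sign ambiguity: make the largest loading positive.
    j_max = max(range(p), key=lambda j: abs(v[j]))
    if v[j_max] < 0:
        u = [-a for a in u]
        v = [-b for b in v]
    return u, v


# Hypothetical genotypes (0/1/2 allele counts): rows 0-2 come from one
# population, rows 3-5 from another; only SNPs 0 and 1 differ strongly.
genotypes = [
    [2, 2, 1, 0, 1],
    [2, 2, 0, 1, 1],
    [2, 1, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 1, 0],
]
X = center_columns(genotypes)
scores_plain, loadings_plain = sparse_pc(X, lam=0.0)
scores, loadings = sparse_pc(X, lam=1.0)
selected = [j for j, w in enumerate(loadings) if w != 0.0]
print("plain loadings: ", [round(w, 3) for w in loadings_plain])
print("sparse loadings:", [round(w, 3) for w in loadings])
print("candidate AIM columns:", selected)
```

In this toy panel the penalized fit keeps nonzero loadings only for the two SNPs that differ between the populations, and the scores still separate the two groups by sign. On real genome-wide data the penalty level would need to be tuned, for example by checking how well the sparse scores reproduce the ordinary principal components.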
Author Notes
  • Xihong Lin, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, Email: xlin@hsph.harvard.edu
Research Categories
  • Statistics
  • Biology, Genetics
  • Biology, Biostatistics
