About this item:

Author Notes:

Corresponding authors: R Mitchell Parry: parry@gatech.edu; May D Wang: maywang@bme.gatech.edu

RMP conceived of the least-squares approach to inferring population structure, designed the study, and drafted the document.

MDW initiated the SNP data analysis project, acquired funding to sponsor this effort, and directed the project and publication.

All authors read and approved the final manuscript.

The authors declare that they have no competing interests.


Research Funding:

This work was supported in part by grants from Microsoft Research, National Institutes of Health (Bioengineering Research Partnership R01CA108468, P20GM072069, Center for Cancer Nanotechnology Excellence U54CA119338, and 1RC2CA148265), and Georgia Cancer Coalition (Distinguished Cancer Scholar Award to Professor M. D. Wang).


  • Science & Technology
  • Life Sciences & Biomedicine
  • Biochemical Research Methods
  • Biotechnology & Applied Microbiology
  • Mathematical & Computational Biology
  • Biochemistry & Molecular Biology

A fast least-squares algorithm for population inference


Journal Title:

BMC Bioinformatics


Volume 14, Pages 28-28

Type of Work:

Article | Final Publisher PDF


Background: Population inference is an important problem in genetics used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual’s genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters, P and Q, of this binomial likelihood model can be inferred using slow sampling methods such as Markov chain Monte Carlo or faster gradient-based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model motivated by a Euclidean interpretation of the genotype feature space. This results in a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning.

Results: We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The least-squares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, a small number of samples, or a greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than least-squares. The least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. Significantly, the least-squares approach nearly always converges 1.5 to 6 times faster.

Conclusions: The computational advantage of the least-squares approach, along with its good estimation performance, warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance between all algorithms decreases. In addition, when prior information is known, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate.
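The abstract does not give the algorithm itself, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes the simplest "partly solved" case: two ancestral populations whose allele frequencies P are already known, diploid genotypes coded as allele counts 0, 1, or 2, and an expected genotype of 2(q1·p1 + q2·p2). Under those assumptions the least-squares estimate of one individual's membership Q has a closed form. The function name and arguments are hypothetical.

```python
def estimate_q_two_pops(genotype, p1, p2):
    """Least-squares ancestry fractions for one individual, two populations.

    genotype: allele counts per SNP (0, 1, or 2); assumed diploid coding
    p1, p2:   per-SNP allele frequencies in populations 1 and 2 (assumed known)
    Returns (q1, q2) with q1 + q2 == 1 and both fractions in [0, 1].
    """
    num = 0.0
    den = 0.0
    for g, a, b in zip(genotype, p1, p2):
        d = a - b                    # how much the populations differ at this SNP
        num += (g / 2.0 - b) * d     # observed allele fraction relative to pop 2
        den += d * d
    if den == 0.0:
        raise ValueError("populations have identical allele frequencies")
    # With q2 = 1 - q1 the problem is a one-dimensional quadratic, so clipping
    # the unconstrained minimizer to [0, 1] gives the constrained minimizer.
    q1 = min(1.0, max(0.0, num / den))
    return q1, 1.0 - q1


# Hypothetical allele frequencies for four SNPs in two populations.
p1 = [0.9, 0.1, 0.8, 0.2]
p2 = [0.1, 0.9, 0.2, 0.8]
genotype = [2, 0, 1, 0]
q1, q2 = estimate_q_two_pops(genotype, p1, p2)
```

With more than two populations, or when P must be estimated as well, the constraints on Q and P need a proper constrained solver and an alternating update scheme; the closed form above does not cover those cases.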

Copyright information:

©2013 Parry and Wang; licensee BioMed Central Ltd.

This is an Open Access work distributed under the terms of the Creative Commons Attribution 2.0 Generic License (http://creativecommons.org/licenses/by/2.0/).
