Estimating Local Ancestry in Admixed Populations

Full text: http://ajhg.org/AJHG/fulltext/S0002-9297(08)00079-7

By the way:
Cell Press is proud to take over the publishing of The American Journal of Human Genetics from the University of Chicago Press. To facilitate the transition all content on the site will be freely available until April 2008. Members of The American Society of Human Genetics will be sent an email explaining how they can activate their online subscription in the middle of February.


The American Journal of Human Genetics, Volume 82, Issue 2, 290-303, 8 February 2008

Estimating Local Ancestry in Admixed Populations

Sriram Sankararaman, Srinath Sridhar, Gad Kimmel and Eran Halperin

Large-scale genotyping of SNPs has shown great promise in identifying markers that could be linked to diseases. One of the major obstacles involved in performing these studies is that the underlying population substructure could produce spurious associations. Population substructure can be caused by the presence of two distinct subpopulations or a single pool of admixed individuals. In this work, we focus on the latter, which is significantly harder to detect in practice. New advances in this research direction are expected to play a key role in identifying loci that are different among different populations and are still associated with a disease. We evaluated current methods for inference of population substructure in such cases and show that they might be quite inaccurate even in relatively simple scenarios. We therefore introduce a new method, LAMP (Local Ancestry in adMixed Populations), which infers the ancestry of each individual at every single-nucleotide polymorphism (SNP). LAMP computes the ancestry structure for overlapping windows of contiguous SNPs and combines the results with a majority vote. Our empirical results show that LAMP is significantly more accurate and more efficient than existing methods for inferring locus-specific ancestries, enabling it to handle large-scale datasets. We further show that LAMP can be used to estimate the individual admixture of each individual. Our experimental evaluation indicates that this extension yields a considerably more accurate estimate of individual admixture than state-of-the-art methods such as STRUCTURE or EIGENSTRAT, which are frequently used for the correction of population stratification in association studies.
[. . .]
The problem of inferring the population substructure is especially challenging when recently admixed populations are involved. In these populations (e.g., African Americans and Latinos), two or more ancestral populations have been mixing for a relatively small number of generations, resulting in a new population in which the ancestry of every individual can be explained by different proportions of the original populations. Because of recombination events, even within the DNA of a single individual, different regions of the genome could originate from different ancestral populations. This adds to the complexity of the problem of finding the ancestral information of an individual because in nonadmixed populations, the whole genome can be used as evidence for the population membership of an individual, whereas in the admixed case, the genome of each individual is fragmented into shorter regions of different ancestry. It is therefore challenging to find the ancestral information of these individuals and, in particular, to find the locus-specific ancestries.
[. . .]
Here, we propose a new method, LAMP (Local Ancestry in adMixed Populations), for de novo estimation of the locus-specific ancestry in recently admixed populations (see Figure 1). Our method is based on the observation that previous methods that use a Hidden Markov Model, or extensions of it, are set to infer a very large set of parameters, including the exact position of the recombination events, making the search over the parameter space infeasible. Instead, our method operates on sliding windows of contiguous SNPs. We first calculate an optimal window length. Next, we use a clustering algorithm that operates on these windows and estimates each individual's ancestry. We then use a majority vote for each SNP, over all windows that overlap with the SNP, in order to decide the most likely ancestral populations at the SNP. This simple approach has two advantages over previous ones. First, we show analytically that the estimates of the algorithm are asymptotically correct across the entire genome. Second, it optimizes fewer parameters than previous methods, and hence the optimization is much faster and more robust than previous methods.
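
The window-and-vote step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the clustering step is omitted and its output is assumed to be given as one ancestry label per window for a single individual. The function names, the fixed step size, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def sliding_windows(n_snps, window_len, step):
    """Return (start, end) index pairs of overlapping windows covering all SNPs.

    The window length is taken as given here; the paper computes an
    optimal length, which this sketch does not attempt to reproduce.
    """
    if n_snps <= window_len:
        return [(0, n_snps)]
    starts = list(range(0, n_snps - window_len + 1, step))
    if starts[-1] + window_len < n_snps:
        # Add a final window that ends exactly at the last SNP.
        starts.append(n_snps - window_len)
    return [(s, s + window_len) for s in starts]


def majority_vote(n_snps, windows, window_labels):
    """Combine one ancestry label per window into one label per SNP.

    Each SNP receives the label chosen by most of the windows that
    overlap it. Ties go to the label that was seen first (an arbitrary
    choice made for this sketch).
    """
    votes = [Counter() for _ in range(n_snps)]
    for (start, end), label in zip(windows, window_labels):
        for snp in range(start, end):
            votes[snp][label] += 1
    return [v.most_common(1)[0][0] for v in votes]


if __name__ == "__main__":
    windows = sliding_windows(n_snps=8, window_len=3, step=1)
    # Hypothetical per-window labels, as a clustering step might produce.
    labels = ["A", "A", "A", "B", "B", "B"]
    print(majority_vote(8, windows, labels))
```

Because every SNP is covered by several windows, a single misclassified window near a breakpoint is outvoted by its neighbours.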

We tested LAMP extensively on various datasets of admixed populations generated from the HapMap resource. Our simulations show that LAMP is significantly more accurate than state-of-the-art methods such as SABER and STRUCTURE. In addition, LAMP is highly efficient, with a running time that is about 200 times faster than SABER and about 10^4 times faster than STRUCTURE. The efficiency of LAMP allows us to estimate ancestries across the genome in several hours on a single computer.

An additional advantage of LAMP is that unlike previous methods, such as SABER, it does not require the ancestral genotypes to infer the locus-specific ancestries (though it can take advantage of these, if available). This might be crucial when the ancestral genotypes cannot be typed or are unknown. For instance, if one studies the population genetics of populations in remote geographic locations where historical admixing has not been recorded, a method such as LAMP could be used to reveal such recent admixing. Furthermore, even in cases where the history of admixing is known, it is not always possible to genotype all the ancestral populations because some of the subpopulations have become extinct and some have entirely mixed with other populations. On the other hand, as genotypes of major population groups become available, it would be beneficial to use LAMP-ANC (ANC: ancestral), which can take advantage of the pure genotypes.

Surprisingly, we find that in many cases where LAMP does not receive the genotypes of the ancestral populations as input, it performs considerably better than SABER. In particular, on a simulated dataset of African Americans, when measuring the percentage of individuals that are predicted with an accuracy of at least 90%, LAMP achieves high accuracies on 90% of the individuals, whereas SABER and STRUCTURE achieve less than 10%.
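
The evaluation measure mentioned above (the share of individuals whose ancestry is predicted with at least 90% accuracy) could be computed roughly as follows. This is a sketch under assumptions: per-individual accuracy is taken to be the fraction of SNPs whose predicted ancestry matches the simulated truth, which may differ from the paper's exact definition, and the function names are made up for the example.

```python
def per_individual_accuracy(true_anc, pred_anc):
    """Fraction of SNPs at which the predicted ancestry equals the truth."""
    if len(true_anc) != len(pred_anc) or not true_anc:
        raise ValueError("ancestry vectors must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(true_anc, pred_anc))
    return matches / len(true_anc)


def fraction_above(truth, predictions, threshold=0.9):
    """Share of individuals whose per-SNP accuracy is at least `threshold`."""
    accuracies = [
        per_individual_accuracy(t, p) for t, p in zip(truth, predictions)
    ]
    return sum(a >= threshold for a in accuracies) / len(accuracies)
```

Reporting the share of well-predicted individuals, rather than a single average, shows whether errors are concentrated in a few individuals or spread across the whole sample.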

Finally, we used LAMP to estimate the individual admixture and showed empirically that this results in much more accurate estimates than methods such as STRUCTURE [12] or EIGENSTRAT [2]. This reduction in errors might be used to considerably reduce the rate of spurious association results in disease association studies.
[. . .]
We have presented a new method, LAMP, for de novo estimation of locus-specific ancestry in recently admixed populations. Unlike previous methods for locus-specific ancestry (e.g., SABER), LAMP does not use any information about the ancestral populations (i.e., it estimates the ancestries de novo). We show that LAMP is analytically justified and that it achieves significant improvements over existing methods both in terms of accuracy of prediction and speed. In particular, LAMP can easily be applied to whole-genome datasets, and the resulting locus-specific ancestries can be estimated within a few hours.

De novo estimation of the locus-specific ancestries is sometimes infeasible, especially when the ancestral populations are very close to each other (e.g., CHB and JPT). We therefore extended LAMP to a method called LAMP-ANC, which uses additional genotypes from the ancestral populations as priors. This approach has been shown to be useful before by methods such as SABER.
[. . .]
Although LAMP relies on knowledge of the parameters g and α, we have shown the robustness of the ancestry estimates to inaccuracies in these parameters. These parameters control the window size. As the window size is decreased, each window might contain fewer informative SNPs. On the other hand, errors in classifying individuals who have breakpoints within a window are reduced. This tradeoff is illustrated in Figure 7, where we see that the ancestry estimates are robust when g is overestimated. In practice, we would therefore recommend the use of an upper bound on g when g cannot be estimated accurately. Furthermore, g might actually be a more complex parameter: for example, some portions of the admixed population might have admixed for g1 generations and other portions for only g2 generations, where g2 is smaller than g1. In this case, g is set to be g1, and more accurate results are expected than if the whole population had admixed for exactly g1 generations.
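
To make the breakpoint side of this tradeoff concrete, here is a small sketch. It does not use the paper's window-length formula, which is not reproduced in this excerpt. Instead it assumes a common textbook approximation: along one haplotype, ancestry breakpoints accumulated over g generations follow a Poisson process with rate about g per Morgan. The function names and the target probability are assumptions for illustration only.

```python
import math


def prob_breakpoint_in_window(g, window_morgans):
    """P(at least one breakpoint in the window), assuming a Poisson
    process with rate g per Morgan (an approximation, not LAMP's model)."""
    return 1.0 - math.exp(-g * window_morgans)


def window_for_target(g, target_prob):
    """Longest window (in Morgans) whose breakpoint probability stays
    at or below `target_prob` under the same Poisson assumption."""
    if not 0.0 < target_prob < 1.0:
        raise ValueError("target_prob must lie strictly between 0 and 1")
    return -math.log(1.0 - target_prob) / g


if __name__ == "__main__":
    for g in (5, 10, 20):
        length = window_for_target(g, target_prob=0.1)
        print(f"g={g:2d}  window length = {length:.4f} Morgans")
```

Under this approximation a larger g always yields a shorter window, so plugging in an upper bound on g errs toward fewer windows that straddle a breakpoint, at the cost of fewer informative SNPs per window.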
[. . .]
A simple extension to LAMP can be used to infer the individual admixture. As we show here, the resulting estimates of the individual admixture are considerably better than the estimates achieved by STRUCTURE or EIGENSTRAT. A number of recent studies have produced panels of AIMs in admixed populations [33, 34, 35, 36]; AIMs are SNPs that have differing frequencies in the ancestral populations. It is possible that the AIMs might be used to improve the accuracy of individual admixture prediction done by STRUCTURE or other methods, including LAMP. However, the AIMs have disadvantages because there is a risk of overfitting, and the studied population might be somewhat different from the population for which the AIMs were found. As we show here, in an era where the genotyping technology is getting cheaper, it is useful to use the entire set of genotyped SNPs in the analysis of population stratification.
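
The excerpt does not spell out the extension, but one natural way to turn locus-specific ancestries into an individual admixture estimate is to average them across the genome. The sketch below assumes a two-population admixture and a diploid individual whose local ancestry at each SNP is coded as the number of alleles (0, 1 or 2) inherited from the first ancestral population. Both the coding and the function name are assumptions, and this may not match what LAMP actually does.

```python
def individual_admixture(local_counts):
    """Estimate the genome-wide proportion of ancestry from population 1.

    `local_counts` holds, for each SNP, how many of the individual's two
    alleles are assigned to population 1 (0, 1 or 2).
    """
    if not local_counts:
        raise ValueError("need at least one SNP")
    if any(c not in (0, 1, 2) for c in local_counts):
        raise ValueError("each local ancestry count must be 0, 1 or 2")
    return sum(local_counts) / (2 * len(local_counts))


if __name__ == "__main__":
    # Hypothetical local ancestry calls for one individual at eight SNPs.
    print(individual_admixture([2, 2, 1, 1, 0, 0, 0, 0]))
```

An estimate built this way uses every genotyped SNP rather than a preselected AIM panel, in line with the recommendation above.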
