The GenoChip: A New Tool for Genetic Anthropology

Preprint at arXiv:
The Genographic Project is an international effort using genetic data to chart human migratory history. The project is non-profit and non-medical, and through its Legacy Fund supports locally led efforts to preserve indigenous and traditional cultures. In its second phase, the project is focusing on markers from across the entire genome to obtain a more complete understanding of human genetic variation. Although many commercial arrays exist for genome-wide SNP genotyping, they were designed for medical genetic studies and contain medically related markers that are not appropriate for global population genetic studies. GenoChip, the Genographic Project's new genotyping array, was designed to resolve these issues and enable higher-resolution research into outstanding questions in genetic anthropology. We developed novel methods to identify AIMs and genomic regions that may be enriched with alleles shared with ancestral hominins. Overall, we collected and ascertained AIMs from over 450 populations. Containing an unprecedented number of Y-chromosomal and mtDNA SNPs and over 130,000 SNPs from the autosomes and X-chromosome, the chip was carefully vetted to avoid inclusion of medically relevant markers. The GenoChip results were successfully validated. To demonstrate its capabilities, we compared the FST distributions of GenoChip SNPs to those of two commercial arrays for three continental populations. While all arrays yielded similarly shaped (inverse J) FST distributions, the GenoChip autosomal and X-chromosomal distributions had the highest mean FST, attesting to its ability to discern subpopulations. The GenoChip is a dedicated genotyping platform for genetic anthropology and promises to be the most powerful tool available for assessing population structure and migration history.
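For readers unfamiliar with the statistic the abstract leans on, here is a minimal sketch of a per-SNP FST and a panel-wide mean. It assumes biallelic SNPs, equally weighted populations, and the plain Wright definition FST = (HT - HS) / HT; the paper may well use a different estimator, and the function names and allele frequencies below are made up for illustration.

```python
from statistics import mean


def wright_fst(freqs):
    """Wright's FST for one biallelic SNP.

    freqs: allele frequency of the same allele in each population,
    with every population weighted equally (an assumption of this sketch).
    """
    p_bar = mean(freqs)
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity in the pooled population
    if h_t == 0:
        return 0.0  # SNP is monomorphic everywhere, so there is no differentiation
    h_s = mean(2 * p * (1 - p) for p in freqs)  # mean within-population heterozygosity
    return (h_t - h_s) / h_t


def mean_fst(panel):
    """Mean FST over a panel given as a list of per-SNP frequency lists."""
    return mean(wright_fst(freqs) for freqs in panel)


# Hypothetical frequencies for three continental populations, not real data.
toy_panel = [
    [0.10, 0.50, 0.90],
    [0.45, 0.50, 0.55],
]
print(mean_fst(toy_panel))
```

A panel enriched for markers like the first toy SNP will show a higher mean FST than one full of markers like the second, which is all the abstract's comparison is claiming.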
Let's be clear: the "most powerful tool available for assessing population structure and migration history" is whole-genome sequencing. The Genographic Project, which represents a large fraction of the global spending on its type of population genetics research, unnecessarily hobbled itself from the outset in hopes of pre-emptively appeasing rent-seeking, shrill, self-appointed advocates for "indigenous peoples". I don't think Spencer Wells and company thought they were giving up much, since the short-sighted original plan was to examine only uniparental markers. In that light, perhaps we can be thankful that they've come up with a way of sidestepping the restrictions they placed on themselves and generating at least some useful autosomal data.
Several steps were taken to ensure that the genetic results would not be exploited for pharmaceutical, medical, and biotechnology purposes. First, participant samples were maintained in a completely anonymous status during GenoChip analysis. Second, no phenotypic or medical data were collected from the participants. Third, we included only SNPs in noncoding regions without any known functional association, as reported in dbSNP build 132. Lastly, we filtered our SNP collection against a 1.5 million SNP data set containing all variants that have potential, known, or suspected associations with diseases.
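The third and fourth steps amount to two exclusion filters applied to the candidate SNP list. A rough sketch, assuming each candidate carries annotation flags and the disease-associated set is available as a collection of rsIDs; the `Snp` record, its field names, and the rsIDs are hypothetical and do not reflect the authors' actual pipeline or the dbSNP schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rsid: str
    coding: bool      # hypothetical flag: SNP lies in a coding region
    functional: bool  # hypothetical flag: SNP has a known functional association


def filter_snps(candidates, disease_rsids):
    """Keep noncoding SNPs with no known function that are absent from the disease set."""
    disease_rsids = set(disease_rsids)  # set gives constant-time membership checks
    return [
        snp
        for snp in candidates
        if not snp.coding
        and not snp.functional
        and snp.rsid not in disease_rsids
    ]


# Made-up identifiers for illustration only.
candidates = [
    Snp("rsA", coding=False, functional=False),
    Snp("rsB", coding=True, functional=False),
    Snp("rsC", coding=False, functional=True),
    Snp("rsD", coding=False, functional=False),
]
kept = filter_snps(candidates, disease_rsids={"rsD"})
print([snp.rsid for snp in kept])
```

Note that both filters only ever shrink the panel: each one discards markers that might otherwise have been informative about ancestry.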
But however they'd like to spin it, there's nothing ideal about ignoring "functional" variation or limiting the number of SNPs tested. Razib has a bizarre post up at his Discover blog in which he confuses SNP ascertainment and "Ancestry Informative Marker" ascertainment, and I see that the authors of the paper themselves appear to be eliding the distinction. But the overwhelming majority of the "450 populations" from which "AIMs" were "ascertained" for the GenoChip had merely been typed on existing microarrays -- which goes no way towards addressing the issue the Affymetrix Human Origins array was designed to address (putting together SNP panels with known ascertainment, starting by sequencing individuals from multiple populations). Ultimately, the most useful and complete picture of human genetic history will come from whole-genome sequencing, which should be cheap enough within a few years for use by the Genographic Project. The question is: have they permanently handicapped themselves from applying the actual best tool for their stated mission, or will we eventually see at least some whole-genome data for their 75,000 indigenous samples (no doubt with, at minimum, coding regions redacted)?
