Frequently Asked Questions
Whole-Genome Patterns of Common DNA Variation
in Three Human Populations
David A. Hinds, Laura L. Stuve, Geoffrey B. Nilsen, Eran Halperin,
Eleazar Eskin, Dennis B. Ballinger, Kelly A. Frazer, David R. Cox
Are there specific genetic markers that can tell a scientist what
race a person belongs to?
Recent work has shown that while there clearly are gradients in allele
frequencies that are associated with geographical origin, there
is no evidence for sharp boundaries that can be used to assign people
to groups that correspond to "races" (Serre and Paabo, 2004). Our
data do not shed much light on this issue. While we sampled
individuals from three self-described populations, and observed that
by integrating data from many markers we could distinguish between
these groups, the discrete structure we saw largely reflects the fact
that we chose individuals whose ancestors came from very distant parts
of the world. Our ability to group these 71 individuals does not mean
that we could equally easily distinguish among all other individuals
with the same self-described ancestry, or distinguish them from other
human populations. It is not even clear that the question is well
formed, because "race" does not have a clear scientific
interpretation.
Where can I obtain the gene annotation data corresponding to your
analyses?
NCBI's Build 34.3 annotation data is archived
here. Specifically, we used the "gene.q.gz" and "seq_gene.md.gz"
tables. The chromosomal positions in these files are consistent with
the ones in our supplementary tables.
The Version 2 browser is missing some analyses from Version 1.
Will they be updated for Build 36?
Our priority for the Version 2 browser was to make available the most
useful elements of Perlegen's public datasets without duplicating
other public genome resources.
The Version 1 browser will be preserved for archival purposes but
that dataset and those analyses are less useful in the context of
more recent work.
You seem to be missing many SNPs that are present in dbSNP. Why
is that?
Between Perlegen's data and the HapMap data, the Version 2 browser
includes a substantial proportion of dbSNP, but by no means all of it.
The Version 1 browser only covers the ~1.5 million SNPs for which we
released data in our 2005 Science paper. The browsers are
provided to facilitate use of these Perlegen datasets, and while we
have imported some additional annotations from public sources, they
are no substitute for full-featured browsers like NCBI's Map Viewer or
the UCSC Genome Browser.
How do I open the supplementary data tables?
The supplementary data files are Unix-style compressed (gzip) plain
text. To uncompress them on Windows, use a tool like WinZip
(http://www.winzip.com). Alternatively, on essentially any operating
system, use the command-line program gzip (http://www.gzip.org). On
Windows, the uncompressed files can be viewed with most programs other
than Notepad; WordPad works fine, and the files can be imported into
Excel.
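For readers who prefer a script to a desktop tool, the following is a minimal sketch of reading one of the compressed tables directly with Python's standard gzip module, without uncompressing it to disk first. The function name and the file name in the usage note are hypothetical, and the UTF-8 encoding is an assumption (plain ASCII files read the same way).

```python
import gzip
import itertools


def preview_gzip_text(path, n_lines=5, encoding="utf-8"):
    """Return the first n_lines of a gzip-compressed text file as strings.

    The file is opened in text mode ("rt"), so gzip decompression and
    newline handling happen on the fly and only the requested lines
    are read.
    """
    with gzip.open(path, mode="rt", encoding=encoding) as handle:
        # islice stops after n_lines, so large tables are not fully read.
        return [line.rstrip("\n") for line in itertools.islice(handle, n_lines)]
```

For example, `preview_gzip_text("genotypes.txt.gz", 3)` would return the first three lines of a file with that (hypothetical) name, which is a quick way to inspect the header line before importing a table elsewhere.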
The genotype data for the Y chromosome seems to be truncated.
Why is that?
The file is not truncated. It only includes genotype columns for the
33 male individuals, who are listed in the header on the first line.
What algorithms were used to determine linkage disequilibrium bins
and haplotype blocks?
The algorithms are described in the Supporting Online Material
accompanying our paper in Science, available
here.
How can I identify tagging SNPs for the linkage disequilibrium bins?
In the Version 1 browser, the detail view for an LD bin includes a
"tagging" value for each SNP, which has a value of 1 for SNPs that tag
that bin. The complete data is included in the LD map data tables, in
the data download section of the web site. In the Version 2 browser,
we show "footprint maps" for the particular set of tags chosen for use
in our work with the Genetic Association Information Network (GAIN).
In the supplementary tables of FST estimates, why are
some values outside the range of 0 to 1?
We computed FST using an unbiased small-sample estimator
from Weir and Cockerham (1984). More specifically, we used their
formula for the "random union of gametes" on the top of p. 1363. In
order to be unbiased, the expected value of the estimator across
multiple draws from the same population must equal the true value of
FST. When the true FST is close to 0, the
estimated value from a small sample will vary around that value, and
hence will sometimes be negative.
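The effect can be illustrated with a small simulation. The sketch below is not the Weir and Cockerham formula used for the supplementary tables; it substitutes a simpler bias-corrected ratio estimator of the kind usually attributed to Hudson, chosen only because it is short. The allele frequency, sample size, and number of simulated SNPs are arbitrary assumptions. Both samples are drawn from the same allele frequency, so the true FST is 0.

```python
import random


def hudson_fst(p1, n1, p2, n2):
    """Bias-corrected FST estimate for one biallelic SNP.

    p1, p2 are sample allele frequencies; n1, n2 are the numbers of
    sampled chromosomes. This is a stand-in estimator for illustration,
    NOT the Weir and Cockerham (1984) formula.
    """
    # Subtracting the sampling-variance terms makes the numerator
    # unbiased for the squared difference in true allele frequencies.
    numerator = ((p1 - p2) ** 2
                 - p1 * (1 - p1) / (n1 - 1)
                 - p2 * (1 - p2) / (n2 - 1))
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator


def simulate_null(true_freq=0.3, n_chrom=48, n_snps=2000, seed=1):
    """Estimate FST at many SNPs when both samples share one frequency."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_snps):
        # Draw two independent samples from the same population.
        p1 = sum(rng.random() < true_freq for _ in range(n_chrom)) / n_chrom
        p2 = sum(rng.random() < true_freq for _ in range(n_chrom)) / n_chrom
        if p1 * (1 - p2) + p2 * (1 - p1) == 0:
            continue  # monomorphic in both samples: estimate undefined
        estimates.append(hudson_fst(p1, n_chrom, p2, n_chrom))
    return estimates


if __name__ == "__main__":
    est = simulate_null()
    negative = sum(e < 0 for e in est)
    print(f"mean estimate: {sum(est) / len(est):.4f}")
    print(f"negative estimates: {negative} of {len(est)}")
```

Under these assumptions the per-SNP estimates scatter around 0, and a large share of them fall below it. Truncating negative values to 0 would push the average above the true value, which is why an unbiased estimator has to allow them.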