Table of Contents
Note S1 – Data set
Note S2 – Ancestry estimates
Note S3 – Ancestry Subtraction to address European and African admixture
Note S4 – Masking segments of non-Native ancestry
Note S5 – Correlation of genetic diversity with distance from the Bering Strait
Note S6 – Documentation of at least three streams of Asian gene flow into America
Note S7 – Modeling the peopling of America
Figure S1 – Sampling locations of 17 Siberian populations
Figure S2 – Masking of segments of non-Native ancestry
Figure S3 – Trees are consistent for masked and unadmixed samples
Figure S4 – Admixture Graphs are consistent for masked and unadmixed samples
Figure S5 – Heterozygosity and distance from the Bering Strait
Table S1 – Summary data for 52 Native American populations
Table S2 – Summary data for 17 Siberian populations
Table S3 – Individual data for 493 Native American samples
WWW.NATURE.COM/NATURE | 1
SUPPLEMENTARY INFORMATION
doi:10.1038/nature11258
Note S1
Data set
(i) Merging data from seven sources
We merged seven sets of samples genotyped on Illumina SNP arrays. The number of samples
we started with from each population (prior to the final data curation detailed below) is
summarized in Table S1.1. Datasets other than the one obtained for this study were pre-
filtered by other researchers or in previous rounds of data curation carried out by the authors.
Table S1.1: Illumina genotyping data sets that we merged for this analysis
Name of dataset N* Comments
“This study” (American and Siberian) 343 Genotyping was performed on Illumina 610-Quad arrays using a combination of genomic and whole genome amplified DNA. The genotyping was performed at the Broad Institute, with the exception of 10 of the 15 Chipewyan samples genotyped at McGill. The initial dataset was pre-filtered to eliminate samples that were genotyped twice, where genotypes were inconsistent with a DNA fingerprint, or where the call rate was <90% (later filters raised this to <95%). We restricted to autosomal SNPs, and removed SNPs with call rate <95% or no physical position.
“Kidd” (American and Siberian) 154 Genotyping was performed on Illumina 650Y arrays.
“MGDP” (Mexican; ref. 1) 83 Genotyping was performed on Illumina HumanHap550 V3.0 arrays. We restricted to individuals inferred to be unrelated up to 2nd degree relatives.
“DiRienzo” (Siberian) 63 Genotyping was performed on either Illumina 610-Quad arrays (Nganasan and Yukaghir) or Illumina 650Y arrays (Naukan and Chukchi) (ref. 2).
“Willerslev” (Arctic) 142 Genotyping was performed on Illumina 650Y arrays (ref. 3). We included all samples from ref. 3 except the Na-Dene, which did not have permissions appropriate for this study. We then excluded the Yukaghir and Naukan, where so many samples were lost in initial data curation that we removed the whole sample.
“HapMap3” (Worldwide) 799 Genotyping was performed on Illumina 1M and Affymetrix 6.0 arrays (ref. 4). (The Illumina 1M contains essentially all the SNPs in the Illumina 610-Quad array, so we are effectively using the Illumina 1M data from the HapMap3 genotyping.) We removed the Masai (MKK), which had a PCA pattern showing high within-population relatedness.
“CEPH-HGDP” (Worldwide) 907 Genotyping was performed on Illumina 650Y arrays (ref. 5). We restricted to individuals inferred to be unrelated up to second degree relatives prior to carrying out the additional data curation steps reported below (ref. 6).
* The sample size quoted here is what we analyzed prior to the final data curation steps reported below.
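The call-rate thresholds quoted in Table S1.1 can be illustrated with a minimal sketch. This is not the pipeline used for the study: the function names, the use of None for a missing call, and the order of operations (samples first, then SNPs among the retained samples) are assumptions made for illustration only.

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls (None marks a missing call)."""
    return sum(c is not None for c in calls) / len(calls)


def filter_by_call_rate(matrix, min_sample_rate=0.95, min_snp_rate=0.95):
    """Return (kept_sample_indices, kept_snp_indices).

    matrix: list of samples, each a list of genotype calls (one per SNP).
    Samples are filtered first; SNP call rates are then computed among
    the retained samples only (an assumed ordering, not from the source).
    """
    kept_samples = [i for i, row in enumerate(matrix)
                    if call_rate(row) >= min_sample_rate]
    if not kept_samples:
        return [], []
    n_snps = len(matrix[0])
    kept_snps = [j for j in range(n_snps)
                 if call_rate([matrix[i][j] for i in kept_samples]) >= min_snp_rate]
    return kept_samples, kept_snps


# Toy data: 3 samples x 4 SNPs, with relaxed thresholds so the effect is visible.
toy = [[0, 1, 2, 0],
       [None, None, 1, 0],
       [1, 1, None, 2]]
kept_samples, kept_snps = filter_by_call_rate(toy, 0.75, 0.75)
```

With real array data the defaults of 0.95 correspond to the <95% thresholds mentioned in the table.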
(ii) Curation of Native American samples
Our curation excluded samples that genotyped poorly or that had an unusual genetic
background relative to other samples from the same population. We first ran the HAPMIX
local ancestry inference software (Note S4) to identify segments of the genome in Native
Americans and Siberians that may harbor West Eurasian or African ancestry. We then treated
the genotypes in these segments as if they were missing data. This “masking” allowed us to
better analyze the samples that had some recent European or African ancestry. The estimates
of European and African ancestry, and proportion of the genome that was masked, are
presented by population in Table S1 for Native Americans and Table S2 for Siberians. The
individual ancestry estimates for the Native American samples are presented in Table S3.
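As an illustration of what treating genotypes as missing means in practice, the following sketch masks the SNPs that fall inside flagged segments for one individual on one chromosome. The segment intervals would come from a local ancestry method such as HAPMIX; here they are supplied by hand, and the function names and the use of None as the missing marker are hypothetical.

```python
MISSING = None  # hypothetical marker for a masked (missing) genotype


def mask_segments(genotypes, positions, segments):
    """Return a copy of the genotypes with SNPs inside any segment set to missing.

    genotypes: allele counts (0/1/2), one per SNP
    positions: physical positions of the same SNPs on one chromosome
    segments:  (start, end) intervals flagged as possibly non-Native ancestry
    """
    masked = list(genotypes)
    for i, pos in enumerate(positions):
        if any(start <= pos <= end for start, end in segments):
            masked[i] = MISSING
    return masked


def masked_fraction(masked):
    """Proportion of SNPs that were masked."""
    return sum(g is MISSING for g in masked) / len(masked)


genotypes = [0, 1, 2, 1, 0, 2]
positions = [100, 200, 300, 400, 500, 600]
masked = mask_segments(genotypes, positions, [(250, 450)])
```

The masked fraction computed this way is analogous to, but not necessarily identical with, the per-population proportions reported in Tables S1 and S2.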
We applied the following filters to remove 114 Native American samples from the dataset:
(1) 18 samples were removed due to a high missing genotype rate
We required that every sample had a genotyping missing data rate of <5%.
(2) 32 samples were removed due to a high proportion of West Eurasian or African mixture
We removed samples with <22% of their genomes inferred to have both alleles of entirely
Native American ancestry based on the masking analysis of Note S4. The only exception
was in Aleutian Islanders where this would have removed all of the samples.
(3) 44 samples were removed due to excess or deficiency of heterozygotes vs. expectation
All the Karitiana from the Kidd genotyping had a significant excess of heterozygous
genotypes compared with the allele frequency computed in the same samples (violation of
Hardy-Weinberg equilibrium). We removed these samples. We also removed a handful of
additional samples due to heterozygote excess or deficiency.
(4) 10 samples were removed due to evidence of being at least a 2nd degree relative to others
It has already been reported that the Surui sample contained relatives (ref. 6). For all pairs of
individuals in all populations that had evidence for >22% of their genome being shared,
we removed one of the pair (in general we chose to remove the one with more missing
data). For this purpose, we used SMARTREL, part of the EIGENSOFT package (ref. 7).
(5) 5 samples were removed due to a noisy local ancestry analysis
A total of 5 samples showed a strong mismatch between the ADMIXTURE-based
estimate of European and African ancestry proportion (Note S2), and the proportion of
the genome that was masked based on HAPMIX local ancestry analysis (Note S4). Visual
inspection of the HAPMIX-based local ancestry inference for these 5 showed a noisy
baseline ancestry inference compared with other individuals from the same populations,
with narrow spikes of potential (but non-confident) non-Native American ancestry, which
we interpreted as evidence for poor genotyping. We removed these samples.
(6) 5 samples were removed as PCA outliers relative to others from the same population
To identify samples that had unusual genotyping properties relative to others from their
own populations we used Principal Component Analysis (PCA) as implemented in
EIGENSOFT (ref. 7). The outlier removal was based on the masked data (Note S4). To ensure
that we were not removing samples simply because they had high proportions of their
genome masked, we filled in missing data for each SNP based on the mean allele
frequency of other samples in the same population (the filled-in data was only used in
outlier removal; not for analyses of history). We performed outlier removal restricting to
populations with at least 3 samples (outlier removal is impossible with fewer samples),
and divided the populations into four groupings to make visual inspection tractable:
northern North Americans, Meso-Americans, northern South Americans, and southern
South Americans. We iteratively removed samples that were outliers relative to others
from the same population on significant eigenvectors, until the samples appeared
homogeneous. Aleuts were not included in outlier removal, as masking left almost none
of their genome; however, we did remove one Aleut who, from local ancestry analysis,
appeared to have one chromosome from unadmixed, non-Aleut Native Americans.
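The fill-in step described under filter (6) can be sketched as follows. This is a simplified illustration, not the code used for the study: masked genotypes are represented as None, and each is replaced by the mean allele count of the other samples from the same population at that SNP (the mean allele count is twice the mean allele frequency). The function name and data layout are assumptions.

```python
def fill_missing_with_population_mean(genotypes, populations):
    """Replace missing genotypes by the same-population mean at each SNP.

    genotypes:   list of samples, each a list of allele counts (0/1/2) or None
    populations: population label for each sample
    Returns a new matrix; entries stay None if no sample in the population
    has an observed genotype at that SNP.
    """
    n_snps = len(genotypes[0])
    filled = [list(row) for row in genotypes]
    for pop in set(populations):
        members = [i for i, p in enumerate(populations) if p == pop]
        for j in range(n_snps):
            observed = [genotypes[i][j] for i in members
                        if genotypes[i][j] is not None]
            if not observed:
                continue  # nothing to fill from
            mean = sum(observed) / len(observed)
            for i in members:
                if filled[i][j] is None:
                    filled[i][j] = mean
    return filled


toy = [[0, None], [2, 1], [None, 2], [1, 1]]
labels = ["A", "A", "A", "B"]
filled = fill_missing_with_population_mean(toy, labels)
```

Because a filled value sits at the population mean, it contributes nothing to within-population deviation at that SNP, which is why heavily masked samples are not flagged as outliers merely for being masked.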
After data curation, the number of Native Americans in the merged dataset was 493 (Table
S1.2 reports the number of samples removed by population). Importantly, the data curation
procedure was based on searching for individuals that were outliers with respect to their own
population. Thus, if our curation introduces bias, it would be to make populations more
homogeneous; we do not expect it to bias inferences of relationships among populations.
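As a quick consistency check on the numbers above, the six filter counts can be summed and subtracted from the starting total of 607 given in the title of Table S1.2. The dictionary keys are shorthand labels introduced here, not terms from the curation software.

```python
# Samples removed by each of the six filters, as listed in the text.
removed = {
    "high missing genotype rate": 18,
    "high West Eurasian or African mixture": 32,
    "heterozygote excess or deficiency": 44,
    "2nd degree or closer relatives": 10,
    "noisy local ancestry analysis": 5,
    "PCA outliers": 5,
}

starting_samples = 607  # from the title of Table S1.2
total_removed = sum(removed.values())
remaining = starting_samples - total_removed
print(total_removed, remaining)
```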
Table S1.2: Record of Native American data curation: filtering from 607 to 493 samples
Population Study Before After Population Study Before After Population Study Before After
Aleutian Willerslev 9 8 Guarani This 9 6 Piapoco HGDP 7 7
Algonquin This 5 5 Guaymi This 5 5 Pima HGDP/Kidd 46 33
Arara This 2 1 Huetar This 2 1 Purepecha This 1 1
Arhuaco This 6 5 Hulliche This 4 4 Quechua This 41 40
Aymara This 24 23 Inga This 13 9 Surui HGDP/Kidd 30 24
Bribri This 4 4 Jamamadi This 2 1 Tepehuano MGDP 27 25
Cabecar This 32 31 Kaingang This 2 2 Teribe This 3 3
Chane This 2 2 Kaqchikel This 18 13 Ticuna This 6 6
Chilote This 10 8 Karitiana HGDP/Kidd 34 13 Toba This 5 4
Chipewyan This 15 15 Kogi This 6 4 Waunana This 5 3
Chono This 4 4 Maleku This 4 3 Wayuu This 17 11
Chorotega This 1 1 Maya1&2 HGDP/MGDP 56 49 WGInuit Willerslev 8 8
Cree This 5 4 Mixe This 20 17 Wichi This 5 5
Diaguita This 5 5 Mixtec This 5 5 Yaghan This 4 4
EGInuit Willerslev 7 7 Ojibwa This 5 5 Yaqui This 1 1
Embera This 6 5 Palikur This 3 3 Zapotec1&2 This/MGDP 59 43
Guahibo This 13 6 Parakana This 4 1
* The Maya and Zapotec are broken into two subgroups for our analyses in the paper (e.g. Maya1 and Maya2).
Table S1.3: Record of Siberian data curation: filtering from 264 to 245 samples
Population Study Before After Population Study Before After
Altaian Willerslev 13 12 Mongolian Willerslev 9 8
Buryat Willerslev 18 17 Naukan DiRienzo 16 16
Chukchi DiRienzo/Willerslev 30 30 Nganasan1&2 DiRienzo/Willerslev 24 22
Dolgan Willerslev 6 4 Selkup Willerslev 9 9
Evenki Willerslev 15 15 Tundra_Nentsi This 4 3
Ket Willerslev 2 2 Tuvinians Willerslev 16 15
Khanty Kidd 39 35 Yakut HGDP/Kidd 40 34
Koryak Willerslev 10 10 Yukaghir DiRienzo 13 13
* The Nganasan are broken into two subgroups for our analyses in the paper (Nganasan1 and Nganasan2).
(iv) Curation of Siberian data
We performed a similar analysis in the Siberian populations. This resulted in 17 Siberian
populations, after splitting the Nganasan into two based on the two sources of the samples
(Willerslev and DiRienzo; the structure was correlated to the sample source, suggesting that
these two studies may have sampled different subgroups of the same population). We do not
report on the Naukan and Yukaghir populations from the Willerslev dataset in Table S1.3
because so few samples were left from each after outlier removal; we thus removed these
populations entirely from the analysis. Table S1.3 summarizes the filtering by population:
• 2 samples were removed due to evidence of being at least a 2nd degree relative to others.
• 17 samples were removed due to being outliers in PCA relative to their own population.
(v) Curation of non-Native American, non-Siberian data
We also performed PCA to remove outlier samples from non-Native American and non-
Siberian populations. We removed the entire MKK population (Masai from Kenya, from
HapMap3; ref. 4) because of many statistically significant eigenvectors that were difficult to
interpret. We also removed 6 other outlier samples. We started from previously filtered
datasets, and hence the number of samples prior to filtering reported in Table S1.1 is
sometimes less than that in the papers that originally reported the data.
(vi) Merging and splitting of populations
Four populations were genotyped both by the Kidd and CEPH-HGDP studies but were
known to be from the same original sample collection: Yakut, Karitiana, Surui and Pima. We
removed the Kidd Karitiana data because of evidence for heterozygote excess (see above).
The two Surui, two Pima, and two Yakut samples were indistinguishable based on PCA, and
hence we merged them. The labels we used for the merged data from these populations are:
“Pima” (Kidd Pima and the CEPH-HGDP Pima)
“Surui” (Kidd Surui and CEPH-HGDP Surui)
“Yakut” (Kidd Yakut and CEPH-HGDP Yakut)
We also merged data from the Chukchi and Quechua because the data we had available from
different sources were indistinguishable in PCA:
“Chukchi” (Willerslev Chukchi and DiRienzo Chukchi)
“Quechua” (Quechua data from this study and Kidd Quechua)
There were 4 populations for which data were available from two different sources, and for
which we kept populations separate based on the source of the samples. We kept the samples
separate either because these population samples have been traditionally analyzed separately
(for example HapMap3 YRI and HGDP Yoruba), or because we observed differences
between the two sources of samples from these populations in PCA (which could reflect
genuine population substructure, so we did not want to merge the samples):
Yoruba (“Yoruba” from HGDP; “YRI” from HapMap3)
Mongolian (“Mongolian” from Willerslev; “Mongola” from HGDP)
Nganasan (“Nganasan1” from Willerslev; “Nganasan2” from DiRienzo)
Zapotec (“Zapotec1” from this study; “Zapotec2” from MGDP)
Finally, PCA showed population substructure in the Maya that did not neatly break down
according to the sample source (HGDP or MGDP). This may reflect real substructure: the
Maya in MGDP were sampled at multiple sites. We therefore repartitioned as follows:
Maya (“Maya1” from HGDP and MGDP; “Maya2” from MGDP)
(vii) Removal of SNPs with inconsistent or potentially problematic genotyping
After merging data for all populations, we curated SNPs as follows:
(1) 16 SNPs were removed due to an excess or deficiency of heterozygous genotypes
6 SNPs in the data collected specifically for this study, 6 in the Kidd data, 3 in the
Willerslev data, and 1 in the CEPH-HGDP data, showed an extreme excess or deficiency
of heterozygotes compared with expectation given the frequency in their populations
(their chi-square statistics were visual outliers from the tail).
(2) 16 SNPs were removed due to inconsistency in frequency across data sets
For all SNPs, we compared the frequency across populations of similar ancestry. We
found 9 SNPs from the genotyping for this study, 6 from HapMap3, and 1 in MGDP,
which were consistently more differentiated from the other data sets than expected from
the tail of the chi-square distribution, suggesting genotyping error. We removed them.
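The heterozygote excess or deficiency check in step (1) can be illustrated with a standard one-degree-of-freedom chi-square test against Hardy-Weinberg proportions. The sketch below handles one SNP in one population; the function name is hypothetical, and the outlier cutoff actually applied in the curation (judged visually from the tail of the distribution) is not reproduced here.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for deviation from Hardy-Weinberg proportions.

    n_aa, n_ab, n_bb: observed genotype counts at one SNP in one population.
    Returns (chi2, het_excess), where het_excess is observed minus expected
    heterozygote count (positive = excess, negative = deficiency).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2, n_ab - expected[1]


balanced = hwe_chi_square(25, 50, 25)   # matches expectation exactly
all_hets = hwe_chi_square(0, 100, 0)    # extreme heterozygote excess
```

A very large statistic with positive excess, as in the second call, is the kind of pattern that would point to a genotyping artifact rather than a feature of the population.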
(viii) Final datasets
After curation, we had 2,351 samples and 364,470 autosomal SNPs from 52 Native
American, 17 Siberian, and 57 other populations. The average genotyping completeness was
99.88% per sample. The final datasets are listed in Table S1.4. The “unmasked” dataset
reflects only the data curation steps described above. The “masked” dataset was obtained
based on the results of running HAPMIX to define segments of potential African or West
Eurasian ancestry due to admixture in the last few hundred years; SNPs in such segments
were then treated as missing (Note S4). All datasets are available on request.
Table S1.4: Six datasets generated for this study
Name Samples SNPs Notes
Unmasked 2,351 364,470 All data
Masked 2,351 364,470 All masked data
unmasked.unadmixed 2,021 364,470 Individuals with no evidence of recent admixture
unmasked.saqqaq 2,352 68,131 All data*
masked.saqqaq 2,352 68,131 All masked data*
unmasked.unadmixed.saqqaq 2,021 68,131 Individuals with no evidence of recent admixture*
Note: All files are in the EIGENSOFT “packedancestrymap” format.
* These files are merged with genotypes that were previously published based on whole-genome sequencing
data from a Saqqaq Paleo-Eskimo individual from Greenland (ref. 3).
References for Note S1
1 Silva-Zolezzi I, Hidalgo-Miranda A, Estrada-Gil J, Fernandez-Lopez JC, Uribe-Figueroa L, Contreras A,
Balam-Ortiz E, del Bosque-Plata L, Velazquez-Fernandez D, Lara C, Goya R, Hernandez-Lemus E, Davila C,
Barrientos E, March S, Jimenez-Sanchez G. Analysis of genomic diversity in Mexican Mestizo
populations to develop genomic medicine in Mexico. Proc Natl Acad Sci USA 106, 8611-8616 (2009).
2 Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G,
Pritchard JK, Coop G, Di Rienzo A. Adaptations to climate-mediated selective pressures in humans.
PLoS Genet. 7, e1001375 (2011).
3 Rasmussen M. et al. Ancient human genome sequence of an extinct Palaeo-Eskimo. Nature 463, 757-762
(2010).
4 International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human
populations. Nature 467, 52-58 (2010).
5 Li J.Z. et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319,
1100-1104 (2008).
6 Rosenberg NA. Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel,
accounting for atypical and duplicated samples and pairs of close relatives. Ann Hum Genet. 70, 841-847
(2006).
7 Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
Note S2
Ancestry estimates
Many of the Native American samples in this study have inherited some European and
African genes since 1492. We used the ADMIXTURE clustering software to estimate the
proportion of European and African ancestry in each individual (ref. 1). Following the
recommendations of the user manual, prior to running the software we thinned the data until
there were no pairs of polymorphisms that had allelic association of r^2 > 0.1, resulting in
88,079 SNPs.
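The thinning step can be sketched as a greedy pass over SNPs in genomic order. This is only an illustration of the idea: it uses the squared correlation of genotype counts as a stand-in for allelic r^2, compares each SNP only against the most recently retained SNPs (the window size is an assumed parameter), and is not the procedure or software actually used for the study. In practice a dedicated tool would normally do this.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as unlinked
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def thin_by_ld(snps, threshold=0.1, window=50):
    """Return indices of SNPs retained after greedy LD thinning.

    snps: list of per-SNP genotype vectors (allele counts across samples),
          in genomic order. A SNP is kept only if its r^2 with each of the
          last `window` retained SNPs is at most `threshold`.
    """
    kept = []
    for j, snp in enumerate(snps):
        if all(r_squared(snp, snps[k]) <= threshold for k in kept[-window:]):
            kept.append(j)
    return kept


toy_snps = [
    [0, 0, 2, 2],  # kept
    [0, 0, 2, 2],  # identical to SNP 0 (r^2 = 1): dropped
    [0, 2, 0, 2],  # uncorrelated with SNP 0 (r^2 = 0): kept
    [2, 2, 0, 0],  # perfectly anti-correlated with SNP 0 (r^2 = 1): dropped
]
retained = thin_by_ld(toy_snps)
```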
We ran ADMIXTURE on the thinned dataset searching for k=2, 3, 4, 5 and 6 clusters. We
restricted the analysis to populations that we judged were particularly relevant to learning
about Native American population history:
• All Native American populations from this study.
• 5 Siberian populations chosen to be geographically relatively close to the Bering Strait or to
the Arctic and to cluster in PCA with little evidence of recent mixture (Naukan, Chukchi,
Koryak, Nganasan1 and Nganasan2)
• 6 European ancestry populations (French, Italian, Sardinian, Russian, CEU and TSI)
• 3 Niger-Kordofanian speaking, sub-Saharan African populations (Yoruba, YRI and LWK)
For each cluster number (k=2, 3, 4, 5 and 6), we identified the cluster most correlated to
African and European population membership. The assignment to European and African
clusters was extremely highly correlated for k=4 and k=5 (Figure S2.1). The only
discrepancies between the k=4 and k=5 ancestry estimates are for European ancestry in
Nganasan1 and Nganasan2, and thus we did not use the Nganasan in analyses that relied on
ADMIXTURE ancestry estimates (in these analyses, we represented Siberians by the
Naukan, Chukchi and Koryak only). In contrast, the estimates for k=3 were more weakly
correlated to higher cluster numbers (Figure S2.1).
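One way to picture the step of identifying the cluster most correlated with European or African population membership is the sketch below. It assumes a matrix of per-individual cluster proportions (as a clustering run would output) and a 0/1 indicator of membership in the reference populations; the function names and toy values are hypothetical, and the source does not specify the exact correlation measure used.

```python
def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def most_correlated_cluster(q_matrix, is_member):
    """Return (cluster_index, correlations).

    q_matrix:  per-individual lists of cluster proportions (k columns)
    is_member: 0/1 per individual, 1 if from the reference populations
    """
    k = len(q_matrix[0])
    corrs = [pearson([row[c] for row in q_matrix], is_member) for c in range(k)]
    best = max(range(k), key=lambda c: corrs[c])
    return best, corrs


# Toy example with k=3: the first two individuals are from reference populations.
q = [[0.9, 0.1, 0.0],
     [0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]]
european = [1, 1, 0, 0]
best_cluster, correlations = most_correlated_cluster(q, european)
```

Once the matching cluster is identified for each k, the per-individual proportions in that cluster can be compared across values of k, which is the comparison summarized in Figure S2.1.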
Figure S2.1: ADMIXTURE
European and African ancestry
estimates compared across k=3-5
clusters. We ran ADMIXTURE on
samples from all Native American, 5
Siberian, 6 European and 3 sub-
Saharan African populations. We
plot the components most strongly
correlated with European and
African ancestr