Reich2012 Reconstructing Native American population history-s1 (1)

Table of Contents

Note S1 – Data set
Note S2 – Ancestry estimates
Note S3 – Ancestry Subtraction to address European and African admixture
Note S4 – Masking segments of non-Native ancestry
Note S5 – Correlation of genetic diversity with distance from the Bering Strait
Note S6 – Documentation of at least three streams of Asian gene flow into America
Note S7 – Modeling the peopling of America

Figure S1 – Sampling locations of 17 Siberian populations
Figure S2 – Masking of segments of non-Native ancestry
Figure S3 – Trees are consistent for masked and unadmixed samples
Figure S4 – Admixture Graphs are consistent for masked and unadmixed samples
Figure S5 – Heterozygosity and distance from the Bering Strait

Table S1 – Summary data for 52 Native American populations
Table S2 – Summary data for 17 Siberian populations
Table S3 – Individual data for 493 Native American samples

WWW.NATURE.COM/NATURE | SUPPLEMENTARY INFORMATION doi:10.1038/nature11258

Note S1 Data set

(i) Merging data from seven sources
We merged seven sets of samples genotyped on Illumina SNP arrays. The number of samples we started with from each dataset (prior to the final data curation detailed below) is summarized in Table S1.1. Datasets other than the one obtained for this study were pre-filtered by other researchers or in previous rounds of data curation carried out by the authors.

Table S1.1: Illumina genotyping data sets that we merged for this analysis

Name of dataset | N* | Comments
"This study" (American and Siberian) | 343 | Genotyping was performed on Illumina 610-Quad arrays using a combination of genomic and whole genome amplified DNA. The genotyping was performed at the Broad Institute, with the exception of 10 of the 15 Chipewyan samples, which were genotyped at McGill. The initial dataset was pre-filtered to eliminate samples that were genotyped twice, where genotypes were inconsistent with a DNA fingerprint, or where the call rate was <90% (later filters raised this to <95%). We restricted to autosomal SNPs, and removed SNPs with call rate <95% or no physical position.
"Kidd" (American and Siberian) | 154 | Genotyping was performed on Illumina 650Y arrays.
"MGDP" (Mexican; ref. 1) | 83 | Genotyping was performed on Illumina HumanHap550 V3.0 arrays. We restricted to individuals inferred to be unrelated up to 2nd degree relatives.
"DiRienzo" (Siberian) | 63 | Genotyping was performed on either Illumina 610-Quad arrays (Nganasan and Yukaghir) or Illumina 650Y arrays (Naukan and Chukchi) (ref. 2).
"Willerslev" (Arctic) | 142 | Genotyping was performed on Illumina 650Y arrays (ref. 3). We included all samples from ref. 3 except the Na-Dene, which did not have permissions appropriate for this study. We then excluded the Yukaghir and Naukan from this dataset, because so many individuals were lost in initial data curation that we removed these population samples entirely.
"HapMap3" (Worldwide) | 799 | Genotyping was performed on Illumina 1M and Affymetrix 6.0 arrays (ref. 4). (The Illumina 1M contains essentially all the SNPs in the Illumina 610-Quad array, so we are effectively using the Illumina 1M data from the HapMap3 genotyping.) We removed the Masai (MKK), which had a PCA pattern showing high within-population relatedness.
"CEPH-HGDP" (Worldwide) | 907 | Genotyping was performed on Illumina 650Y arrays (ref. 5). We restricted to individuals inferred to be unrelated up to second degree relatives (ref. 6) prior to carrying out the additional data curation steps reported below.

* The sample size quoted here is what we analyzed prior to the final data curation steps reported below.

(ii) Curation of Native American samples
Our curation excluded samples that genotyped poorly or that had an unusual genetic background relative to other samples from the same population.
We first ran the HAPMIX local ancestry inference software (Note S4) to identify segments of the genome in Native Americans and Siberians that may harbor West Eurasian or African ancestry. We then treated the genotypes in these segments as if they were missing data. This "masking" allowed us to better analyze the samples that had some recent European or African ancestry. The estimates of European and African ancestry, and the proportion of the genome that was masked, are presented by population in Table S1 for Native Americans and Table S2 for Siberians. The individual ancestry estimates for the Native American samples are presented in Table S3.

We applied the following filters to remove 114 Native American samples from the dataset:

(1) 18 samples were removed due to a high missing genotype rate
We required that every sample had a genotyping missing data rate of <5%.

(2) 32 samples were removed due to a high proportion of West Eurasian or African mixture
We removed samples with <22% of their genomes inferred to have both alleles of entirely Native American ancestry based on the masking analysis of Note S4. The only exception was in Aleutian Islanders, where this would have removed all of the samples.

(3) 44 samples were removed due to excess or deficiency of heterozygotes vs. expectation
All the Karitiana from the Kidd genotyping had a significant excess of heterozygous genotypes compared with the expectation from the allele frequencies computed in the same samples (a violation of Hardy-Weinberg equilibrium). We removed these samples. We also removed a handful of additional samples due to heterozygote excess or deficiency.

(4) 10 samples were removed due to evidence of being at least a 2nd degree relative of others
It has already been reported that the Surui sample contained relatives (ref. 6).
For all pairs of individuals in all populations that had evidence for >22% of their genome being shared, we removed one of the pair (in general we chose to remove the one with more missing data). For this purpose, we used SMARTREL, part of the EIGENSOFT package (ref. 7).

(5) 5 samples were removed due to a noisy local ancestry analysis
A total of 5 samples showed a strong mismatch between the ADMIXTURE-based estimate of European and African ancestry proportion (Note S2) and the proportion of the genome that was masked based on HAPMIX local ancestry analysis (Note S4). Visual inspection of the HAPMIX-based local ancestry inference for these 5 showed a noisy baseline ancestry inference compared with other individuals from the same populations, with narrow spikes of potential (but non-confident) non-Native American ancestry, which we interpreted as evidence for poor genotyping. We removed these samples.

(6) 5 samples were removed as PCA outliers relative to others from the same population
To identify samples that had unusual genotyping properties relative to others from their own populations, we used Principal Component Analysis (PCA) as implemented in EIGENSOFT (ref. 7). The outlier removal was based on the masked data (Note S4). To ensure that we were not removing samples simply because they had high proportions of their genome masked, we filled in missing data for each SNP based on the mean allele frequency of other samples in the same population (the filled-in data were used only in outlier removal, not for analyses of history). We performed outlier removal restricting to populations with at least 3 samples (outlier removal is impossible with fewer samples), and divided the populations into four groupings to make visual inspection tractable: northern North Americans, Meso-Americans, northern South Americans, and southern South Americans.
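
The fill-in step described above (replacing masked or missing genotypes with a population mean before PCA) can be illustrated with a minimal sketch. This is not the authors' code: the function name, the 0/1/2 genotype coding with `None` for missing, and the fallback of 0.0 when a SNP is missing in every sample are all assumptions made for illustration.

```python
from typing import List, Optional

# Hypothetical coding: 0, 1 or 2 copies of one allele; None marks a missing or masked call.
Genotype = Optional[int]


def fill_missing_with_population_mean(samples: List[List[Genotype]]) -> List[List[float]]:
    """Fill each missing genotype with the mean genotype of the samples
    from the same population that were observed at that SNP."""
    if not samples:
        return []
    n_snps = len(samples[0])
    filled = [[0.0] * n_snps for _ in samples]
    for j in range(n_snps):
        observed = [row[j] for row in samples if row[j] is not None]
        # Assumed fallback: if no sample is observed at this SNP, use 0.0.
        mean = sum(observed) / len(observed) if observed else 0.0
        for i, row in enumerate(samples):
            filled[i][j] = float(row[j]) if row[j] is not None else mean
    return filled


# One population, three samples (rows), three SNPs (columns).
population = [
    [2, None, 0],
    [1, 1, None],
    [None, 2, 0],
]
filled = fill_missing_with_population_mean(population)
```

Because a sample with a missing call contributes nothing to the mean at that SNP, the filled value is automatically the average over the other samples, so a heavily masked sample is pulled toward its population centroid rather than away from it.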
We iteratively removed samples that were outliers relative to others from the same population on significant eigenvectors, until the samples appeared homogeneous. Aleuts were not included in outlier removal, as masking left almost none of their genome; however, we did remove one Aleut who, based on local ancestry analysis, appeared to have one chromosome from unadmixed, non-Aleut Native Americans.

After data curation, the number of Native Americans in the merged dataset was 493 (Table S1.2 reports the number of samples before and after curation by population). Importantly, the data curation procedure was based on searching for individuals that were outliers with respect to their own population. Thus, if our curation introduces bias, it would be to make populations more homogeneous; we do not expect it to bias inferences of relationships among populations.

Table S1.2: Record of Native American data curation: filtering from 607 to 493 samples

Population | Study | Before | After
Aleutian | Willerslev | 9 | 8
Algonquin | This | 5 | 5
Arara | This | 2 | 1
Arhuaco | This | 6 | 5
Aymara | This | 24 | 23
Bribri | This | 4 | 4
Cabecar | This | 32 | 31
Chane | This | 2 | 2
Chilote | This | 10 | 8
Chipewyan | This | 15 | 15
Chono | This | 4 | 4
Chorotega | This | 1 | 1
Cree | This | 5 | 4
Diaguita | This | 5 | 5
EGInuit | Willerslev | 7 | 7
Embera | This | 6 | 5
Guahibo | This | 13 | 6
Guarani | This | 9 | 6
Guaymi | This | 5 | 5
Huetar | This | 2 | 1
Hulliche | This | 4 | 4
Inga | This | 13 | 9
Jamamadi | This | 2 | 1
Kaingang | This | 2 | 2
Kaqchikel | This | 18 | 13
Karitiana | HGDP/Kidd | 34 | 13
Kogi | This | 6 | 4
Maleku | This | 4 | 3
Maya1&2 | HGDP/MGDP | 56 | 49
Mixe | This | 20 | 17
Mixtec | This | 5 | 5
Ojibwa | This | 5 | 5
Palikur | This | 3 | 3
Parakana | This | 4 | 1
Piapoco | HGDP | 7 | 7
Pima | HGDP/Kidd | 46 | 33
Purepecha | This | 1 | 1
Quechua | This | 41 | 40
Surui | HGDP/Kidd | 30 | 24
Tepehuano | MGDP | 27 | 25
Teribe | This | 3 | 3
Ticuna | This | 6 | 6
Toba | This | 5 | 4
Waunana | This | 5 | 3
Wayuu | This | 17 | 11
WGInuit | Willerslev | 8 | 8
Wichi | This | 5 | 5
Yaghan | This | 4 | 4
Yaqui | This | 1 | 1
Zapotec1&2 | This/MGDP | 59 | 43

* The Maya and Zapotec are broken into two subgroups for our analyses in the paper (e.g. Maya1 and Maya2).

Table S1.3: Record of Siberian data curation: filtering from 264 to 245 samples

Population | Study | Before | After
Altaian | Willerslev | 13 | 12
Buryat | Willerslev | 18 | 17
Chukchi | DiRienzo/Willerslev | 30 | 30
Dolgan | Willerslev | 6 | 4
Evenki | Willerslev | 15 | 15
Ket | Willerslev | 2 | 2
Khanty | Kidd | 39 | 35
Koryak | Willerslev | 10 | 10
Mongolian | Willerslev | 9 | 8
Naukan | DiRienzo | 16 | 16
Nganasan1&2 | DiRienzo/Willerslev | 24 | 22
Selkup | Willerslev | 9 | 9
Tundra_Nentsi | This | 4 | 3
Tuvinians | Willerslev | 16 | 15
Yakut | HGDP/Kidd | 40 | 34
Yukaghir | DiRienzo | 13 | 13

* The Nganasan are broken into two subgroups for our analyses in the paper (Nganasan1 and Nganasan2).

(iv) Curation of Siberian data
We performed a similar analysis in the Siberian populations. This resulted in 17 Siberian populations, after splitting the Nganasan into two based on the two sources of the samples (Willerslev and DiRienzo; the structure was correlated to the sample source, suggesting that these two studies may have sampled different subgroups of the same population). We do not report on the Naukan and Yukaghir populations from the Willerslev dataset in Table S1.3 because so few samples were left from each after outlier removal; we thus removed these populations entirely from the analysis. Table S1.3 summarizes the filtering by population:
• 2 samples were removed due to evidence of being at least a 2nd degree relative of others.
• 17 samples were removed due to being outliers in PCA relative to their own population.

(v) Curation of non-Native American, non-Siberian data
We also performed PCA to remove outlier samples from non-Native American and non-Siberian populations. We removed the entire MKK population (Masai from Kenya from HapMap3; ref. 4) because of many statistically significant eigenvectors that were difficult to interpret. We also removed 6 other outlier samples.
We started from previously filtered datasets, and hence the number of samples prior to filtering reported in Table S1.1 is sometimes less than that in the papers that originally reported the data.

(vi) Merging and splitting of populations
Four populations were genotyped both by the Kidd and CEPH-HGDP studies but were known to be from the same original sample collection: Yakut, Karitiana, Surui and Pima. We removed the Kidd Karitiana data because of evidence for heterozygote excess (see above). The two Surui, two Pima, and two Yakut samples were indistinguishable based on PCA, and hence we merged them. The labels we used for the merged data from these populations are:
"Pima" (Kidd Pima and CEPH-HGDP Pima)
"Surui" (Kidd Surui and CEPH-HGDP Surui)
"Yakut" (Kidd Yakut and CEPH-HGDP Yakut)

We also merged data from the Chukchi and Quechua because the data we had available from different sources were indistinguishable in PCA:
"Chukchi" (Willerslev Chukchi and DiRienzo Chukchi)
"Quechua" (Quechua data from this study and Kidd Quechua)

There were 4 populations for which data were available from two different sources, and for which we kept populations separate based on the source of the samples.
We kept the samples separate either because these population samples have traditionally been analyzed separately (for example HapMap3 YRI and HGDP Yoruba), or because we observed differences between the two sources of samples from these populations in PCA (which could reflect genuine population substructure, so we did not want to merge the samples):
Yoruba ("Yoruba" from HGDP; "YRI" from HapMap3)
Mongolian ("Mongolian" from Willerslev; "Mongola" from HGDP)
Nganasan ("Nganasan1" from Willerslev; "Nganasan2" from DiRienzo)
Zapotec ("Zapotec1" from this study; "Zapotec2" from MGDP)

Finally, PCA showed population substructure in the Maya that did not neatly break down according to the sample source (HGDP or MGDP). This may reflect real substructure: the Maya in MGDP were sampled at multiple sites. We therefore repartitioned as follows:
Maya ("Maya1" from HGDP and MGDP; "Maya2" from MGDP)

(vii) Removal of SNPs with inconsistent or potentially problematic genotyping
After merging data for all populations, we curated SNPs as follows:

(1) 16 SNPs were removed due to an excess or deficiency of heterozygous genotypes
6 SNPs in the data collected specifically for this study, 6 in the Kidd data, 3 in the Willerslev data, and 1 in the CEPH-HGDP data showed an extreme excess or deficiency of heterozygotes compared with expectation given the frequency in their populations (their chi-square statistics were visual outliers from the tail).

(2) 16 SNPs were removed due to inconsistency in frequency across data sets
For all SNPs, we compared the frequency across populations of similar ancestry. We found 9 SNPs from the genotyping for this study, 6 from HapMap3, and 1 in MGDP that were consistently more differentiated from the other data sets than expected from the tail of the chi-square distribution, suggesting genotyping error. We removed them.
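
The heterozygote check in step (1) compares observed genotype counts with the counts expected under Hardy-Weinberg equilibrium. A minimal sketch of the standard one-locus chi-square statistic follows; the function name and the returned tuple are illustrative assumptions, and the source does not give a numeric cutoff (SNPs were flagged as visual outliers in the tail), so none is applied here.

```python
from typing import Tuple


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Return (chi-square statistic, observed minus expected heterozygote count)
    for one SNP in one population, given counts of the three genotypes."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0, 0.0  # monomorphic SNP: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, n_ab - expected[1]


# Genotype counts that match Hardy-Weinberg proportions exactly.
balanced = hwe_chi_square(25, 50, 25)
# Same allele frequency, but with a strong heterozygote excess.
excess = hwe_chi_square(10, 80, 10)
```

A positive second value indicates heterozygote excess and a negative one indicates deficiency, which is the distinction drawn in the text.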
(viii) Final datasets
After curation, we had 2,351 samples and 364,470 autosomal SNPs from 52 Native American, 17 Siberian, and 57 other populations. The average genotyping completeness was 99.88% per sample. The final datasets are listed in Table S1.4. The "unmasked" dataset reflects only the data curation steps described above. The "masked" dataset was obtained based on the results of running HAPMIX to define segments of potential African or West Eurasian ancestry due to admixture in the last few hundred years; SNPs in such segments were then treated as missing (Note S4). All datasets are available on request.

Table S1.4: Six datasets generated for this study

Name | Samples | SNPs | Notes
Unmasked | 2,351 | 364,470 | All data
Masked | 2,351 | 364,470 | All masked data
unmasked.unadmixed | 2,021 | 364,470 | Individuals with no evidence of recent admixture
unmasked.saqqaq | 2,352 | 68,131 | All data*
masked.saqqaq | 2,352 | 68,131 | All masked data*
unmasked.unadmixed.saqqaq | 2,021 | 68,131 | Individuals with no evidence of recent admixture*

Note: All files are in the EIGENSOFT "packedancestrymap" format.
* These files are merged with genotypes that were previously published based on whole-genome sequencing data from a Saqqaq Paleo-Eskimo individual from Greenland (ref. 3).

References for Note S1

1 Silva-Zolezzi I, Hidalgo-Miranda A, Estrada-Gil J, Fernandez-Lopez JC, Uribe-Figueroa L, Contreras A, Balam-Ortiz E, del Bosque-Plata L, Velazquez-Fernandez D, Lara C, Goya R, Hernandez-Lemus E, Davila C, Barrientos E, March S, Jimenez-Sanchez G. Analysis of genomic diversity in Mexican Mestizo populations to develop genomic medicine in Mexico. Proc Natl Acad Sci USA 106, 8611-8616 (2009).
2 Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7, e1001375 (2011).
3 Rasmussen M. et al. Ancient human genome sequence of an extinct Palaeo-Eskimo. Nature 463, 757-762 (2010).
4 International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52-58 (2010).
5 Li J.Z. et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319, 1100-1104 (2008).
6 Rosenberg NA. Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann Hum Genet. 70, 841-847 (2006).
7 Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).

Note S2 Ancestry estimates

Many of the Native American samples in this study have inherited some European and African genes since 1492. We used the ADMIXTURE clustering software to estimate the proportion of European and African ancestry in each individual (ref. 1). Following the recommendations of the user manual, prior to running the software we thinned the data until there were no pairs of polymorphisms that had allelic association of r² > 0.1, resulting in 88,079 SNPs. We ran ADMIXTURE on the thinned dataset searching for k=2, 3, 4, 5 and 6 clusters. We restricted the analysis to populations that we judged were particularly relevant to learning about Native American population history:
• All Native American populations from this study.
• 5 Siberian populations chosen to be geographically relatively close to the Bering Strait or to the Arctic and to cluster in PCA with little evidence of recent mixture (Naukan, Chukchi, Koryak, Nganasan1 and Nganasan2)
• 6 European ancestry populations (French, Italian, Sardinian, Russian, CEU and TSI)
• 3 Niger-Kordofanian speaking, sub-Saharan African populations (Yoruba, YRI and LWK)

For each cluster number (k=2, 3, 4, 5 and 6), we identified the cluster most correlated to African and European population membership. The assignment to European and African clusters was extremely highly correlated for k=4 and k=5 (Figure S2.1). The only discrepancies between the k=4 and k=5 ancestry estimates are for European ancestry in Nganasan1 and Nganasan2, and thus we did not use the Nganasan in analyses that relied on ADMIXTURE ancestry estimates (in these analyses, we represented Siberians by the Naukan, Chukchi and Koryak only). In contrast, the estimates for k=3 were more weakly correlated to higher cluster numbers (Figure S2.1).

Figure S2.1: ADMIXTURE European and African ancestry estimates compared across k=3-5 clusters. We ran ADMIXTURE on samples from all Native American, 5 Siberian, 6 European and 3 sub-Saharan African populations. We plot the components most strongly correlated with European and African ancestr
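
The thinning step described earlier in this Note (no remaining pair of SNPs with r² > 0.1) can be illustrated with a small greedy sketch. This is an assumption-laden toy, not the procedure in the ADMIXTURE manual or the authors' pipeline: it uses the squared Pearson correlation of 0/1/2 genotype counts as a stand-in for allelic association, compares every SNP against all SNPs already kept rather than within a sliding window, and assumes no missing data. The names `r_squared` and `thin_snps` are hypothetical.

```python
from typing import List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype vectors (one value per sample)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP is uncorrelated with everything
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def thin_snps(genotypes: List[Sequence[float]], max_r2: float = 0.1) -> List[int]:
    """Greedily keep SNP indices so that no kept pair has r^2 above max_r2.
    genotypes[j] holds the genotypes of SNP j across all samples."""
    kept: List[int] = []
    for j, snp in enumerate(genotypes):
        if all(r_squared(snp, genotypes[k]) <= max_r2 for k in kept):
            kept.append(j)
    return kept


snps = [
    [0, 1, 2, 0, 1, 2],  # SNP 0
    [0, 1, 2, 0, 1, 2],  # SNP 1: identical to SNP 0, so r^2 = 1
    [0, 2, 0, 2, 0, 2],  # SNP 2: uncorrelated with SNP 0
]
kept = thin_snps(snps)
```

The all-pairs comparison is quadratic in the number of SNPs, so it is only practical for small examples; genome-scale thinning is normally restricted to nearby SNPs.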