1
|
ngsJulia: population genetic analysis of next-generation DNA sequencing data with Julia language. F1000Res 2023; 11:126. [PMID: 37745626 PMCID: PMC10514575 DOI: 10.12688/f1000research.104368.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 06/29/2023] [Indexed: 09/26/2023] Open
Abstract
A sound analysis of DNA sequencing data is important to extract meaningful information and infer quantities of interest. Sequencing and mapping errors coupled with low and variable coverage hamper the identification of genotypes and variants and the estimation of population genetic parameters. Methods and implementations to estimate population genetic parameters from sequencing data available nowadays either are suitable for the analysis of genomes from model organisms only, require moderate sequencing coverage, or are not easily adaptable to specific applications. To address these issues, we introduce ngsJulia, a collection of templates and functions in the Julia language to process short-read sequencing data for population genetic analysis. We further describe two implementations, ngsPool and ngsPloidy, for the analysis of pooled sequencing data and polyploid genomes, respectively. Through simulations, we illustrate the performance of estimating various population genetic parameters using these implementations, using both established and novel statistical methods. These results inform on optimal experimental design and demonstrate the applicability of methods in ngsJulia to estimate parameters of interest even from low coverage sequencing data. ngsJulia provides users with a flexible and efficient framework for ad hoc analysis of sequencing data. ngsJulia is available from: https://github.com/mfumagalli/ngsJulia.
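To make the idea of working from low-coverage reads concrete, the following is a minimal sketch of a textbook genotype-likelihood calculation for a biallelic site at arbitrary ploidy. It is written in Python for illustration and is not taken from ngsJulia; the function name, the uniform per-base error rate, and the example reads are all assumptions.

```python
import math

def genotype_log_likelihoods(bases, ref, alt, ploidy=2, error=0.01):
    """Log-likelihood of each alt-allele dosage (0..ploidy) given observed bases.

    Assumes a biallelic site and a single error rate shared by all reads
    (an illustrative simplification, not a published model).
    """
    logliks = []
    for k in range(ploidy + 1):
        frac_alt = k / ploidy  # expected fraction of reads carrying alt
        total = 0.0
        for b in bases:
            # Probability of observing base b if the read came from ref or alt
            p_given_ref = 1 - error if b == ref else error / 3
            p_given_alt = 1 - error if b == alt else error / 3
            total += math.log((1 - frac_alt) * p_given_ref + frac_alt * p_given_alt)
        logliks.append(total)
    return logliks

# Four reads at one site: three carry the reference base, one the alternate
lls = genotype_log_likelihoods(list("AACA"), ref="A", alt="C", ploidy=2)
best = max(range(len(lls)), key=lls.__getitem__)  # most likely alt dosage
print(best, [round(x, 2) for x in lls])
```

With so few reads the likelihoods for neighbouring dosages can be close, which is why methods for low-coverage data tend to carry the full likelihood vector forward rather than a single hard genotype call.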
|
2
|
Genomic signatures of local adaptation in recent invasive Aedes aegypti populations in California. BMC Genomics 2023; 24:311. [PMID: 37301847 DOI: 10.1186/s12864-023-09402-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2022] [Accepted: 05/23/2023] [Indexed: 06/12/2023] Open
Abstract
BACKGROUND Rapid adaptation to new environments can facilitate species invasions and range expansions. Understanding the mechanisms of adaptation used by invasive disease vectors in new regions has key implications for mitigating the prevalence and spread of vector-borne disease, although they remain relatively unexplored. RESULTS Here, we integrate whole-genome sequencing data from 96 Aedes aegypti mosquitoes collected from various sites in southern and central California with 25 annual topo-climate variables to investigate genome-wide signals of local adaptation among populations. Patterns of population structure, as inferred using principal components and admixture analysis, were consistent with three genetic clusters. Using various landscape genomics approaches, which all remove the confounding effects of shared ancestry on correlations between genetic and environmental variation, we identified 112 genes showing strong signals of local environmental adaptation associated with one or more topo-climate factors. Some of these genes, such as those encoding heat-shock proteins, have known effects in climate adaptation, and these genomic regions show signals of selective sweeps and recent positive selection. CONCLUSIONS Our results provide a genome-wide perspective on the distribution of adaptive loci and lay the foundation for future work to understand how environmental adaptation in Ae. aegypti impacts the arboviral disease landscape and how such adaptation could help or hinder efforts at population control.
|
3
|
Contrasting Phylogeographic Patterns of Mitochondrial and Genome-Wide Variation in the Groundwater Amphipod Crangonyx islandicus That Survived the Ice Age in Iceland. DIVERSITY 2023. [DOI: 10.3390/d15010088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2023]
Abstract
The analysis of phylogeographic patterns has often been based on mitochondrial DNA variation, but recent analyses dealing with nuclear DNA have in some instances revealed mito-nuclear discordances and complex evolutionary histories. These enigmatic scenarios, which may involve stochastic lineage sorting, ancestral hybridization, past dispersal and secondary contacts, are increasingly scrutinized with a new generation of genomic tools such as RADseq, which also poses additional analytical challenges. Here, we revisited the previously inconclusive phylogeographic history, showing the mito-nuclear discordance of an endemic groundwater amphipod from Iceland, Crangonyx islandicus, which is the only metazoan known to have survived the Pleistocene beneath the glaciers. Previous studies based on three DNA markers documented a mitochondrial scenario with the main divergence occurring between populations in northern Iceland and an ITS scenario with the main divergence between the south and north. We used double digest restriction-site-associated DNA sequencing (ddRADseq) to clarify this mito-nuclear discordance by applying several statistical methods while estimating the sensitivity to different analytical approaches (data-type, differentiation indices and base call uncertainty). A majority of nuclear markers and methods support the ITS divergence. Nevertheless, a more complex scenario emerges, possibly involving introgression led by male-biased dispersal among northern locations or mitochondrial capture, which may have been further strengthened by natural selection.
|
4
|
The impact of sequencing depth and relatedness of the reference genome in population genomic studies: A case study with two caddisfly species (Trichoptera, Rhyacophilidae, Himalopsyche). Ecol Evol 2022; 12:e9583. [PMID: 36523526 PMCID: PMC9745013 DOI: 10.1002/ece3.9583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2022] [Revised: 11/10/2022] [Accepted: 11/16/2022] [Indexed: 12/15/2022] Open
Abstract
Whole genome sequencing for generating SNP data is increasingly used in population genetic studies. However, obtaining genomes for massive numbers of samples is still not within the budgets of many researchers. It is thus imperative to select an appropriate reference genome and sequencing depth to ensure the accuracy of the results for a specific research question, while balancing cost and feasibility. To evaluate the effect of the choice of the reference genome and sequencing depth on downstream analyses, we used five confamilial reference genomes of variable relatedness and three levels of sequencing depth (3.5×, 7.5× and 12×) in a population genomic study on two caddisfly species: Himalopsyche digitata and H. tibetana. Using these 30 datasets (five reference genomes × three depths × two target species), we estimated population genetic indices (inbreeding coefficient, nucleotide diversity, pairwise FST, and genome-wide distribution of FST) based on variants and population structure (PCA and admixture) based on genotype likelihood estimates. The results showed that both distantly related reference genomes and lower sequencing depth lead to degradation of resolution. In addition, choosing a more closely related reference genome may significantly remedy the defects caused by low depth. Therefore, we conclude that population genetic studies would benefit from closely related reference genomes, especially as the costs of obtaining a high-quality reference genome continue to decrease. However, to determine a cost-efficient strategy for a specific population genomic study, a trade-off between reference genome relatedness and sequencing depth can be considered.
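The factorial design described in this abstract (five references, three depths, two species) can be enumerated in a few lines. This is only an illustrative sketch: the reference genome labels are hypothetical placeholders, since the abstract does not name them.

```python
from itertools import product

# Placeholder labels; the abstract does not name the five reference genomes
references = [f"reference_{i}" for i in range(1, 6)]
depths = ["3.5x", "7.5x", "12x"]
species = ["Himalopsyche digitata", "H. tibetana"]

# Full factorial design: every reference x depth x species combination
datasets = list(product(references, depths, species))
print(len(datasets))  # 5 * 3 * 2 combinations
```

Enumerating the design this way makes it easy to drive the same downstream analysis over every combination and to check that none is missing.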
|
5
|
Evidence of hard‐selective sweeps suggests independent adaptation to insecticides in Colorado potato beetle (Coleoptera: Chrysomelidae) populations. Evol Appl 2022; 15:1691-1705. [DOI: 10.1111/eva.13498] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2021] [Revised: 10/07/2022] [Accepted: 10/11/2022] [Indexed: 12/01/2022] Open
|
6
|
Demographic history shapes genomic variation in an intracellular parasite with a wide geographic distribution. Mol Ecol 2022; 31:2528-2544. [DOI: 10.1111/mec.16419] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/04/2021] [Revised: 02/14/2022] [Accepted: 02/28/2022] [Indexed: 11/27/2022]
|
7
|
A holobiont view of island biogeography: Unravelling patterns driving the nascent diversification of a Hawaiian spider and its microbial associates. Mol Ecol 2021; 31:1299-1316. [PMID: 34861071 DOI: 10.1111/mec.16301] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/09/2020] [Revised: 11/16/2021] [Accepted: 11/18/2021] [Indexed: 12/24/2022]
Abstract
The diversification of a host lineage can be influenced by both the external environment and its assemblage of microbes. Here, we use a young lineage of spiders, distributed along a chronologically arranged series of volcanic mountains, to investigate how their associated microbial communities have changed as the spiders colonized new locations. Using the stick spider Ariamnes waikula (Araneae, Theridiidae) on the island of Hawai'i, and outgroup taxa on older islands, we tested whether each component of the "holobiont" (spider hosts, intracellular endosymbionts and gut microbial communities) showed correlated signatures of diversity due to sequential colonization from older to younger volcanoes. To investigate this, we generated ddRAD data for the host spiders and 16S rRNA gene amplicon data from their microbiota. We expected sequential colonizations to result in a (phylo)genetic structuring of the host spiders and in a diversity gradient in microbial communities. The results showed that the host A. waikula is indeed structured by geographical isolation, suggesting sequential colonization from older to younger volcanoes. Similarly, the endosymbiont communities were markedly different between Ariamnes species on different islands, but more homogeneous among A. waikula populations on the island of Hawai'i. Conversely, the gut microbiota, which we suspect is generally environmentally derived, was largely conserved across all populations and species. Our results show that different components of the holobiont respond in distinct ways to the dynamic environment of the volcanic archipelago. This highlights the necessity of understanding the interplay between different components of the holobiont, to properly characterize its evolution.
|
8
|
Effective double-digest RAD sequencing and genotyping despite large genome size. Mol Ecol Resour 2021; 21:1037-1055. [PMID: 33351289 DOI: 10.1111/1755-0998.13314] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 05/15/2020] [Revised: 12/03/2020] [Accepted: 12/14/2020] [Indexed: 11/28/2022]
Abstract
Obtaining informative data is the ambition of any genomic project, but in nonmodel species with very large genomes, pursuing such a goal requires surmounting a series of analytical challenges. Double-digest RAD sequencing is routinely used in nonmodel organisms and offers some control over the volume of data obtained. However, the volume of data recovered is not always an indication of the reliability of data sets, and quality checks are necessary to ensure that true and artefactual information is set apart. In the present study, we aim to fill the gap existing between the known applicability of RAD sequencing methods in plants with large genomes and the use of the retrieved loci for population genetic inference. By analysing two populations of Cypripedium calceolus, a nonmodel orchid species with a large genome size (1C ~ 31.6 Gbp), we provide a complete workflow from library preparation to bioinformatic filtering and inference of genetic diversity and differentiation. We show how filtering strategies to dismiss potentially misleading data need to be explored and adapted to data set-specific features. Moreover, we suggest that the occurrence of organellar sequences in libraries should not be neglected when planning the experiment and analysing the results. Finally, we explain how, in the absence of prior information about the genome of the species, seeking high standards of quality during library preparation and sequencing can provide an insurance against unpredicted technical or biological constraints.
|
9
|
The pitfalls and virtues of population genetic summary statistics: Detecting selective sweeps in recent divergences. J Evol Biol 2020; 34:893-909. [DOI: 10.1111/jeb.13738] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/15/2020] [Revised: 10/22/2020] [Accepted: 10/24/2020] [Indexed: 12/12/2022]
|
10
|
Development of diagnostic SNP markers for quality assurance and control in sweetpotato [Ipomoea batatas (L.) Lam.] breeding programs. PLoS One 2020; 15:e0232173. [PMID: 32330201 PMCID: PMC7182229 DOI: 10.1371/journal.pone.0232173] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 11/11/2019] [Accepted: 04/08/2020] [Indexed: 11/19/2022] Open
Abstract
Quality assurance and control (QA/QC) is an essential element of a breeding program's optimization efforts towards increased genetic gains. Due to auto-hexaploid genome complexity, a low-cost marker platform for routine QA/QC in sweetpotato breeding programs is still unavailable. We used 662 parents of the International Potato Center (CIP)'s global breeding program spanning Peru, Uganda, Mozambique and Ghana, to develop a low-density highly informative single nucleotide polymorphism (SNP) marker set to be deployed for routine QA/QC. Segregation of the selected 30 SNPs (two SNPs per base chromosome) in a recombined breeding population was evaluated using 282 progeny from some of the parents above. The progeny were replicated from in-vitro, screenhouse and field, and the selected SNP-set was confirmed to identify relatively similar mislabeling error rates as a high density SNP-set of 10,159 markers. Six additional trait-specific markers were added to the selected SNP set from previous quantitative trait loci mapping studies. The 36-SNP set will be deployed for QA/QC in breeding pipelines and in fingerprinting of advanced clones or released varieties to monitor genetic gains in farmers' fields. The study also enabled evaluation of CIP's global breeding population structure and the effect of some of the most devastating stresses like sweetpotato virus disease on genetic variation management. These results will inform future deployment of genomic selection in sweetpotato.
|
11
|
How "simple" methodological decisions affect interpretation of population structure based on reduced representation library DNA sequencing: A case study using the lake whitefish. PLoS One 2020; 15:e0226608. [PMID: 31978053 PMCID: PMC6980518 DOI: 10.1371/journal.pone.0226608] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/25/2019] [Accepted: 12/01/2019] [Indexed: 12/30/2022] Open
Abstract
Reduced representation (RRL) sequencing approaches (e.g., RADSeq, genotyping by sequencing) require decisions about how much to invest in genome coverage and sequencing depth, as well as choices of values for adjustable bioinformatics parameters. To empirically explore the importance of these “simple” methodological decisions, we generated two independent sequencing libraries for the same 142 individual lake whitefish (Coregonus clupeaformis) using a nextRAD RRL approach: (1) a larger number of loci at low sequencing depth based on a 9mer (library A); and (2) fewer loci at higher sequencing depth based on a 10mer (library B). The fish were selected from populations with different levels of expected genetic subdivision. Each library was analyzed using the STACKS pipeline followed by three types of population structure assessment (FST, DAPC and ADMIXTURE) with iterative increases in the stringency of sequencing depth and missing data requirements, as well as more specific a priori population maps. Library B was always able to resolve strong population differentiation in all three types of assessment regardless of the selected parameters, largely due to retention of more loci in analyses. In contrast, library A produced more variable results; increasing the minimum sequencing depth threshold (-m) resulted in a reduced number of retained loci, and therefore lost resolution at high -m values for FST and ADMIXTURE, but not DAPC. When detecting fine population differentiation, the population map influenced the number of loci and missing data, which generated artefacts in all downstream analyses tested. Similarly, when examining fine scale population subdivision, library B was robust to changing parameters but library A lost resolution depending on the parameter set. We used library B to examine actual subdivision in our study populations. All three types of analysis found complete subdivision among populations in Lake Huron, ON and Dore Lake, SK, Canada using 10,640 SNP loci. Weak population subdivision was detected in Lake Huron with fish from sites in the north-west, Search Bay, North Point and Hammond Bay, showing slight differentiation. Overall, we show that apparently simple decisions about library construction and bioinformatics parameters can have important impacts on the interpretation of population subdivision. Although potentially more costly on a per-locus basis, early investment in striking a balance between the number of loci and sequencing effort is well worth the reduced genomic coverage for population genetics studies. More conservative stringency settings on STACKS parameters led to a final dataset that was more consistent and robust when examining both weak and strong population differentiation. Overall, we recommend that researchers approach “simple” methodological decisions with caution, especially when working on non-model species for the first time.
|
12
|
Empirical design of a variant quality control pipeline for whole genome sequencing data using replicate discordance. Sci Rep 2019; 9:16156. [PMID: 31695094 PMCID: PMC6834861 DOI: 10.1038/s41598-019-52614-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/15/2019] [Accepted: 10/18/2019] [Indexed: 12/29/2022] Open
Abstract
The success of next-generation sequencing depends on the accuracy of variant calls. Few objective protocols exist for QC following variant calling from whole genome sequencing (WGS) data. After applying QC filtering based on Genome Analysis Tool Kit (GATK) best practices, we used genotype discordance of eight samples that were sequenced twice each to evaluate the proportion of potentially inaccurate variant calls. We designed a QC pipeline involving hard filters to improve replicate genotype concordance, which indicates improved accuracy of genotype calls. Our pipeline analyzes the efficacy of each filtering step. We initially applied this strategy to well-characterized variants from the ClinVar database, and subsequently to the full WGS dataset. The genome-wide biallelic pipeline removed 82.11% of discordant and 14.89% of concordant genotypes, and improved the concordance rate from 98.53% to 99.69%. The variant-level read depth filter most improved the genome-wide biallelic concordance rate. We also adapted this pipeline for triallelic sites, given the increasing proportion of multiallelic sites as sample sizes increase. For triallelic sites containing only SNVs, the concordance rate improved from 97.68% to 99.80%. Our QC pipeline removes many potentially false positive calls that pass in GATK, and may inform future WGS studies prior to variant effect analysis.
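The reported biallelic figures can be cross-checked with a short calculation. The sketch below assumes the two removal percentages apply to the discordant and concordant genotype pools defined by the starting concordance rate; the function name is illustrative and not taken from the paper.

```python
def concordance_after_filter(rate_before, frac_discordant_removed, frac_concordant_removed):
    """Concordance rate among the genotypes that survive filtering."""
    concordant = rate_before * (1 - frac_concordant_removed)
    discordant = (1 - rate_before) * (1 - frac_discordant_removed)
    return concordant / (concordant + discordant)

# Values quoted in the abstract for the genome-wide biallelic pipeline
after = concordance_after_filter(0.9853, 0.8211, 0.1489)
print(f"{after:.2%}")
```

Under that assumption the result comes out at roughly 99.69%, in line with the improved concordance rate the abstract reports.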
|
13
|
Host and geography together drive early adaptive radiation of Hawaiian planthoppers. Mol Ecol 2019; 28:4513-4528. [PMID: 31484218 DOI: 10.1111/mec.15231] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/14/2018] [Revised: 08/19/2019] [Accepted: 08/27/2019] [Indexed: 11/30/2022]
Abstract
The interactions between insects and their plant host have been implicated in driving diversification of both players. Early arguments highlighted the role of ecological opportunity, with the idea that insects "escape and radiate" on new hosts, with subsequent hypotheses focusing on the interplay between host shifting and host tracking, coupled with isolation and fusion, in generating diversity. Because it is rarely possible to capture the initial stages of diversification, it is particularly difficult to ascertain the relative roles of geographic isolation versus host shifts in initiating the process. The current study examines genetic diversity between populations and hosts within a single species of endemic Hawaiian planthopper, Nesosydne umbratica (Hemiptera, Delphacidae). Given that the species was known as a host generalist occupying unrelated hosts, Clermontia (Campanulaceae) and Pipturus (Urticaceae), we set out to determine the relative importance of geography and host in structuring populations in the early stages of differentiation on the youngest islands of the Hawaiian chain. Results from extensive exon capture data showed that N. umbratica is highly structured, both by geography, with discrete populations on each volcano, and by host plant, with parallel radiations on Clermontia and Pipturus leading to extensive co-occurrence. The marked genetic structure suggests that populations can readily become established on novel hosts provided opportunity; subsequent adaptation allows monopolization of the new host. The results support the role of geographic isolation in structuring populations, with host shifts occurring as discrete events that facilitate subsequent parallel geographic range expansion.
|
14
|
The presence and impact of reference bias on population genomic studies of prehistoric human populations. PLoS Genet 2019; 15:e1008302. [PMID: 31348818 PMCID: PMC6685638 DOI: 10.1371/journal.pgen.1008302] [Citation(s) in RCA: 96] [Impact Index Per Article: 19.2] [Received: 01/15/2019] [Revised: 08/07/2019] [Accepted: 07/10/2019] [Indexed: 11/18/2022] Open
Abstract
Haploid high quality reference genomes are an important resource in genomic research projects. A consequence is that DNA fragments carrying the reference allele will be more likely to map successfully, or receive higher quality scores. This reference bias can have effects on downstream population genomic analysis when heterozygous sites are falsely considered homozygous for the reference allele. In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate with low sequencing coverages and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp, reducing the amount of accepted mismatches, and increasing the probability of multiple matching sites in the genome. These ancient DNA specific properties are potentially exacerbating the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudo-haploid data, i.e. they randomly sample only one sequencing read per site. We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric humans with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Most genomic regions we investigated show little to no mapping bias but even a small proportion of sites with bias can impact analyses of those particular loci or slightly skew genome-wide estimates. Therefore, reference bias has the potential to cause minor but significant differences in the results of downstream analyses such as population allele sharing, heterozygosity estimates and estimates of archaic ancestry. These spurious results highlight how important it is to be aware of these technical artifacts and that we need strategies to mitigate the effect. Therefore, we suggest some post-mapping filtering strategies to resolve reference bias which help to reduce its impact substantially.
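The way pseudo-haploid sampling carries a mapping asymmetry into the called alleles can be illustrated with a small simulation. This is a toy sketch, not the authors' analysis: the mapping probabilities, depth, and function name are assumptions chosen only to show the direction of the effect at heterozygous sites.

```python
import random

def pseudo_haploid_ref_fraction(n_sites, depth, p_map_ref, p_map_alt, seed=0):
    """Fraction of heterozygous sites called as reference under pseudo-haploid sampling."""
    rng = random.Random(seed)
    ref_calls = 0
    called = 0
    for _ in range(n_sites):
        mapped = []
        for _ in range(depth):
            # At a heterozygous site each fragment carries ref or alt with equal chance
            allele = "ref" if rng.random() < 0.5 else "alt"
            p_map = p_map_ref if allele == "ref" else p_map_alt
            if rng.random() < p_map:  # the fragment survives mapping
                mapped.append(allele)
        if mapped:
            called += 1
            # Pseudo-haploid call: keep one randomly chosen mapped read
            ref_calls += rng.choice(mapped) == "ref"
    return ref_calls / called if called else float("nan")

unbiased = pseudo_haploid_ref_fraction(20000, 3, p_map_ref=1.0, p_map_alt=1.0)
biased = pseudo_haploid_ref_fraction(20000, 3, p_map_ref=1.0, p_map_alt=0.9)
print(round(unbiased, 3), round(biased, 3))
```

In this toy model the expected reference fraction is p_map_ref / (p_map_ref + p_map_alt), so even a modest mapping disadvantage for the alternate allele shifts calls toward the reference.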
|
15
|
Genome-wide divergence among invasive populations of Aedes aegypti in California. BMC Genomics 2019; 20:204. [PMID: 30866822 PMCID: PMC6417271 DOI: 10.1186/s12864-019-5586-4] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 10/24/2018] [Accepted: 03/05/2019] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND In the summer of 2013, Aedes aegypti Linnaeus was first detected in three cities in central California (Clovis, Madera and Menlo Park). It has now been detected in multiple locations in central and southern CA as far south as San Diego and Imperial Counties. A number of published reports suggest that CA populations have been established from multiple independent introductions. RESULTS Here we report the first population genomics analyses of Ae. aegypti based on individual, field collected whole genome sequences. We analyzed 46 Ae. aegypti genomes to establish genetic relationships among populations from sites in California, Florida and South Africa. Based on 4.65 million high quality biallelic SNPs, we identified 3 major genetic clusters within California; one that includes all sample sites in the southern part of the state (South of Tehachapi mountain range) plus the town of Exeter in central California and two additional clusters in central California. CONCLUSIONS A lack of concordance between mitochondrial and nuclear genealogies suggests that the three founding populations were polymorphic for two main mitochondrial haplotypes prior to being introduced to California. One of these has been lost in the Clovis populations, possibly by a founder effect. Genome-wide comparisons indicate extensive differentiation between genetic clusters. Our observations support recent introductions of Ae. aegypti into California from multiple, genetically diverged source populations. Our data reveal signs of hybridization among diverged populations within CA. Genetic markers identified in this study will be of great value in pursuing classical population genetic studies which require larger sample sizes.
|
16
|
The fate of genes that cross species boundaries after a major hybridization event in a natural mosquito population. Mol Ecol 2018; 27:4978-4990. [DOI: 10.1111/mec.14947] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 04/18/2018] [Revised: 07/17/2018] [Accepted: 07/19/2018] [Indexed: 12/30/2022]
|
17
|
Abstract
Here we report the complete mitochondrial sequences of 70 individual field collected mosquito specimens from throughout Sub-Saharan Africa. We generated this dataset to identify species specific markers for the following Anopheles species and chromosomal forms: An. arabiensis, An. coluzzii (The Forest and Mopti chromosomal forms) and An. gambiae (The Bamako and Savannah chromosomal forms). The raw Illumina sequencing reads were mapped to the NC_002084 reference mitogenome sequence. A total of 783 single nucleotide polymorphisms (SNPs) were detected on the mitochondrial genome, of which 460 are singletons (58.7%). None of these SNPs are suitable as molecular markers to distinguish among An. arabiensis, An. coluzzii and An. gambiae or any of the chromosomal forms. The lack of species or chromosomal form specific markers is also reflected in the constructed phylogenetic tree, which shows no clear division among the operational taxonomic units considered here.
|
19
|
Fresh is best: Accurate SNP genotyping from koala scats. Ecol Evol 2018; 8:3139-3151. [PMID: 29607013 PMCID: PMC5869377 DOI: 10.1002/ece3.3765] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 05/05/2017] [Revised: 08/29/2017] [Accepted: 12/07/2017] [Indexed: 12/25/2022] Open
Abstract
Maintaining genetic diversity is a crucial component in conserving threatened species. For the iconic Australian koala, there is little genetic information on wild populations that is not either skewed by biased sampling methods (e.g., sampling effort skewed toward urban areas) or of limited usefulness due to low numbers of microsatellites used. The ability to genotype DNA extracted from koala scats using next‐generation sequencing technology will not only help resolve location sample bias but also improve the accuracy and scope of genetic analyses (e.g., neutral vs. adaptive genetic diversity, inbreeding, and effective population size). Here, we present the successful SNP genotyping (1272 SNP loci) of koala DNA extracted from scat, using a proprietary DArTseq™ protocol. We compare genotype results from two‐day‐old scat DNA and 14‐day‐old scat DNA to a blood DNA template, to test accuracy of scat genotyping. We find that DNA from fresher scat results in fewer loci with missing information than DNA from older scat; however, 14‐day‐old scat can still provide useful genetic information, depending on the research question. We also find that a subset of 209 conserved loci can accurately identify individual koalas, even from older scat samples. In addition, we find that DNA sequences identified from scat samples through the DArTseq™ process can provide genetic identification of koala diet species, bacterial and viral pathogens, and parasitic organisms.
|
20
|
Optimized Next-Generation Sequencing Genotype-Haplotype Calling for Genome Variability Analysis. Evol Bioinform Online 2017; 13:1176934317723884. [PMID: 28894353 PMCID: PMC5582667 DOI: 10.1177/1176934317723884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2016] [Accepted: 05/23/2017] [Indexed: 11/17/2022] Open
Abstract
The accurate estimation of nucleotide variability using next-generation sequencing data is challenged by the high number of sequencing errors produced by new sequencing technologies, especially for nonmodel species, where reference sequences may not be available and the read depth may be low due to limited budgets. The most popular single-nucleotide polymorphism (SNP) callers are designed to obtain a high SNP recovery and low false discovery rate but are not designed to account appropriately for the frequency of the variants. In contrast, algorithms designed to account for the frequency of SNPs give precise results for estimating the levels and the patterns of variability. These algorithms focus on the unbiased estimation of variability rather than on high recovery of SNPs. Here, we implemented a fast and optimized parallel algorithm that includes the method developed by Roesti et al. and Lynch, which estimates the genotype of each individual at each site, allowing both bases of the genotype, a single base, or none to be called. This algorithm does not consider the reference and is therefore independent of biases related to the reference nucleotide specified. The pipeline starts from a BAM file converted to pileup or mpileup format, and the software outputs a FASTA file. The new program not only reduces running times but also, through its improved use of resources, can be run on both smaller computers and large parallel computers, extending its benefits to a wider range of researchers. The output file can be analyzed using software for population genetics analysis, such as the R library PopGenome, the software VariScan, and the program mstatspop for analysis considering positions with missing data.
|
21
|
Efficiency of ddRAD target enriched sequencing across spiny rock lobster species (Palinuridae: Jasus). Sci Rep 2017; 7:6781. [PMID: 28754989 PMCID: PMC5533801 DOI: 10.1038/s41598-017-06582-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 01/04/2017] [Accepted: 06/13/2017] [Indexed: 01/05/2023] Open
Abstract
Double digest restriction site-associated DNA sequencing (ddRADseq) and target capture sequencing methods are used to explore population and phylogenetic questions in non-model organisms. ddRADseq offers a simple and reliable protocol for population genomic studies; however, it can result in a large amount of missing data due to allelic dropout. Target capture sequencing offers an opportunity to increase sequencing coverage with little missing data and consistent orthologous loci across samples, although this approach has generally been applied to conserved markers for deeper evolutionary questions. Here, we combine both methods to generate high quality sequencing data for population genomic studies of all marine lobster species from the genus Jasus. We designed probes based on ddRADseq libraries of two lobster species (Jasus edwardsii and Sagmariasus verreauxi) and evaluated the captured sequencing data in five other Jasus species. We validated 4,465 polymorphic loci amongst these species using a cost-effective sequencing protocol, of which 1,730 were recovered from all species, and 4,026 were present in at least three species. The method was also successfully applied to DNA samples obtained from museum specimens. These data will be further used to assess spatial-temporal genetic variation in Jasus species found in the Southern Hemisphere.
|
22
|
From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016; 118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022] Open
Abstract
Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the produced data contain subtle but complex types of errors, biases and uncertainties that impose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.
|
23
|
Increasing Genome Sampling and Improving SNP Genotyping for Genotyping-by-Sequencing with New Combinations of Restriction Enzymes. G3 (BETHESDA, MD.) 2016; 6:845-56. [PMID: 26818077 PMCID: PMC4825655 DOI: 10.1534/g3.115.025775] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 12/07/2015] [Accepted: 01/22/2016] [Indexed: 12/15/2022]
Abstract
Genotyping-by-sequencing (GBS) has emerged as a useful genomic approach for exploring genome-wide genetic variation. However, GBS commonly samples a genome unevenly and can generate a substantial amount of missing data. These technical features limit the power of various GBS-based genetic and genomic analyses. Here we present software called IgCoverage for in silico evaluation of genomic coverage through GBS with an individual restriction enzyme or a pair of restriction enzymes on one sequenced genome, and report a new set of 21 restriction enzyme combinations that can be applied to enhance GBS applications. These enzyme combinations were developed through an application of IgCoverage on 22 plant, animal, and fungus species with sequenced genomes, and some of them were empirically evaluated with different runs of Illumina MiSeq sequencing in 12 plant species. The in silico analysis of 22 organisms revealed up to eight times more genome coverage for the new combinations, which pair four- or five-cutter restriction enzymes, than for the commonly used enzyme combination PstI + MspI. The empirical evaluation of the new enzyme combination (HinfI + HpyCH4IV) in 12 plant species showed 1.7-6 times more genome coverage than PstI + MspI, and 2.3 times more genome coverage in dicots than monocots. Also, the SNP genotyping in 12 Arabidopsis and 12 rice plants revealed that HinfI + HpyCH4IV generated 7 and 1.3 times more SNPs (with 0-16.7% missing observations) than PstI + MspI, respectively. These findings demonstrate that these novel enzyme combinations can be utilized to increase genome sampling and improve SNP genotyping in various GBS applications.
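As a rough illustration of the kind of in silico evaluation described in this abstract (this is not IgCoverage itself), the Python sketch below digests a sequence with a pair of restriction enzymes and reports the fraction of bases that fall in fragments within a chosen size window. The recognition sites listed are the standard ones for these enzymes. The function names, the simplification that every cut is placed at the start of the recognition site, and the decision to count all fragments in the window regardless of which enzyme flanks them are assumptions made for illustration only.

```python
import re

# Recognition sites (N = any base). Real enzymes cut at a specific offset
# inside the site; here each cut is placed at the start of the matched site,
# which is a simplifying assumption.
SITES = {
    "PstI": "CTGCAG",
    "MspI": "CCGG",
    "HinfI": "GANTC",
    "HpyCH4IV": "ACGT",
}


def site_regex(site):
    """Turn a recognition site containing IUPAC 'N' into a regular expression."""
    return site.replace("N", "[ACGT]")


def cut_positions(seq, enzymes):
    """Sorted, de-duplicated start positions of all sites of the given enzymes."""
    positions = set()
    for name in enzymes:
        # A lookahead is used so that overlapping sites are all reported.
        pattern = re.compile("(?=" + site_regex(SITES[name]) + ")")
        positions.update(m.start() for m in pattern.finditer(seq))
    return sorted(positions)


def fragment_lengths(seq, enzymes):
    """Lengths of the fragments produced by cutting an upper-case sequence."""
    inner = [p for p in cut_positions(seq, enzymes) if p > 0]
    cuts = [0] + inner + [len(seq)]
    return [b - a for a, b in zip(cuts, cuts[1:])]


def coverage_fraction(seq, enzymes, min_len, max_len):
    """Fraction of the sequence lying in fragments of min_len..max_len bases."""
    kept = sum(n for n in fragment_lengths(seq, enzymes) if min_len <= n <= max_len)
    return kept / len(seq)
```

Calling `coverage_fraction(genome, ["HinfI", "HpyCH4IV"], 100, 500)` and the same call with `["PstI", "MspI"]` on one sequence gives a crude way to compare enzyme pairs; the 100 to 500 base window is an arbitrary example value.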
|
24
|
PSMC analysis of effective population sizes in molecular ecology and its application to black-and-white Ficedula flycatchers. Mol Ecol 2016; 25:1058-72. [PMID: 26797914 PMCID: PMC4793928 DOI: 10.1111/mec.13540] [Citation(s) in RCA: 148] [Impact Index Per Article: 18.5] [Received: 03/20/2015] [Revised: 12/15/2015] [Accepted: 01/07/2016] [Indexed: 12/12/2022]
Abstract
Climatic fluctuations during the Quaternary period governed the demography of species and contributed to population differentiation and ultimately speciation. Studies of these past processes have previously been hindered by a lack of means and genetic data to model changes in effective population size (Ne) through time. However, based on diploid genome sequences of high quality, the recently developed pairwise sequentially Markovian coalescent (PSMC) can estimate trajectories of changes in Ne over considerable time periods. We applied this approach to resequencing data from nearly 200 genomes of four species and several populations of the Ficedula species complex of black-and-white flycatchers. Ne curves of Atlas, collared, pied and semicollared flycatcher converged 1-2 million years ago (Ma) at an Ne of ≈ 200 000, likely reflecting the time when all four species last shared a common ancestor. Subsequent separate Ne trajectories are consistent with lineage splitting and speciation. All species showed evidence of population growth up until 100-200 thousand years ago (kya), followed by decline and then start of a new phase of population expansion. However, timing and amplitude of changes in Ne differed among species, and for pied flycatcher, the temporal dynamics of Ne differed between Spanish birds and central/northern European populations. This cautions against extrapolation of demographic inference between lineages and calls for adequate sampling to provide representative pictures of the coalescence process in different species or populations. We also empirically evaluate criteria for proper inference of demographic histories using PSMC and arrive at recommendations of using sequencing data with a mean genome coverage of ≥18X, a per-site filter of ≥10 reads and no more than 25% of missing data.
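The three recommended thresholds at the end of this abstract (mean coverage ≥18X, ≥10 reads per site, at most 25% missing data) can be turned into a simple pre-flight check. The Python sketch below is a minimal illustration under one assumption that the abstract does not spell out: a site counts as "missing" when its depth falls below the per-site filter. The function name and the returned dictionary keys are hypothetical.

```python
def psmc_suitability(depths, min_mean_cov=18.0, min_site_depth=10, max_missing=0.25):
    """Check per-site read depths against the thresholds quoted in the abstract.

    depths: one read depth per genomic site.
    A site is treated as missing if its depth is below min_site_depth
    (an assumption made for this sketch).
    """
    if not depths:
        raise ValueError("depths must not be empty")
    mean_cov = sum(depths) / len(depths)
    missing = sum(1 for d in depths if d < min_site_depth) / len(depths)
    return {
        "mean_coverage": mean_cov,
        "missing_fraction": missing,
        "passes": mean_cov >= min_mean_cov and missing <= max_missing,
    }
```

In practice the depth vector would come from a per-site depth report for one individual; only the threshold logic is shown here.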
|
25
|
Sampling strategies for frequency spectrum-based population genomic inference. BMC Evol Biol 2014; 14:254. [PMID: 25471595 PMCID: PMC4269862 DOI: 10.1186/s12862-014-0254-4] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.0] [Received: 07/24/2014] [Accepted: 11/24/2014] [Indexed: 01/25/2023] Open
Abstract
Background: The allele frequency spectrum (AFS) consists of counts of the number of single nucleotide polymorphism (SNP) loci with derived variants present at each given frequency in a sample. Multiple approaches have recently been developed for parameter estimation and calculation of model likelihoods based on the joint AFS from two or more populations. We conducted a simulation study of one of these approaches, implemented in the Python module δaδi, to compare parameter estimation and model selection accuracy given different sample sizes under one- and two-population models. Results: Our simulations included a variety of demographic models and two parameterizations that differed in the timing of events (divergence or size change). Using a number of SNPs reasonably obtained through next-generation sequencing approaches (10,000 - 50,000), accurate parameter estimates and model selection were possible for models with more ancient demographic events, even given relatively small numbers of sampled individuals. However, for recent events, larger numbers of individuals were required to achieve accuracy and precision in parameter estimates similar to that seen for models with older divergence or population size changes. We quantify i) the uncertainty in model selection, using tools from information theory, and ii) the accuracy and precision of parameter estimates, using the root mean squared error, as a function of the timing of demographic events, sample sizes used in the analysis, and complexity of the simulated models. Conclusions: Here, we illustrate the utility of the genome-wide AFS for estimating demographic history and provide recommendations to guide sampling in population genomics studies that seek to draw inference from the AFS. Our results indicate that larger samples of individuals (and thus larger AFS) provide greater power for model selection and parameter estimation for more recent demographic events.
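The definition of the AFS given at the start of this abstract translates directly into a counting exercise. The Python sketch below is a minimal illustration that assumes diploid individuals and genotypes coded as the number of derived alleles (0, 1 or 2) per individual; it does not use δaδi, and the function names are hypothetical.

```python
def allele_frequency_spectrum(genotypes):
    """Unfolded AFS from a list of sites.

    genotypes: list of sites, each a list with the derived-allele count
    (0, 1 or 2) of every diploid individual.
    Returns a list afs where afs[k] is the number of sites carrying
    exactly k derived copies in the sample of 2 * n_individuals chromosomes.
    """
    n_chrom = 2 * len(genotypes[0])
    afs = [0] * (n_chrom + 1)
    for site in genotypes:
        afs[sum(site)] += 1
    return afs


def joint_afs(geno_pop1, geno_pop2):
    """Two-population joint AFS: entry [i][j] counts the sites with i derived
    copies in population 1 and j derived copies in population 2."""
    n1 = 2 * len(geno_pop1[0])
    n2 = 2 * len(geno_pop2[0])
    jafs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for site1, site2 in zip(geno_pop1, geno_pop2):
        jafs[sum(site1)][sum(site2)] += 1
    return jafs
```

The length of the spectrum grows with the number of sampled chromosomes, which is why the abstract speaks of larger samples giving a "larger AFS".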
|
26
|
Abstract
The recent advances in sequencing throughput and genome assembly algorithms have established whole-genome shotgun (WGS) assemblies as the cornerstone of the genomic infrastructure for many species. WGS assemblies can be constructed with comparative ease and give a comprehensive representation of the gene space even of large and complex genomes. One major obstacle in utilizing WGS assemblies for important research applications such as gene isolation or comparative genomics has been the lack of chromosomal positioning and contextualization of short sequence contigs. Assigning chromosomal locations to sequence contigs required the construction and integration of genome-wide physical maps and dense genetic linkage maps as well as synteny to model species. Recently, methods to rapidly construct ultra-dense linkage maps encompassing millions of genetic markers from WGS sequencing data of segregating populations have made possible the direct assignment of genetic positions to short sequence contigs. Here, we review recent developments in the integration of WGS assemblies and sequence-based linkage maps, discuss challenges for further improvement of the methodology and outline possible applications building on genetically anchored WGS assemblies.
|
27
|
Pipeliner: software to evaluate the performance of bioinformatics pipelines for next-generation resequencing. Mol Ecol Resour 2014; 15:99-106. [PMID: 24890372 DOI: 10.1111/1755-0998.12286] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 01/15/2014] [Revised: 05/19/2014] [Accepted: 05/23/2014] [Indexed: 12/30/2022]
Abstract
The choice of technology and bioinformatics approach is critical in obtaining accurate and reliable information from next-generation sequencing (NGS) experiments. An increasing number of software tools and methodological guidelines are being published, but deciding which approach and experimental design to use can depend on the particularities of the species and on the aims of the study. This leaves researchers unable to make informed decisions on these central questions. To address these issues, we developed pipeliner - a tool to evaluate, by simulation, the performance of NGS pipelines in resequencing studies. Pipeliner provides a graphical interface allowing the users to write and test their own bioinformatics pipelines with publicly available or custom software. It computes a number of statistics summarizing the performance in SNP calling, including the recovery, sensitivity and false discovery rate for heterozygous and homozygous SNP genotypes. Pipeliner can be used to answer many practical questions, for example, for a limited amount of NGS effort, how many more reliable SNPs can be detected by doubling coverage and halving sample size, or what false discovery rate results from different SNP calling algorithms and options. Pipeliner thus allows researchers to carefully plan their study's sampling design and compare the suitability of alternative bioinformatics approaches for their specific study systems. Pipeliner is written in C++ and is freely available from http://github.com/brunonevado/Pipeliner.
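To make two of the summary statistics named in this abstract concrete, the Python sketch below compares a set of simulated (true) SNP positions with a set of called positions and reports sensitivity and false discovery rate. It uses the textbook definitions, sensitivity = TP / (TP + FN) and FDR = FP / (TP + FP); whether pipeliner defines its statistics in exactly this way is not stated in the abstract, so this is an assumption, and the function name is hypothetical.

```python
def snp_calling_performance(true_snps, called_snps):
    """Compare called SNP positions with the known (simulated) truth."""
    true_snps, called_snps = set(true_snps), set(called_snps)
    tp = len(true_snps & called_snps)   # true SNPs that were called
    fp = len(called_snps - true_snps)   # calls that are not true SNPs
    fn = len(true_snps - called_snps)   # true SNPs that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    fdr = fp / (tp + fp) if (tp + fp) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "sensitivity": sensitivity, "fdr": fdr}
```

Running this on the output of two simulated designs (for example, the same sequencing effort spent on more individuals or on deeper coverage) gives the kind of side-by-side comparison the abstract describes.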
|
28
|
Contrasting X-linked and autosomal diversity across 14 human populations. Am J Hum Genet 2014; 94:827-44. [PMID: 24836452 DOI: 10.1016/j.ajhg.2014.04.011] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Received: 03/05/2014] [Accepted: 04/15/2014] [Indexed: 12/29/2022] Open
Abstract
Contrasting the genetic diversity of the human X chromosome (X) and autosomes has facilitated understanding historical differences between males and females and the influence of natural selection. Previous studies based on smaller data sets have left questions regarding how empirical patterns extend to additional populations and which forces can explain them. Here, we address these questions by analyzing the ratio of X-to-autosomal (X/A) nucleotide diversity with the complete genomes of 569 females from 14 populations. Results show that X/A diversity is similar within each continental group but notably lower in European (EUR) and East Asian (ASN) populations than in African (AFR) populations. X/A diversity increases in all populations with increasing distance from genes, highlighting the stronger impact of diversity-reducing selection on X than on the autosomes. However, relative X/A diversity (between two populations) is invariant with distance from genes, suggesting that selection does not drive the relative reduction in X/A diversity in non-Africans (0.842 ± 0.012 for EUR-to-AFR and 0.820 ± 0.032 for ASN-to-AFR comparisons). Finally, an array of models with varying population bottlenecks, expansions, and migration from the latest studies of human demographic history account for about half of the observed reduction in relative X/A diversity from the expected value of 1. They predict values between 0.91 and 0.94 for EUR-to-AFR comparisons and between 0.91 and 0.92 for ASN-to-AFR comparisons. Further reductions can be predicted by more extreme demographic events in excess of those captured by the latest studies but, in the absence of these, also by historical sex-biased demographic events or other processes.
|
29
|
Exploring the occurrence of classic selective sweeps in humans using whole-genome sequencing data sets. Mol Biol Evol 2014; 31:1850-68. [PMID: 24694833 DOI: 10.1093/molbev/msu118] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.3] [Indexed: 12/21/2022] Open
Abstract
Genome-wide scans for selection have identified multiple regions of the human genome as being targeted by positive selection. However, only a small proportion has been replicated across studies, and the prevalence of positive selection as a mechanism of adaptive change in humans remains controversial. Here we explore the power of two haplotype-based statistics, the integrated haplotype score (iHS) and the Derived Intraallelic Nucleotide Diversity (DIND) test, in the context of next-generation sequencing data, and evaluate their robustness to demography and other selection modes. We show that these statistics are both powerful for the detection of recent positive selection, regardless of population history, and robust to variation in coverage, with DIND being insensitive to very low coverage. We apply these statistics to whole-genome sequence data sets from the 1000 Genomes Project and Complete Genomics. We found that putative targets of selection were highly significantly enriched in genic and nonsynonymous single nucleotide polymorphisms, and that DIND was more powerful than iHS in the context of small sample sizes, low-quality genotype calling, or poor coverage. As we excluded genomic confounders and alternative selection models, such as background selection, the observed enrichment attests to the action of recent, strong positive selection. Further support to the adaptive significance of these genomic regions came from their enrichment in functional variants detected by genome-wide association studies, informing the relationship between past selection and current benign and disease-related phenotypic variation. Our results indicate that hard sweeps targeting low-frequency standing variation have played a moderate, albeit significant, role in recent human evolution.
|
30
|
Resequencing studies of nonmodel organisms using closely related reference genomes: optimal experimental designs and bioinformatics approaches for population genomics. Mol Ecol 2014; 23:1764-79. [DOI: 10.1111/mec.12693] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Indexed: 12/30/2022]
|
31
|
Genetic diversity analysis of highly incomplete SNP genotype data with imputations: an empirical assessment. G3-GENES GENOMES GENETICS 2014; 4:891-900. [PMID: 24626289 PMCID: PMC4025488 DOI: 10.1534/g3.114.010942] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Indexed: 12/13/2022]
Abstract
Genotyping by sequencing (GBS) has recently emerged as a promising genomic approach for assessing genetic diversity on a genome-wide scale. However, there are concerns about the uniquely large imbalance in GBS genotype data. Although genotype imputation has been proposed to infer missing observations, little is known about the reliability of a genetic diversity analysis of GBS data with up to 90% of observations missing. Here we performed an empirical assessment of accuracy in genetic diversity analysis of highly incomplete single nucleotide polymorphism genotypes with imputations. Three large single-nucleotide polymorphism genotype data sets for corn, wheat, and rice were acquired, and missing data with up to 90% of missing observations were randomly generated and then imputed for missing genotypes with three map-independent imputation methods. Estimating heterozygosity and inbreeding coefficient from original, missing, and imputed data revealed variable patterns of bias from assessed levels of missingness and genotype imputation, but the estimation biases were smaller for missing data without genotype imputation. The estimates of genetic differentiation were rather robust up to 90% of missing observations but became substantially biased when missing genotypes were imputed. The estimates of topology accuracy for four representative samples of the groups of interest generally were reduced with increased levels of missing genotypes. Imputation based on probabilistic principal component analysis performed better in terms of topology accuracy than analyses of missing data without genotype imputation. These findings are not only significant for understanding the reliability of the genetic diversity analysis with respect to large missing data and genotype imputation but also are instructive for performing a proper genetic diversity analysis of highly incomplete GBS or other genotype data.
|
32
|
Assessing the effect of sequencing depth and sample size in population genetics inferences. PLoS One 2013; 8:e79667. [PMID: 24260275 PMCID: PMC3832539 DOI: 10.1371/journal.pone.0079667] [Citation(s) in RCA: 90] [Impact Index Per Article: 8.2] [Received: 05/30/2013] [Accepted: 09/23/2013] [Indexed: 12/31/2022] Open
Abstract
Next-Generation Sequencing (NGS) technologies have dramatically revolutionised research in many fields of genetics. The ability to sequence many individuals from one or multiple populations at a genomic scale has greatly enhanced population genetics studies and made it a data-driven discipline. Recently, researchers have proposed statistical modelling to address genotyping uncertainty associated with NGS data. However, an ongoing debate is whether it is more beneficial to increase the number of sequenced individuals or the per-sample sequencing depth for estimating genetic variation. Through extensive simulations, I assessed the accuracy of estimating nucleotide diversity, detecting polymorphic sites, and predicting population structure under different experimental scenarios. Results show that the greatest accuracy for estimating population genetics parameters is achieved by employing a large sample size, despite single individuals being sequenced at low depth. Under some circumstances, the minimum sequencing depth for obtaining accurate estimates of allele frequencies and identifying polymorphic sites is [Formula: see text], where both alleles are more likely to have been sequenced. On the other hand, inferences of population structure are more accurate at very large sample sizes, even with extremely low sequencing depth. This all points to the conclusion that under various experimental scenarios, in cost-limited population genetics studies, large sample sizes at low sequencing depth are desirable to achieve high accuracy. These findings will help researchers design their experimental set-ups and guide further investigation on the effect of protocol design for genetic research.
|
33
|
Abstract
We introduce a flexible and robust simulation-based framework to infer demographic parameters from the site frequency spectrum (SFS) computed on large genomic datasets. We show that our composite-likelihood approach allows one to study evolutionary models of arbitrary complexity, which cannot be tackled by other current likelihood-based methods. For simple scenarios, our approach compares favorably in terms of accuracy and speed with ∂a∂i, the current reference in the field, while showing better convergence properties for complex models. We first apply our methodology to non-coding genomic SNP data from four human populations. To infer their demographic history, we compare neutral evolutionary models of increasing complexity, including unsampled populations. We further show the versatility of our framework by extending it to the inference of demographic parameters from SNP chips with known ascertainment, such as that recently released by Affymetrix to study human origins. Whereas previous ways of handling ascertained SNPs were either restricted to a single population or only allowed the inference of divergence time between a pair of populations, our framework can correctly infer parameters of more complex models including the divergence of several populations, bottlenecks and migration. We apply this approach to the reconstruction of African demography using two distinct ascertained human SNP panels studied under two evolutionary models. The two SNP panels lead to globally very similar estimates and confidence intervals, and suggest an ancient divergence (>110 Ky) between Yoruba and San populations. Our methodology appears well suited to the study of complex scenarios from large genomic data sets.
|
34
|
Calculation of Tajima's D and other neutrality test statistics from low depth next-generation sequencing data. BMC Bioinformatics 2013; 14:289. [PMID: 24088262 PMCID: PMC4015034 DOI: 10.1186/1471-2105-14-289] [Citation(s) in RCA: 152] [Impact Index Per Article: 13.8] [Received: 05/06/2013] [Accepted: 09/25/2013] [Indexed: 11/10/2022] Open
Abstract
Background: A number of different statistics are used for detecting natural selection using DNA sequencing data, including statistics that are summaries of the frequency spectrum, such as Tajima's D. These statistics are now often applied in the analysis of Next Generation Sequencing (NGS) data. However, estimates of frequency spectra from NGS data are strongly affected by low sequencing coverage; the inherent technology-dependent variation in sequencing depth causes systematic differences in the value of the statistic among genomic regions. Results: We have developed an approach that accommodates the uncertainty of the data when calculating site frequency-based neutrality test statistics. A salient feature of this approach is that it implicitly solves the problems of varying sequencing depth and missing data, and avoids the need to infer variable sites for the analysis, thereby avoiding ascertainment problems introduced by a SNP discovery process. Conclusion: Using an empirical Bayes approach for fast computations, we show that this method produces results for low-coverage NGS data comparable to those achieved when the genotypes are known without uncertainty. We also validate the method in an analysis of data from the 1000 genomes project. The method is implemented in a fast framework which enables researchers to perform these neutrality tests on a genome-wide scale.
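For orientation, the Python sketch below computes the classical Tajima's D from a site frequency spectrum of known genotypes, that is, the reference case the abstract compares against. It is not the empirical Bayes approach of the paper, which works from uncertain low-depth data. The formula and constants are the standard ones from Tajima (1989); the function name and the input convention (`sfs[i]` is the number of sites with `i` derived copies among `n` chromosomes) are assumptions made for this illustration.

```python
import math


def tajimas_d(sfs):
    """Classical Tajima's D from an unfolded SFS of length n + 1.

    sfs[i] = number of sites with i derived copies among n chromosomes;
    the fixed classes sfs[0] and sfs[n] are ignored.
    """
    n = len(sfs) - 1
    if n < 4:
        # For n = 2 or 3 the variance term below is zero.
        raise ValueError("need at least 4 sampled chromosomes")
    seg = sum(sfs[1:n])  # number of segregating sites, S
    if seg == 0:
        raise ValueError("no segregating sites")

    # Mean number of pairwise differences (pi).
    pairs = n * (n - 1) / 2
    pi = sum(i * (n - i) * sfs[i] for i in range(1, n)) / pairs

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between pi and Watterson's estimator, scaled by its
    # estimated standard deviation.
    return (pi - seg / a1) / math.sqrt(e1 * seg + e2 * seg * (seg - 1))
```

An SFS proportional to 1/i (the neutral expectation) gives D close to zero, while an excess of singletons pushes D negative.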
|
35
|
Archived neonatal dried blood spot samples can be used for accurate whole genome and exome-targeted next-generation sequencing. Mol Genet Metab 2013; 110:65-72. [PMID: 23830478 DOI: 10.1016/j.ymgme.2013.06.004] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.8] [Received: 05/11/2013] [Revised: 06/03/2013] [Accepted: 06/04/2013] [Indexed: 11/23/2022]
Abstract
Dried blood spot samples (DBSS) have been collected and stored for decades as part of newborn screening programmes worldwide. Representing almost an entire population under a certain age and collected with virtually no bias, the Newborn Screening Biobanks are of immense value in medical studies, for example, to examine the genetics of various disorders. We have previously demonstrated that DNA extracted from a fraction (2×3.2mm discs) of an archived DBSS can be whole genome amplified (wgaDNA) and used for accurate array genotyping. However, until now, it has been uncertain whether wgaDNA from DBSS can be used for accurate whole genome sequencing (WGS) and exome sequencing (WES). This study examined two individuals represented by three different types of samples each: whole-blood (reference samples), 3-year-old DBSS spotted with reference material (refDBSS), and 27- to 29-year-old archived neonatal DBSS (neoDBSS) stored at -20°C in the Danish Newborn Screening Biobank. The reference samples were genotyped using an Illumina Omni2.5M array, and all samples were sequenced on a HiSeq 2000 Paired-End flow cell. First, we compared the array single nucleotide polymorphism (SNP) genotype data to the single nucleotide variation (SNV) calls from WGS and WES. We also compared the WGS and WES reference sample SNV calls to the DBSS SNV calls. The overall performance of the archived DBSS was similar to the whole blood reference sample. Plotting the error rates relative to coverage revealed that the error rates of DBSS were similar to those of their reference samples. SNVs called with a coverage <8× had error rates between 1.5 and 35%, whereas the error rates of SNVs called with a coverage ≥8× were <1.5%. In conclusion, the wgaDNA amplified from both new and old neonatal DBSS performs as well as the whole-blood reference samples with regard to error rates, strongly indicating that neonatal DBSS collected shortly after birth and stored for decades comprise an excellent resource for NGS studies of disease.
|
36
|
Abstract
Over the past few years, new high-throughput DNA sequencing technologies have dramatically increased speed and reduced sequencing costs. However, the use of these sequencing technologies is often challenged by errors and biases associated with the bioinformatical methods used for analyzing the data. In particular, the use of naïve methods to identify polymorphic sites and infer genotypes can inflate downstream analyses. Recently, explicit modeling of genotype probability distributions has been proposed as a method for taking genotype call uncertainty into account. Based on this idea, we propose a novel method for quantifying population genetic differentiation from next-generation sequencing data. In addition, we present a strategy for investigating population structure via principal components analysis. Through extensive simulations, we compare the new method herein proposed to approaches based on genotype calling and demonstrate a marked improvement in estimation accuracy for a wide range of conditions. We apply the method to a large-scale genomic data set of domesticated and wild silkworms sequenced at low coverage. We find that we can infer the fine-scale genetic structure of the sampled individuals, suggesting that employing this new method is useful for investigating the genetic relationships of populations sampled at low coverage.
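The idea of modeling genotype probability distributions instead of committing to a single genotype call can be illustrated with a deliberately simplified model. The Python sketch below is not the method proposed in this abstract: it assumes a biallelic diploid site, independent reads, a single symmetric error rate under which a miscalled base always shows the other allele, and a Hardy-Weinberg prior at a given alternative allele frequency. All function names and default values are hypothetical.

```python
def genotype_posteriors(n_ref, n_alt, error=0.01, alt_freq=0.5):
    """Posterior probabilities of genotypes with 0, 1 or 2 alternative alleles.

    n_ref, n_alt: counts of reads supporting each allele at the site.
    error: per-read probability of observing the wrong allele (assumed).
    alt_freq: alternative allele frequency used for the Hardy-Weinberg prior.
    """
    likelihoods = []
    for g in (0, 1, 2):
        # Probability that a single read shows the alternative allele.
        p_alt = (g / 2) * (1 - error) + (1 - g / 2) * error
        likelihoods.append(p_alt ** n_alt * (1 - p_alt) ** n_ref)
    q = alt_freq
    priors = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]
    joint = [like * prior for like, prior in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]


def expected_genotype(posteriors):
    """Posterior mean number of alternative alleles (between 0 and 2)."""
    return posteriors[1] + 2 * posteriors[2]
```

With only a few reads the posterior stays spread over several genotypes, and downstream statistics can average over that spread (for example through `expected_genotype`) instead of relying on a hard call.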
|