1
Avalos-Pacheco A, Cronjäger MC, Jenkins PA, Hein J. An almost infinite sites model. Theor Popul Biol 2024; 160:49-61. [PMID: 39454763 DOI: 10.1016/j.tpb.2024.10.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Revised: 09/11/2024] [Accepted: 10/11/2024] [Indexed: 10/28/2024]
Abstract
MOTIVATION: A main challenge in molecular evolution is to find computationally efficient mutation models with flexible assumptions that properly reflect genetic variation. The infinite sites model assumes that each mutation event occurs at a site that has never previously mutated, i.e. it does not allow recurrent mutations. This is reasonable for low mutation rates and makes statistical inference much more tractable. However, recurrent mutations are common enough to be observable from genetic variation data, even in species with low per-site mutation rates such as humans. The finite sites model, on the other hand, allows for recurrent mutations but is computationally infeasible to work with in most cases. In this work, we bridge these two approaches by developing a novel molecular evolution model, the almost infinite sites model (AISM), that both admits recurrent mutations and is tractable. We provide a recursive characterization of the likelihood of our proposed model under complete linkage and outline a parsimonious approximation scheme for computing it. RESULTS: We show the usefulness of our model on simulated and human mitochondrial data. Our results show that the AISM, in combination with a constraint on the total number of mutation events, can recover accurate approximations to the maximum likelihood estimator of the mutation rate. AVAILABILITY AND IMPLEMENTATION: An implementation of our model is freely available, along with code for reproducing our computational experiments, at https://github.com/Cronjaeger/almost-infinite-sites-recursions.
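As a rough illustration of why recurrent mutations are "observable" in data, the sketch below applies the classical four-gamete test to binary haplotypes: under the infinite sites model with complete linkage, no pair of sites can display all four combinations 00, 01, 10 and 11. This is not the AISM likelihood recursion from the paper; the function name and the toy data are hypothetical and chosen only for this example.

```python
from itertools import combinations

def four_gamete_violations(haplotypes):
    """Return pairs of site indices at which all four gametes occur.

    `haplotypes` is a list of equal-length strings over {'0', '1'}
    (an assumed toy encoding, not the format used by the authors' code).
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    violations = []
    for i, j in combinations(range(n_sites), 2):
        # Collect the distinct two-site combinations seen in the sample.
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations

# Toy data: the first sample fits a tree with one mutation per site,
# the second needs a recurrent mutation (or recombination) to explain it.
compatible = ["000", "100", "110", "001"]
incompatible = ["00", "10", "01", "11"]

print(four_gamete_violations(compatible))
print(four_gamete_violations(incompatible))
```

An empty result means the sample is consistent with the infinite sites assumption; any reported pair signals that a model allowing recurrent mutation, such as the one described above, may be needed.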
Affiliation(s)
- Alejandra Avalos-Pacheco
  - Institute of Applied Statistics, Johannes Kepler University Linz, 4040 Linz, Austria; Harvard-MIT Center for Regulatory Science, Harvard University, 210 Longwood Ave, Boston, MA 02155, United States of America
- Mathias C Cronjäger
  - Department of Statistics, University of Oxford, 24-29 St Giles', Oxford OX1 3LB, United Kingdom; Novo Nordisk, 2880 Bagsværd, Denmark
- Paul A Jenkins
  - Department of Statistics, University of Warwick, Coventry, CV4 7AL, United Kingdom; Department of Computer Science, University of Warwick, Coventry, CV4 7AL, United Kingdom; The Alan Turing Institute, British Library, 96 Euston Road, London NW1 2DB, United Kingdom
- Jotun Hein
  - Department of Statistics, University of Oxford, 24-29 St Giles', Oxford OX1 3LB, United Kingdom.
2
Qu YN, Rao YZ, Qi YL, Li YX, Li A, Palmer M, Hedlund BP, Shu WS, Evans PN, Nie GX, Hua ZS, Li WJ. Panguiarchaeum symbiosum, a potential hyperthermophilic symbiont in the TACK superphylum. Cell Rep 2023; 42:112158. [PMID: 36827180 DOI: 10.1016/j.celrep.2023.112158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Revised: 12/27/2022] [Accepted: 02/09/2023] [Indexed: 02/24/2023]
Abstract
The biology of Korarchaeia remains elusive due to the lack of genome representatives. Here, we reconstruct 10 closely related metagenome-assembled genomes from hot spring habitats and place them into a single species, proposed herein as Panguiarchaeum symbiosum. Functional investigation suggests that Panguiarchaeum symbiosum is strictly anaerobic and grows exclusively in thermal habitats by fermenting peptides coupled with sulfide and hydrogen production to dispose of electrons. Due to its inability to biosynthesize archaeal membranes, amino acids, and purines, this species likely exists in a symbiotic lifestyle similar to DPANN archaea. Population metagenomics and metatranscriptomic analyses demonstrated that genes associated with amino acid/peptide uptake and cell attachment exhibited positive selection and were highly expressed, supporting the proposed proteolytic catabolism and symbiotic lifestyle. Our study sheds light on the metabolism, evolution, and potential symbiotic lifestyle of Panguiarchaeum symbiosum, which may be a unique host-dependent archaeon within the TACK superphylum.
Affiliation(s)
- Yan-Ni Qu
  - State Key Laboratory of Biocontrol, Guangdong Provincial Key Laboratory of Plant Resources and Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), School of Life Sciences, Sun Yat-Sen University, Guangzhou 510275, PR China
- Yang-Zhi Rao
  - State Key Laboratory of Biocontrol, Guangdong Provincial Key Laboratory of Plant Resources and Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), School of Life Sciences, Sun Yat-Sen University, Guangzhou 510275, PR China
- Yan-Ling Qi
  - Chinese Academy of Sciences Key Laboratory of Urban Pollutant Conversion, Department of Environmental Science and Engineering, University of Science and Technology of China, Hefei 230026, China
- Yu-Xian Li
  - Chinese Academy of Sciences Key Laboratory of Urban Pollutant Conversion, Department of Environmental Science and Engineering, University of Science and Technology of China, Hefei 230026, China
- Andrew Li
  - Chinese Academy of Sciences Key Laboratory of Urban Pollutant Conversion, Department of Environmental Science and Engineering, University of Science and Technology of China, Hefei 230026, China
- Marike Palmer
  - School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV 89154, USA
- Brian P Hedlund
  - School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV 89154, USA; Nevada Institute of Personalized Medicine, University of Nevada Las Vegas, Las Vegas, NV 89154, USA
- Wen-Sheng Shu
  - School of Life Sciences, South China Normal University, Guangzhou 510631, PR China
- Paul N Evans
  - The Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, St Lucia, QLD 4072, Australia
- Guo-Xing Nie
  - College of Fisheries, Henan Normal University, Xinxiang, China
- Zheng-Shuang Hua
  - Chinese Academy of Sciences Key Laboratory of Urban Pollutant Conversion, Department of Environmental Science and Engineering, University of Science and Technology of China, Hefei 230026, China.
- Wen-Jun Li
  - State Key Laboratory of Biocontrol, Guangdong Provincial Key Laboratory of Plant Resources and Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), School of Life Sciences, Sun Yat-Sen University, Guangzhou 510275, PR China; State Key Laboratory of Desert and Oasis Ecology, Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences, Urumqi 830011, PR China.
3
Cockerill CA, Hasselgren M, Dussex N, Dalén L, von Seth J, Angerbjörn A, Wallén JF, Landa A, Eide NE, Flagstad Ø, Ehrich D, Sokolov A, Sokolova N, Norén K. Genomic Consequences of Fragmentation in the Endangered Fennoscandian Arctic Fox (Vulpes lagopus). Genes (Basel) 2022; 13:2124. [PMID: 36421799 PMCID: PMC9690288 DOI: 10.3390/genes13112124] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/07/2022] [Revised: 10/06/2022] [Accepted: 10/10/2022] [Indexed: 11/17/2022]
Abstract
Accelerating climate change is causing severe habitat fragmentation in the Arctic, threatening the persistence of many cold-adapted species. The Scandinavian arctic fox (Vulpes lagopus) population is highly fragmented: despite a once continuous, circumpolar distribution, it has struggled to recover from a demographic bottleneck in the late 19th century. The future persistence of the entire Scandinavian population is highly dependent on the northernmost Fennoscandian subpopulations (Scandinavia and the Kola Peninsula) to provide a link to the viable Siberian population. By analyzing 43 arctic fox genomes, we quantified genomic variation and inbreeding in these populations. Signatures of genome erosion increased from Siberia to northern Sweden, indicating a stepping-stone model of connectivity. In northern Fennoscandia, runs of homozygosity (ROH) were on average ~1.47-fold longer than ROH found in Siberia, stretching across almost entire scaffolds. Moreover, consistent with recent inbreeding, northern Fennoscandia harbored more homozygous deleterious mutations, whereas Siberia had more in the heterozygous state. This study underlines the value of documenting genome erosion following population fragmentation to identify areas requiring conservation priority. With the increasing fragmentation and isolation of Arctic habitats due to global warming, understanding the genomic and demographic consequences is vital for maintaining evolutionary potential and preventing local extinctions.
Affiliation(s)
- Malin Hasselgren
  - Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
- Nicolas Dussex
  - Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
  - Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691 Stockholm, Sweden
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, 11418 Stockholm, Sweden
- Love Dalén
  - Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
  - Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691 Stockholm, Sweden
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, 11418 Stockholm, Sweden
- Johanna von Seth
  - Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
  - Centre for Palaeogenetics, Svante Arrhenius väg 20C, 10691 Stockholm, Sweden
- Anders Angerbjörn
  - Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
- Johan F. Wallén
  - Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
- Arild Landa
  - Norwegian Institute for Nature Research, 7485 Trondheim, Norway
- Nina E. Eide
  - Norwegian Institute for Nature Research, 7485 Trondheim, Norway
- Dorothee Ehrich
  - Department of Arctic and Marine Biology, UiT Arctic University of Tromsø, 9037 Tromsø, Norway
- Aleksandr Sokolov
  - Arctic Research Station of Institute of Plant and Animal Ecology, Ural Branch, Russian Academy of Sciences, Zelenaya Gorka Str. 21, 629400 Labytnangi, Russia
- Natalya Sokolova
  - Arctic Research Station of Institute of Plant and Animal Ecology, Ural Branch, Russian Academy of Sciences, Zelenaya Gorka Str. 21, 629400 Labytnangi, Russia
- Karin Norén
  - Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
4
Analysis of Microbial Community Dynamics during the Acclimatization Period of a Membrane Bioreactor Treating Table Olive Processing Wastewater. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9183647] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
Abstract
Biological treatment of table olive processing wastewater (TOPW) may be problematic due to its high organic and polyphenolic compound content. Biomass acclimatization is a necessary, yet sensitive, stage for efficient TOPW biological treatment. Next-generation sequencing technologies can provide valuable insights into this critical process step. An aerobic membrane bioreactor (MBR) system, initially inoculated with municipal activated sludge, was acclimatized to treat TOPW. Operational stability and bioremediation efficiency were monitored for approx. three months, whereas microbial community dynamics and metabolic adaptation were assessed through metagenomic and metatranscriptomic analysis. A swift change was identified in both the prokaryotic and eukaryotic bio-community after introduction of TOPW in the MBR, and a new diverse bio-community was established. Thauera and Paracoccus spp. are dominant contributors to the metabolic activity of the stable bio-community, which resulted in over 90% and 85% removal efficiency of total organic carbon and total polyphenols, respectively. This is the first study assessing the microbial community dynamics in a well-defined MBR process treating TOPW, offering guidance in the start-up of large-scale applications.
5
Liao KH, Hon WK, Tang CY, Hsieh WP. MetaSMC: a coalescent-based shotgun sequence simulator for evolving microbial populations. Bioinformatics 2019; 35:1677-1685. [PMID: 30321266 DOI: 10.1093/bioinformatics/bty840] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 02/27/2018] [Revised: 09/09/2018] [Accepted: 10/11/2018] [Indexed: 01/26/2023]
Abstract
MOTIVATION: High-throughput sequencing technology has revolutionized the study of metagenomics and cancer evolution. In a relatively simple environment, metagenomic sequencing data are dominated by a few species. By analyzing the alignment of reads from microbial species, single nucleotide polymorphisms can be discovered and the evolutionary history of the populations can be reconstructed. The ever-increasing read length will allow more detailed analysis of the evolutionary history of microbial or tumor cell populations. A simulator of shotgun sequences from such populations will be helpful in the development or evaluation of analysis algorithms. RESULTS: Here, we describe an efficient algorithm, MetaSMC, which simulates reads from evolving microbial populations. Based on coalescent theory, our simulator supports all evolutionary scenarios supported by other coalescent simulators. In addition, the simulator supports various substitution models, including Jukes-Cantor, HKY85 and generalized time-reversible models. The simulator also supports mutator phenotypes by allowing different mutation rates and substitution models in different subpopulations. Our algorithm ignores unnecessary chromosomal segments and thus is more efficient than the standard coalescent when recombination is frequent. We show that the process behind our algorithm is equivalent to the Sequentially Markov Coalescent with an incomplete sample. The accuracy of our algorithm was evaluated using summary statistics and likelihood curves derived from Monte Carlo integration over a large number of random genealogies. AVAILABILITY AND IMPLEMENTATION: MetaSMC is written in C. The source code is available at https://github.com/tarjxvf/metasmc. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ki-Hok Liao
  - Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan
- Wing-Kai Hon
  - Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan
- Chuan-Yi Tang
  - Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan; Department of Computer Science and Information Engineering, Providence University, Taichung, Taiwan
- Wen-Ping Hsieh
  - Institute of Statistics, National Tsing-Hua University, Hsinchu, Taiwan
6
Peering into the Genetic Makeup of Natural Microbial Populations Using Metagenomics. POPULATION GENOMICS: MICROORGANISMS 2018. [DOI: 10.1007/13836_2018_14] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 12/29/2022]
7
Pedro PM, Piper R, Bazilli Neto P, Cullen L, Dropa M, Lorencao R, Matté MH, Rech TC, Rufato MO, Silva M, Turati DT. Metabarcoding Analyses Enable Differentiation of Both Interspecific Assemblages and Intraspecific Divergence in Habitats With Differing Management Practices. ENVIRONMENTAL ENTOMOLOGY 2017; 46:1381-1389. [PMID: 29069398 DOI: 10.1093/ee/nvx166] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 06/07/2023]
Abstract
Spatial and temporal collections provide important data on the distribution and dispersal of species. Regional-scale monitoring invariably involves hundreds of thousands of samples, the identification of which is costly in both time and money. In this respect, metabarcoding is increasingly seen as a viable alternative to traditional morphological identification, as it eliminates the taxonomic bottleneck previously impeding such work. Here, we assess whether terrestrial arthropods collected from 12 pitfall traps in two farms of a coffee (Coffea arabica L.) growing region of Sao Paulo State, Brazil could differentiate the two locations. We sequenced a portion of the cytochrome oxidase 1 region from minimally processed pools of samples and assessed inter- and intraspecific parameters across the two locations. Our sequencing was sufficient to circumscribe the overall diversity, which was characterized by few dominant taxa, principally small Coleoptera species and Collembola. Thirty-four operational taxonomic units were detected and of these, eight were present in significantly different quantities between the two farms. Analysis of community-wide Beta diversity grouped collections based on farm provenance. Moreover, haplotype-based analyses for a species of Xyleborus beetle showed that there is significant population genetic structuring between the two farms, suggesting limited dispersal. We conclude that metabarcoding can provide important management input and, considering the rapidly declining cost of sequencing, suggest that large-scale monitoring is now feasible and can identify both the taxa present as well as contribute information about genetic diversity of focal species.
Affiliation(s)
- Ross Piper
  - The Faculty of Biological Sciences, University of Leeds, United Kingdom
8
Advanced Research and Data Methods in Women's Health: Big Data Analytics, Adaptive Studies, and the Road Ahead. Obstet Gynecol 2017; 129:249-264. [PMID: 28079771 DOI: 10.1097/aog.0000000000001865] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 12/29/2022]
Abstract
Technical advances in science have had broad implications in reproductive and women's health care. Recent innovations in population-level data collection and storage have made available an unprecedented amount of data for analysis while computational technology has evolved to permit processing of data previously thought too dense to study. "Big data" is a term used to describe data that are a combination of dramatically greater volume, complexity, and scale. The number of variables in typical big data research can readily be in the thousands, challenging the limits of traditional research methodologies. Regardless of what it is called, advanced data methods, predictive analytics, or big data, this unprecedented revolution in scientific exploration has the potential to dramatically assist research in obstetrics and gynecology broadly across subject matter. Before implementation of big data research methodologies, however, potential researchers and reviewers should be aware of strengths, strategies, study design methods, and potential pitfalls. Examination of big data research examples contained in this article provides insight into the potential and the limitations of this data science revolution and practical pathways for its useful implementation.
9
Inferring Heterozygosity from Ancient and Low Coverage Genomes. Genetics 2016; 205:317-332. [PMID: 27821432 PMCID: PMC5223511 DOI: 10.1534/genetics.116.189985] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 04/01/2016] [Accepted: 10/19/2016] [Indexed: 12/30/2022]
Abstract
While genetic diversity can be quantified accurately from high coverage sequencing data, it is often desirable to obtain such estimates from data with low coverage, either to save costs or because of low DNA quality, as is observed for ancient samples. Here, we introduce a method to accurately infer heterozygosity probabilistically from sequences with average coverage <1× of a single individual. The method relaxes the infinite sites assumption of previous methods, does not require a reference sequence, except for the initial alignment of the sequencing data, and takes into account both variable sequencing errors and potential postmortem damage. It is thus also applicable to nonmodel organisms and ancient genomes. Since error rates as reported by sequencing machines are generally distorted and require recalibration, we also introduce a method to accurately infer recalibration parameters in the presence of postmortem damage. This method does not require knowledge about the underlying genome sequence, but instead works with haploid data (e.g., from the X-chromosome from mammalian males) and integrates over the unknown genotypes. Using extensive simulations we show that a few megabasepairs of haploid data are sufficient for accurate recalibration, even at average coverages as low as 1×. At similar coverages, our method also produces very accurate estimates of heterozygosity down to 10⁻⁴ within windows of about 1 Mbp. We further illustrate the usefulness of our approach by inferring genome-wide patterns of diversity for several ancient human samples, and we found that 3000–5000-year-old samples showed diversity patterns comparable to those of modern humans. In contrast, two European hunter-gatherer samples exhibited not only considerably lower levels of diversity than modern samples, but also highly distinct distributions of diversity along their genomes. Interestingly, these distributions were also very different between the two samples, supporting earlier conclusions of a highly diverse and structured population in Europe prior to the arrival of farming.
10
Wu SH, Rodrigo AG. Estimation of evolutionary parameters using short, random and partial sequences from mixed samples of anonymous individuals. BMC Bioinformatics 2015; 16:357. [PMID: 26536860 PMCID: PMC4634753 DOI: 10.1186/s12859-015-0810-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2015] [Accepted: 10/30/2015] [Indexed: 11/17/2022]
Abstract
Background: Over the last decade, next generation sequencing (NGS) has become widely available, and is now the sequencing technology of choice for most researchers. Nonetheless, NGS presents a challenge for evolutionary biologists who wish to estimate evolutionary genetic parameters from a mixed sample of unlabelled or untagged individuals, especially when the reconstruction of full-length haplotypes can be unreliable. We propose two novel approaches, least squares estimation (LS) and Approximate Bayesian Computation Markov chain Monte Carlo estimation (ABC-MCMC), to infer evolutionary genetic parameters from a collection of short-read sequences obtained from a mixed sample of anonymous DNA, using only the frequencies of nucleotides at each site and without reconstructing either the full-length alignment or the phylogeny. Results: We used simulations to evaluate the performance of these algorithms, and our results demonstrate that LS performs poorly because bootstrap 95% Confidence Intervals (CIs) tend to under- or over-estimate the true values of the parameters. In contrast, the 95% Highest Posterior Density (HPD) intervals recovered from ABC-MCMC enclosed the true parameter values with a rate approximately equivalent to that obtained using BEAST, a program that implements a Bayesian MCMC estimation of evolutionary parameters using full-length sequences. Because there is a loss of information with the use of sitewise nucleotide frequencies alone, the ABC-MCMC 95% HPDs are larger than those obtained by BEAST. Conclusion: We propose two novel algorithms to estimate evolutionary genetic parameters based on the proportion of each nucleotide. The LS method cannot be recommended as a standalone method for evolutionary parameter estimation. On the other hand, parameters recovered by ABC-MCMC are comparable to those obtained using BEAST, but with larger 95% HPDs. One major advantage of ABC-MCMC is that computational time scales linearly with the number of short-read sequences, and is independent of the number of full-length sequences in the original data. This allows us to perform the analysis on NGS datasets with large numbers of short read fragments. The source code for ABC-MCMC is available at https://github.com/stevenhwu/SF-ABC.
Affiliation(s)
- Steven H Wu
  - Biodesign Institute, Arizona State University, Tempe, AZ, 85287, USA; Department of Biology, Duke University, Box 90338, Durham, NC, 27708, USA.
- Allen G Rodrigo
  - Department of Biology, Duke University, Box 90338, Durham, NC, 27708, USA; The National Evolutionary Synthesis Center, Durham, NC, 27705, USA.
11
Chen H. Population genetic studies in the genomic sequencing era. DONG WU XUE YAN JIU = ZOOLOGICAL RESEARCH 2015; 36:223-32. [PMID: 26228473 DOI: 10.13918/j.issn.2095-8137.2015.4.223] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/01/2022]
Abstract
Recent advances in high-throughput sequencing technologies have revolutionized the field of population genetics. Data now routinely contain genomic level polymorphism information, and the low cost of DNA sequencing enables researchers to investigate tens of thousands of subjects at a time. This provides an unprecedented opportunity to address fundamental evolutionary questions, while posing challenges on traditional population genetic theories and methods. This review provides an overview of the recent methodological developments in the field of population genetics, specifically methods used to infer ancient population history and investigate natural selection using large-sample, large-scale genetic data. Several open questions are also discussed at the end of the review.
Affiliation(s)
- Hua Chen
  - Center for Computational Genomics, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101,
12
McElroy K, Thomas T, Luciani F. Deep sequencing of evolving pathogen populations: applications, errors, and bioinformatic solutions. MICROBIAL INFORMATICS AND EXPERIMENTATION 2014; 4:1. [PMID: 24428920 PMCID: PMC3902414 DOI: 10.1186/2042-5783-4-1] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Received: 10/14/2013] [Accepted: 01/07/2014] [Indexed: 12/15/2022]
Abstract
Deep sequencing harnesses the high throughput nature of next generation sequencing technologies to generate population samples, treating information contained in individual reads as meaningful. Here, we review applications of deep sequencing to pathogen evolution. Pioneering deep sequencing studies from the virology literature are discussed, such as whole genome Roche-454 sequencing analyses of the dynamics of the rapidly mutating pathogens hepatitis C virus and HIV. Extension of the deep sequencing approach to bacterial populations is then discussed, including the impacts of emerging sequencing technologies. While it is clear that deep sequencing has unprecedented potential for assessing the genetic structure and evolutionary history of pathogen populations, bioinformatic challenges remain. We summarise current approaches to overcoming these challenges, in particular methods for detecting low frequency variants in the context of sequencing error and reconstructing individual haplotypes from short reads.
Affiliation(s)
- Kerensa McElroy
  - Centre for Marine Bio-Innovation and School of Biotechnology and Biomolecular Sciences, UNSW, Sydney, NSW 2052, Australia.
13
Korneliussen TS, Moltke I, Albrechtsen A, Nielsen R. Calculation of Tajima's D and other neutrality test statistics from low depth next-generation sequencing data. BMC Bioinformatics 2013; 14:289. [PMID: 24088262 PMCID: PMC4015034 DOI: 10.1186/1471-2105-14-289] [Citation(s) in RCA: 169] [Impact Index Per Article: 14.1] [Received: 05/06/2013] [Accepted: 09/25/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND: A number of different statistics are used for detecting natural selection using DNA sequencing data, including statistics that are summaries of the frequency spectrum, such as Tajima's D. These statistics are now often being applied in the analysis of Next Generation Sequencing (NGS) data. However, estimates of frequency spectra from NGS data are strongly affected by low sequencing coverage; the inherent technology-dependent variation in sequencing depth causes systematic differences in the value of the statistic among genomic regions. RESULTS: We have developed an approach that accommodates the uncertainty of the data when calculating site frequency based neutrality test statistics. A salient feature of this approach is that it implicitly solves the problems of varying sequencing depth and missing data, and avoids the need to infer variable sites for the analysis, thereby avoiding ascertainment problems introduced by a SNP discovery process. CONCLUSION: Using an empirical Bayes approach for fast computations, we show that this method produces results for low-coverage NGS data comparable to those achieved when the genotypes are known without uncertainty. We also validate the method in an analysis of data from the 1000 genomes project. The method is implemented in a fast framework which enables researchers to perform these neutrality tests on a genome-wide scale.
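For orientation, the statistic named in this abstract can be sketched in its classical form, which assumes genotypes are known without error. This is the textbook definition of Tajima's D, not the empirical Bayes procedure for low-depth data that the paper develops; the function names and the string encoding of haplotypes are assumptions made for this example.

```python
import math
from itertools import combinations

def pairwise_diversity(haplotypes):
    """Mean number of differing sites over all pairs of sequences (pi)."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)

def tajimas_d(n, num_segregating, pi):
    """Classical Tajima's D for n sequences, S segregating sites, diversity pi."""
    if num_segregating == 0:
        return float("nan")  # D is undefined when there is no variation
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    s = num_segregating
    # Difference between the two estimators of theta, scaled by its
    # estimated standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Toy numbers: 10 sequences, 16 segregating sites, pi of about 3.89.
d_example = tajimas_d(10, 16, 3.888889)
print(round(d_example, 3))
```

With low-coverage data neither S nor pi is observed directly, which is the gap the approach described in the abstract addresses by working with genotype uncertainty instead of called genotypes.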
Affiliation(s)
- Thorfinn Sand Korneliussen
  - Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Oestervoldgade 5-7, DK-1350, Copenhagen, Denmark.
14
Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. Robust demographic inference from genomic and SNP data. PLoS Genet 2013; 9:e1003905. [PMID: 24204310 PMCID: PMC3812088 DOI: 10.1371/journal.pgen.1003905] [Citation(s) in RCA: 907] [Impact Index Per Article: 75.6] [Received: 02/28/2013] [Accepted: 09/11/2013] [Indexed: 01/09/2023]
Abstract
We introduce a flexible and robust simulation-based framework to infer demographic parameters from the site frequency spectrum (SFS) computed on large genomic datasets. We show that our composite-likelihood approach allows one to study evolutionary models of arbitrary complexity, which cannot be tackled by other current likelihood-based methods. For simple scenarios, our approach compares favorably in terms of accuracy and speed with ∂a∂i, the current reference in the field, while showing better convergence properties for complex models. We first apply our methodology to non-coding genomic SNP data from four human populations. To infer their demographic history, we compare neutral evolutionary models of increasing complexity, including unsampled populations. We further show the versatility of our framework by extending it to the inference of demographic parameters from SNP chips with known ascertainment, such as that recently released by Affymetrix to study human origins. Whereas previous ways of handling ascertained SNPs were either restricted to a single population or only allowed the inference of divergence time between a pair of populations, our framework can correctly infer parameters of more complex models including the divergence of several populations, bottlenecks and migration. We apply this approach to the reconstruction of African demography using two distinct ascertained human SNP panels studied under two evolutionary models. The two SNP panels lead to globally very similar estimates and confidence intervals, and suggest an ancient divergence (>110 Ky) between Yoruba and San populations. Our methodology appears well suited to the study of complex scenarios from large genomic data sets.
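The summary statistic at the centre of this framework, the site frequency spectrum, can be illustrated with a minimal sketch that tabulates an unfolded SFS from binary haplotypes. It only computes the observed spectrum and does not attempt the simulation-based composite-likelihood inference described in the abstract; the function name and the 0/1 string encoding (1 = derived allele) are assumptions for this example.

```python
from collections import Counter

def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry k-1 counts sites whose derived allele occurs k times.

    `haplotypes` is a list of equal-length strings over {'0', '1'},
    where '1' marks the derived allele (assumed encoding).
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    # Derived-allele count at each site, tallied across the sample.
    counts = Counter(
        sum(int(h[site]) for h in haplotypes) for site in range(n_sites)
    )
    # Only polymorphic classes 1..n-1 are reported.
    return [counts.get(k, 0) for k in range(1, n)]

sample = ["1100", "1010", "1000", "0001"]
sfs = site_frequency_spectrum(sample)
print(sfs)
```

Here the first site is carried by three of the four sequences and the other three sites are singletons, so the spectrum has three entries for frequency classes 1, 2 and 3.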
Affiliation(s)
- Laurent Excoffier
- CMPG, Institute of Ecology and Evolution, Berne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Isabelle Dupanloup
- CMPG, Institute of Ecology and Evolution, Berne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Emilia Huerta-Sánchez
- Center for Theoretical Evolutionary Genomics, Department of Integrative Biology, University of California, Berkeley, Berkeley, California, United States of America
- Vitor C. Sousa
- CMPG, Institute of Ecology and Evolution, Berne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Matthieu Foll
- CMPG, Institute of Ecology and Evolution, Berne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
|
15
|
Ezawa K, Innan H. Theoretical framework of population genetics with somatic mutations taken into account: application to copy number variations in humans. Heredity (Edinb) 2013; 111:364-74. [PMID: 23981956 DOI: 10.1038/hdy.2013.59] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 08/07/2012] [Revised: 01/05/2013] [Accepted: 05/10/2013] [Indexed: 11/09/2022]
Abstract
Traditionally, population genetics focuses on the dynamics of frequencies of alleles acquired by mutations on germ-lines, because only such mutations are heritable. Typical genotyping experiments, however, use DNA from some somatic tissues such as blood, which harbors somatic mutations at the current generation in addition to germ-line mutations accumulated since the most recent common ancestor of the sample. This common practice may sometimes cause erroneous interpretations of polymorphism data, unless we properly understand the role of somatic mutations in population genetics. We here introduce a very basic theoretical framework of population genetics with somatic mutations taken into account. It is easy to imagine that somatic mutations at the current generation simply add individual-specific variations, as errors in mutation detection do. Our theory quantifies this increment under various conditions. We find that the major contribution of somatic mutations plus errors is to very rare variants, particularly to singletons. The relative contribution is markedly large when mutations are deleterious. Because negative selection also increases rare variants, it is important to distinguish the roles of these mutually confounding factors when we interpret the data, even after correcting for demography. We apply this theory to human copy number variations (CNVs), for which the composite effect of somatic mutations and errors may not be negligible. Using genome-wide CNV data, we demonstrate how the joint action of the two factors, selection and somatic mutations plus errors, shapes the observed pattern of polymorphism.
Affiliation(s)
- K Ezawa
- School of Advanced Sciences, The Graduate University for Advanced Studies, Hayama, Japan
|
16
|
Abstract
High-throughput shotgun sequence data make it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual's genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an individual are limited to low coverage, preventing the confident calling of genotypes necessary to directly count the proportion of heterozygous sites. Here, we present a method for estimating an individual's genome-wide rate of heterozygosity from low-coverage sequence data, without an intermediate step that calls genotypes. Our method jointly learns the shared allele distribution between the individual and a panel of other individuals, together with the sequencing error distributions and the reference bias. We show our method works well, first, by its performance on simulated sequence data and, second, on real sequence data where we obtain estimates using low-coverage data consistent with those from higher coverage. We apply our method to obtain estimates of the rate of heterozygosity for 11 humans from diverse worldwide populations and through this analysis reveal the complex dependency of local sequencing coverage on the true underlying heterozygosity, which complicates the estimation of heterozygosity from sequence data. We show how we can use filters to correct for the confounding arising from sequencing depth. We find in practice that ratios of heterozygosity are more interpretable than absolute estimates and show that we obtain excellent conformity of ratios of heterozygosity with previous estimates from higher-coverage data.
|
17
|
Prabhakara S, Malhotra R, Acharya R, Poss M. Mutant-Bin: Unsupervised Haplotype Estimation of Viral Population Diversity Without Reference Genome. J Comput Biol 2013; 20:453-63. [DOI: 10.1089/cmb.2012.0174] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022]
Affiliation(s)
- Shruthi Prabhakara
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA
- Raunaq Malhotra
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA
- Raj Acharya
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA
- Mary Poss
- Department of Biology and Veterinary and Biomedical Sciences, Pennsylvania State University, University Park, PA
|
18
|
Burger PA, Palmieri N. Estimating the population mutation rate from a de novo assembled Bactrian camel genome and cross-species comparison with dromedary ESTs. J Hered 2013; 105:839-46. [PMID: 23454912 PMCID: PMC4201309 DOI: 10.1093/jhered/est005] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 12/30/2022]
Abstract
The Bactrian camel (Camelus bactrianus) and the dromedary (Camelus dromedarius) are among the last species that have been domesticated around 3000-6000 years ago. During domestication, strong artificial (anthropogenic) selection has shaped the livestock, creating a huge amount of phenotypes and breeds. Hence, domestic animals represent a unique resource to understand the genetic basis of phenotypic variation and adaptation. Similar to its late domestication history, the Bactrian camel is also among the last livestock animals to have its genome sequenced and deciphered. As no genomic data have been available until recently, we generated a de novo assembly by shotgun sequencing of a single male Bactrian camel. We obtained 1.6 Gb genomic sequences, which correspond to more than half of the Bactrian camel's genome. The aim of this study was to identify heterozygous single-nucleotide polymorphisms (SNPs) and to estimate population parameters and nucleotide diversity based on an individual camel. With an average 6.6-fold coverage, we detected over 116 000 heterozygous SNPs and recorded a genome-wide nucleotide diversity similar to that of other domesticated ungulates. More than 20 000 (85%) dromedary expressed sequence tags successfully aligned to our genomic draft. Our results provide a template for future association studies targeting economically relevant traits and to identify changes underlying the process of camel domestication and environmental adaptation.
Affiliation(s)
- Pamela A Burger
- From the Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, 1210 Wien, Austria/Europe
- Nicola Palmieri
- From the Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, 1210 Wien, Austria/Europe
|
19
|
McCormack JE, Hird SM, Zellmer AJ, Carstens BC, Brumfield RT. Applications of next-generation sequencing to phylogeography and phylogenetics. Mol Phylogenet Evol 2013; 66:526-38. [DOI: 10.1016/j.ympev.2011.12.007] [Citation(s) in RCA: 445] [Impact Index Per Article: 37.1] [Received: 09/13/2011] [Revised: 12/02/2011] [Accepted: 12/05/2011] [Indexed: 01/09/2023]
|
20
|
Nielsen R, Korneliussen T, Albrechtsen A, Li Y, Wang J. SNP calling, genotype calling, and sample allele frequency estimation from New-Generation Sequencing data. PLoS One 2012; 7:e37558. [PMID: 22911679 PMCID: PMC3404070 DOI: 10.1371/journal.pone.0037558] [Citation(s) in RCA: 255] [Impact Index Per Article: 19.6] [Received: 12/27/2011] [Accepted: 04/25/2012] [Indexed: 12/22/2022]
Abstract
We present a statistical framework for estimation and application of sample allele frequency spectra from New-Generation Sequencing (NGS) data. In this method, we first estimate the allele frequency spectrum using maximum likelihood. In contrast to previous methods, the likelihood function is calculated using a dynamic programming algorithm and numerically optimized using analytical derivatives. We then use a Bayesian method for estimating the sample allele frequency in a single site, and show how the method can be used for genotype calling and SNP calling. We also show how the method can be extended to various other cases including cases with deviations from Hardy-Weinberg equilibrium. We evaluate the statistical properties of the methods using simulations and by application to a real data set.
Affiliation(s)
- Rasmus Nielsen
- BGI-Shenzhen, Shenzhen, China
- Departments of Integrative Biology and Statistics, University of California, Berkeley, California, United States of America
- Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Jun Wang
- BGI-Shenzhen, Shenzhen, China
- Department of Biology, University of Copenhagen, Copenhagen, Denmark
|
21
|
Liu B, Faller LL, Klitgord N, Mazumdar V, Ghodsi M, Sommer DD, Gibbons TR, Treangen TJ, Chang YC, Li S, Stine OC, Hasturk H, Kasif S, Segrè D, Pop M, Amar S. Deep sequencing of the oral microbiome reveals signatures of periodontal disease. PLoS One 2012; 7:e37919. [PMID: 22675498 PMCID: PMC3366996 DOI: 10.1371/journal.pone.0037919] [Citation(s) in RCA: 269] [Impact Index Per Article: 20.7] [Received: 12/12/2011] [Accepted: 04/30/2012] [Indexed: 11/18/2022]
Abstract
The oral microbiome, the complex ecosystem of microbes inhabiting the human mouth, harbors several thousands of bacterial types. The proliferation of pathogenic bacteria within the mouth gives rise to periodontitis, an inflammatory disease known to also constitute a risk factor for cardiovascular disease. While much is known about individual species associated with pathogenesis, the system-level mechanisms underlying the transition from health to disease are still poorly understood. Through the sequencing of the 16S rRNA gene and of whole community DNA we provide a glimpse at the global genetic, metabolic, and ecological changes associated with periodontitis in 15 subgingival plaque samples, four from each of two periodontitis patients, and the remaining samples from three healthy individuals. We also demonstrate the power of whole-metagenome sequencing approaches in characterizing the genomes of key players in the oral microbiome, including an unculturable TM7 organism. We reveal the disease microbiome to be enriched in virulence factors, and adapted to a parasitic lifestyle that takes advantage of the disrupted host homeostasis. Furthermore, diseased samples share a common structure that was not found in completely healthy samples, suggesting that the disease state may occupy a narrow region within the space of possible configurations of the oral microbiome. Our pilot study demonstrates the power of high-throughput sequencing as a tool for understanding the role of the oral microbiome in periodontal disease. Despite a modest level of sequencing (~2 lanes Illumina 76 bp PE) and high human DNA contamination (up to ~90%) we were able to partially reconstruct several oral microbes and to preliminarily characterize some systems-level differences between the healthy and diseased oral microbiomes.
Affiliation(s)
- Bo Liu
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Department of Computer Science, University of Maryland, College Park, Maryland, United States of America
- Lina L. Faller
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Niels Klitgord
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Varun Mazumdar
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Mohammad Ghodsi
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Department of Computer Science, University of Maryland, College Park, Maryland, United States of America
- Daniel D. Sommer
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Theodore R. Gibbons
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Biological Sciences Graduate Program, University of Maryland, College Park, Maryland, United States of America
- Todd J. Treangen
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- The McKusick-Nathans Institute for Genetic Medicine, The Johns Hopkins University School of Medicine, Baltimore, Maryland, United States of America
- Yi-Chien Chang
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Shan Li
- Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland, United States of America
- O. Colin Stine
- Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland, United States of America
- Hatice Hasturk
- The Forsyth Institute, Department of Periodontology, Cambridge, Massachusetts, United States of America
- Simon Kasif
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Children’s Informatics Program, Harvard-Massachusetts Institute of Technology Division of Health Sciences and Technology, Boston, Massachusetts, United States of America
- Daniel Segrè
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Department of Biology, Boston University, Boston, Massachusetts, United States of America
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Mihai Pop
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
- Department of Computer Science, University of Maryland, College Park, Maryland, United States of America
- Biological Sciences Graduate Program, University of Maryland, College Park, Maryland, United States of America
- Salomon Amar
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Center for Anti-Inflammatory Therapeutics; Boston University Goldman School of Dental Medicine, Boston, Massachusetts, United States of America
|
22
|
Crawford JE, Lazzaro BP. Assessing the accuracy and power of population genetic inference from low-pass next-generation sequencing data. Front Genet 2012; 3:66. [PMID: 22536207 PMCID: PMC3334522 DOI: 10.3389/fgene.2012.00066] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.9] [Received: 01/26/2012] [Accepted: 04/05/2012] [Indexed: 01/17/2023]
Abstract
Next-generation sequencing (NGS) technologies have made it possible to address population genetic questions in almost any system, but high error rates associated with such data can introduce significant biases into downstream analyses, necessitating careful experimental design and interpretation in studies based on short-read sequencing. Exploration of population genetic analyses based on NGS has revealed some of the potential biases, but previous work has emphasized parameters relevant to human population genetics and further examination of parameters relevant to other systems is necessary, including situations where sample sizes are small and genetic variation is high. To assess experimental power to address several principal objectives of population genetic studies under these conditions, we simulated population samples under selective sweep, population growth, and population subdivision models and tested the power to accurately infer population genetic parameters from sequence polymorphism data obtained through simulated 4×, 8×, and 15× read depth sequence data. We found that estimates of population genetic differentiation and population growth parameters were systematically biased when inference was based on 4× sequencing, but biases were markedly reduced at even 8× read depth. We also found that the power to identify footprints of positive selection depends on an interaction between read depth and the strength of selection, with strong selection being recovered consistently at all read depths, but weak selection requiring deeper read depths for reliable detection. Although we have explored only a small subset of the many possible experimental designs and population genetic models, using only one SNP-calling approach, our results reveal some general patterns and provide some assessment of what biases could be expected under similar experimental structures.
|
23
|
Liu X. jPopGen Suite: population genetic analysis of DNA polymorphism from nucleotide sequences with errors. Methods Ecol Evol 2012; 3:624-627. [PMID: 22905315 DOI: 10.1111/j.2041-210x.2012.00194.x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/27/2022]
Abstract
1. Next-generation sequencing (NGS) is being increasingly used in ecological and evolutionary studies. Though promising, NGS is known to be error-prone. Sequencing error can cause significant bias for population genetic analysis of a sequence sample. 2. We present jPopGen Suite, an integrated tool for population genetic analysis of DNA polymorphisms from nucleotide sequences. It is specially designed for data with a non-negligible error rate, although it serves well for "error-free" data. It implements several methods for estimating the population mutation rate, population growth rate, and conducting neutrality tests. 3. jPopGen Suite facilitates the population genetic analysis of NGS data in various applications, and is freely available for non-commercial users at http://sites.google.com/site/jpopgen/.
Affiliation(s)
- Xiaoming Liu
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, 1200 Herman Pressler Drive, Houston, TX 77030, USA
|
24
|
Amorim J, Vidal R, Lacerda-Junior G, Dias J, Brendel M, Rezende R, Cascardo J. A simple boiling-based DNA extraction for RAPD profiling of landfarm soil to provide representative metagenomic content. Genetics and Molecular Research 2012; 11:182-9. [DOI: 10.4238/2012.january.27.5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/03/2022]
|
25
|
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 2011; 12:443-51. [PMID: 21587300 PMCID: PMC3593722 DOI: 10.1038/nrg2986] [Citation(s) in RCA: 894] [Impact Index Per Article: 63.9] [Indexed: 12/14/2022]
Abstract
Meaningful analysis of next-generation sequencing (NGS) data, which are produced extensively by genetics and genomics studies, relies crucially on the accurate calling of SNPs and genotypes. Recently developed statistical methods both improve and quantify the considerable uncertainty associated with genotype calling, and will especially benefit the growing number of studies using low- to medium-coverage data. We review these methods and provide a guide for their use in NGS studies.
Affiliation(s)
- Rasmus Nielsen
- Department of Integrative Biology, University of California, Berkeley, CA 94720, USA.
|
26
|
Luca F, Hudson RR, Witonsky DB, Di Rienzo A. A reduced representation approach to population genetic analyses and applications to human evolution. Genome Res 2011; 21:1087-98. [PMID: 21628451 DOI: 10.1101/gr.119792.110] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Indexed: 12/25/2022]
Abstract
Second-generation sequencing technologies allow surveys of sequence variation on an unprecedented scale. However, despite the rapid decrease in sequencing costs, collecting whole-genome sequence data on a population scale is still prohibitive for many laboratories. We have implemented an inexpensive, reduced representation protocol for preparing resequencing targets, and we have developed the analytical tools necessary for making population genetic inferences. This approach can be applied to any species for which a draft or complete reference genome sequence is available. The new tools we have developed include methods for aligning reads, calling genotypes, and incorporating sample-specific sequencing error rates in the estimate of evolutionary parameters. When applied to 19 individuals from a total of 18 human populations, our approach allowed sampling regions that are largely overlapping across individuals and that are representative of the entire genome. The resequencing data were used to test the serial founder model of human dispersal and to estimate the time of the Out of Africa migration. Our results also represent the first attempt to provide a time frame for the colonization of Australia based on large-scale resequencing data.
Affiliation(s)
- Francesca Luca
- Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
|
27
|
Jenkins PA, Song YS. The effect of recurrent mutation on the frequency spectrum of a segregating site and the age of an allele. Theor Popul Biol 2011; 80:158-73. [PMID: 21550359 DOI: 10.1016/j.tpb.2011.04.001] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Received: 01/27/2011] [Revised: 04/14/2011] [Accepted: 04/19/2011] [Indexed: 11/28/2022]
Abstract
The sample frequency spectrum of a segregating site is the probability distribution of a sample of alleles from a genetic locus, conditional on observing the sample to be polymorphic. This distribution is widely used in population genetic inferences, including statistical tests of neutrality in which a skew in the observed frequency spectrum across independent sites is taken as a signature of departure from neutral evolution. Theoretical aspects of the frequency spectrum have been well studied and several interesting results are available, but they are usually under the assumption that a site has undergone at most one mutation event in the history of the sample. Here, we extend previous theoretical results by allowing for at most two mutation events per site, under a general finite allele model in which the mutation rate is independent of current allelic state but the transition matrix is otherwise completely arbitrary. Our results apply to both nested and nonnested mutations. Only the former has been addressed previously, whereas here we show it is the latter that is more likely to be observed except for very small sample sizes. Further, for any mutation transition matrix, we obtain the joint sample frequency spectrum of the two mutant alleles at a triallelic site, and derive a closed-form formula for the expected age of the younger of the two mutations given their frequencies in the population. Several large-scale resequencing projects for various species are presently under way and the resulting data will include some triallelic polymorphisms. The theoretical results described in this paper should prove useful in population genomic analyses of such data.
Affiliation(s)
- Paul A Jenkins
- Computer Science Division, University of California, Berkeley, USA.
|
28
|
Detecting directional selection in the presence of recent admixture in African-Americans. Genetics 2010; 187:823-35. [PMID: 21196524 DOI: 10.1534/genetics.110.122739] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Indexed: 01/10/2023]
Abstract
We investigate the performance of tests of neutrality in admixed populations using plausible demographic models for African-American history as well as resequencing data from African and African-American populations. The analysis of both simulated and human resequencing data suggests that recent admixture does not result in an excess of false-positive results for neutrality tests based on the frequency spectrum after accounting for the population growth in the parental African population. Furthermore, when simulating positive selection, Tajima's D, Fu and Li's D, and haplotype homozygosity have lower power to detect population-specific selection using individuals sampled from the admixed population than from the nonadmixed population. Fay and Wu's H test, however, has more power to detect selection using individuals from the admixed population than from the nonadmixed population, especially when the selective sweep ended long ago. Our results have implications for interpreting recent genome-wide scans for positive selection in human populations.
|
29
|
Haubold B, Reed FA, Pfaffelhuber P. Alignment-free estimation of nucleotide diversity. Bioinformatics 2010; 27:449-55. [PMID: 21156730 DOI: 10.1093/bioinformatics/btq689] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 11/14/2022]
Abstract
MOTIVATION Sequencing capacity is currently growing more rapidly than CPU speed, leading to an analysis bottleneck in many genome projects. Alignment-free sequence analysis methods tend to be more efficient than their alignment-based counterparts. They may, therefore, be important in the long run for keeping sequence analysis abreast with sequencing. RESULTS We derive and implement an alignment-free estimator of the number of pairwise mismatches. Our implementation of this estimator, pim, is based on an enhanced suffix array and inherits the superior time and memory efficiency of this data structure. Simulations demonstrate that the estimator is accurate if mutations are distributed randomly along the chromosome. While real data often deviates from this ideal, the estimator remains useful for identifying regions of low genetic diversity using a sliding window approach. We demonstrate this by applying it to the complete genomes of 37 strains of Drosophila melanogaster, and to the genomes of two closely related Drosophila species, D. simulans and D. sechellia. In both cases, we detect the diversity minimum and discuss its biological implications.
Affiliation(s)
- Bernhard Haubold
- Department of Evolutionary Genetics, Albert-Ludwigs University, Freiburg, Germany
|
30
|
Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat Genet 2010; 42:969-72. [PMID: 20890277 DOI: 10.1038/ng.680] [Citation(s) in RCA: 254] [Impact Index Per Article: 16.9] [Received: 02/08/2010] [Accepted: 09/08/2010] [Indexed: 12/19/2022]
Abstract
Targeted capture combined with massively parallel exome sequencing is a promising approach to identify genetic variants implicated in human traits. We report exome sequencing of 200 individuals from Denmark with targeted capture of 18,654 coding genes and sequence coverage of each individual exome at an average depth of 12-fold. On average, about 95% of the target regions were covered by at least one read. We identified 121,870 SNPs in the sample population, including 53,081 coding SNPs (cSNPs). Using a statistical method for SNP calling and an estimation of allelic frequencies based on our population data, we derived the allele frequency spectrum of cSNPs with a minor allele frequency greater than 0.02. We identified a 1.8-fold excess of deleterious, non-synonymous cSNPs over synonymous cSNPs in the low-frequency range (minor allele frequencies between 2% and 5%). This excess was more pronounced for X-linked SNPs, suggesting that deleterious substitutions are primarily recessive.
|
31
|
Haubold B, Pfaffelhuber P, Lynch M. mlRho - a program for estimating the population mutation and recombination rates from shotgun-sequenced diploid genomes. Mol Ecol 2010; 19 Suppl 1:277-84. [PMID: 20331786 DOI: 10.1111/j.1365-294x.2009.04482.x] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.0] [Indexed: 11/26/2022]
Abstract
Improvements in sequencing technology over the past 5 years are leading to routine application of shotgun sequencing in the fields of ecology and evolution. However, the theory to estimate evolutionary parameters from these data is still being worked out. Here we present an extension and implementation of part of this theory, mlRho. This program can efficiently compute the following three maximum likelihood estimators based on shotgun sequence data obtained from single diploid individuals: the population mutation rate (4Nₑμ), the sequencing error rate, and the population recombination rate (4Nₑc). We demonstrate the accuracy of mlRho by applying it to simulated data sets. In addition, we analyse the genomes of the sea squirt Ciona intestinalis and the water flea Daphnia pulex. Ciona intestinalis is an obligate outcrosser, while D. pulex is a cyclic parthenogen, and we discuss how these contrasting life histories are reflected in our parameter estimates. The program mlRho is freely available from http://guanine.evolbio.mpg.de/mlRho.
Affiliation(s)
- Bernhard Haubold
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany.
|
32
|
Morgan JL, Darling AE, Eisen JA. Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One 2010; 5:e10209. [PMID: 20419134 PMCID: PMC2855710 DOI: 10.1371/journal.pone.0010209] [Citation(s) in RCA: 146] [Impact Index Per Article: 9.7] [Received: 12/11/2009] [Accepted: 03/12/2010] [Indexed: 12/03/2022] Open
Abstract
Background Microbial life dominates the earth, but many species are difficult or even impossible to study under laboratory conditions. Sequencing DNA directly from the environment, a technique commonly referred to as metagenomics, is an important tool for cataloging microbial life. This culture-independent approach involves collecting samples that include microbes in them, extracting DNA from the samples, and sequencing the DNA. A sample may contain many different microorganisms, macroorganisms, and even free-floating environmental DNA. A fundamental challenge in metagenomics has been estimating the abundance of organisms in a sample based on the frequency with which the organism's DNA was observed in reads generated via DNA sequencing. Methodology/Principal Findings We created mixtures of ten microbial species for which genome sequences are known. Each mixture contained an equal number of cells of each species. We then extracted DNA from the mixtures, sequenced the DNA, and measured the frequency with which genomic regions from each organism were observed in the sequenced DNA. We found that the observed frequency of reads mapping to each organism did not reflect the equal numbers of cells that were known to be included in each mixture. The relative organism abundances varied significantly depending on the DNA extraction and sequencing protocol utilized. Conclusions/Significance We describe a new data resource for measuring the accuracy of metagenomic binning methods, created by in vitro-simulation of a metagenomic community. Our in vitro simulation can be used to complement previous in silico benchmark studies. In constructing a synthetic community and sequencing its metagenome, we encountered several sources of observation bias that likely affect most metagenomic experiments to date and present challenges for comparative metagenomic studies. DNA preparation methods have a particularly profound effect in our study, implying that samples prepared with different protocols are not suitable for comparative metagenomics.
Affiliation(s)
- Jenna L. Morgan
- Department of Medical Microbiology and Immunology, University of California Davis, Davis, California, United States of America
- Department of Evolution and Ecology, University of California Davis, Davis, California, United States of America
- United States Department of Energy Joint Genome Institute, Walnut Creek, California, United States of America
- Aaron E. Darling
- Department of Medical Microbiology and Immunology, University of California Davis, Davis, California, United States of America
- Department of Evolution and Ecology, University of California Davis, Davis, California, United States of America
- Jonathan A. Eisen
- Department of Medical Microbiology and Immunology, University of California Davis, Davis, California, United States of America
- Department of Evolution and Ecology, University of California Davis, Davis, California, United States of America
- United States Department of Energy Joint Genome Institute, Walnut Creek, California, United States of America
|
33
|
Pool JE, Hellmann I, Jensen JD, Nielsen R. Population genetic inference from genomic sequence variation. Genome Res 2010; 20:291-300. [PMID: 20067940 DOI: 10.1101/gr.079509.108] [Citation(s) in RCA: 147] [Impact Index Per Article: 9.8] [Indexed: 01/23/2023]
Abstract
Population genetics has evolved from a theory-driven field with little empirical data into a data-driven discipline in which genome-scale data sets test the limits of available models and computational analysis methods. In humans and a few model organisms, analyses of whole-genome sequence polymorphism data are currently under way. And in light of the falling costs of next-generation sequencing technologies, such studies will soon become common in many other organisms as well. Here, we assess the challenges to analyzing whole-genome sequence polymorphism data, and we discuss the potential of these data to yield new insights concerning population history and the genomic prevalence of natural selection.
Affiliation(s)
- John E Pool
- Department of Integrative Biology, University of California, Berkeley, Berkeley, California 94720, USA
|
34
|
Liu X, Fu YX, Maxwell TJ, Boerwinkle E. Estimating population genetic parameters and comparing model goodness-of-fit using DNA sequences with error. Genome Res 2009; 20:101-9. [PMID: 19952140 DOI: 10.1101/gr.097543.109] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 01/09/2023]
Abstract
It is known that sequencing error can bias estimation of evolutionary or population genetic parameters. This problem is more prominent in deep resequencing studies because of their large sample size n, and a higher probability of error at each nucleotide site. We propose a new method based on the composite likelihood of the observed SNP configurations to infer population mutation rate theta = 4N(e)mu, population exponential growth rate R, and error rate epsilon, simultaneously. Using simulation, we show the combined effects of the parameters, theta, n, epsilon, and R on the accuracy of parameter estimation. We compared our maximum composite likelihood estimator (MCLE) of theta with other theta estimators that take into account the error. The results show the MCLE performs well when the sample size is large or the error rate is high. Using parametric bootstrap, composite likelihood can also be used as a statistic for testing the model goodness-of-fit of the observed DNA sequences. The MCLE method is applied to sequence data on the ANGPTL4 gene in 1832 African American and 1045 European American individuals.
Affiliation(s)
- Xiaoming Liu
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA.
|
35
|
Abstract
Variation in gene expression constitutes an important source of biological variability within and between populations that is likely to contribute significantly to phenotypic diversity. Recent conceptual, technical, and methodological advances have enabled the genome-scale dissection of transcriptional variation. Here, we outline common approaches for detecting gene expression quantitative trait loci, and summarize the insights gleaned from these studies regarding the genetic architecture of transcriptional variation and the nature of regulatory alleles. Particular emphasis is placed on human studies, and we discuss experimental designs that ensure that increasingly large and complex studies continue to advance our understanding of gene expression variation. We conclude by discussing the evolution of gene expression levels, and we explore prospects for leveraging new technological developments to investigate inherited variation in gene expression in even greater depth.
Affiliation(s)
- Daniel A Skelly
- Department of Genome Sciences, University of Washington, Seattle, Washington, 98195, USA.
|
36
|
Johnson PLF, Slatkin M. Inference of microbial recombination rates from metagenomic data. PLoS Genet 2009; 5:e1000674. [PMID: 19798447 PMCID: PMC2745702 DOI: 10.1371/journal.pgen.1000674] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Received: 06/01/2009] [Accepted: 09/02/2009] [Indexed: 11/18/2022] Open
Abstract
Metagenomic sequencing projects from environments dominated by a small number of species produce genome-wide population samples. We present a two-site composite likelihood estimator of the scaled recombination rate, rho = 2N(e)c, that operates on metagenomic assemblies in which each sequenced fragment derives from a different individual. This new estimator properly accounts for sequencing error, as quantified by per-base quality scores, and missing data, as inferred from the placement of reads in a metagenomic assembly. We apply our estimator to data from a sludge metagenome project to demonstrate how this method will elucidate the rates of exchange of genetic material in natural microbial populations. Surprisingly, for a fixed amount of sequencing, this estimator has lower variance than similar methods that operate on more traditional population genetic samples of comparable size. In addition, we can infer variation in recombination rate across the genome because metagenomic projects sample genetic diversity genome-wide, not just at particular loci. The method itself makes no assumption specific to microbial populations, opening the door for application to any mixed population sample where the number of individuals sampled is much greater than the number of fragments sequenced.
Affiliation(s)
- Philip L F Johnson
- Biophysics Graduate Group, University of California Berkeley, Berkeley, California, United States of America.
|
37
|
Barrick JE, Lenski RE. Genome-wide mutational diversity in an evolving population of Escherichia coli. Cold Spring Harbor Symposia on Quantitative Biology 2009; 74:119-29. [PMID: 19776167 DOI: 10.1101/sqb.2009.74.018] [Citation(s) in RCA: 127] [Impact Index Per Article: 7.9] [Indexed: 01/24/2023]
Abstract
The level of genetic variation in a population is the result of a dynamic tension between evolutionary forces. Mutations create variation, certain frequency-dependent interactions may preserve diversity, and natural selection purges variation. New sequencing technologies offer unprecedented opportunities to discover and characterize the diversity present in evolving microbial populations on a whole-genome scale. By sequencing mixed-population samples, we have identified single-nucleotide polymorphisms (SNPs) present at various points in the history of an Escherichia coli population that has evolved for almost 20 years from a founding clone. With 50-fold genome coverage, we were able to catch beneficial mutations as they swept to fixation, discover contending beneficial alleles that were eliminated by clonal interference, and detect other minor variants possibly adapted to a new ecological niche. Additionally, there was a dramatic increase in genetic diversity late in the experiment after a mutator phenotype evolved. Still finer-resolution details of the structure of genetic variation and how it changes over time in microbial evolution experiments will enable new applications and quantitative tests of population genetic theory.
Affiliation(s)
- J E Barrick
- Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI 48824, USA
|
38
|
Knudsen B, Miyamoto MM. Accurate and fast methods to estimate the population mutation rate from error prone sequences. BMC Bioinformatics 2009; 10:247. [PMID: 19671163 PMCID: PMC2746815 DOI: 10.1186/1471-2105-10-247] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 02/25/2009] [Accepted: 08/11/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The population mutation rate (theta) remains one of the most fundamental parameters in genetics, ecology, and evolutionary biology. However, its accurate estimation can be seriously compromised when working with error prone data such as expressed sequence tags, low coverage draft sequences, and other such unfinished products. This study is premised on the simple idea that a random sequence error due to a chance accident during data collection or recording will be distributed within a population dataset as a singleton (i.e., as a polymorphic site where one sampled sequence exhibits a unique base relative to the common nucleotide of the others). Thus, one can avoid these random errors by ignoring the singletons within a dataset. RESULTS This strategy is implemented under an infinite sites model that focuses on only the internal branches of the sample genealogy where a shared polymorphism can arise (i.e., a variable site where each alternative base is represented by at least two sequences). This approach is first used to derive independently the same new Watterson and Tajima estimators of theta, as recently reported by Achaz [1] for error prone sequences. It is then used to modify the recent, full, maximum-likelihood model of Knudsen and Miyamoto [2], which incorporates various factors for experimental error and design with those for coalescence and mutation. These new methods are all accurate and fast according to evolutionary simulations and analyses of a real complex population dataset for the California seahare. CONCLUSION In light of these results, we recommend the use of these three new methods for the determination of theta from error prone sequences. In particular, we advocate the new maximum likelihood model as a starting point for the further development of more complex coalescent/mutation models that also account for experimental error and design.
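As a rough illustration of the singleton-exclusion idea described in this abstract, the Python sketch below computes the classical Watterson estimator and a variant that discards singleton sites. It assumes the standard neutral infinite sites expectation E[xi_i] = theta/i for the number of sites with i derived alleles, and a folded definition of singletons (minor-allele count 1). Under that assumption the expected number of singleton sites is theta * n/(n-1), which is subtracted from the usual normalising constant. The function names and the toy alignment are hypothetical, and this is not the authors' implementation or their maximum-likelihood model.

```python
from collections import Counter


def harmonic(n):
    """Watterson's normalising constant a_n = sum_{i=1}^{n-1} 1/i."""
    return sum(1.0 / i for i in range(1, n))


def minor_allele_counts(alignment):
    """Minor-allele count of every biallelic column in an alignment.

    `alignment` is a list of equal-length strings, one per sampled sequence.
    Monomorphic columns and columns with more than two states are skipped.
    """
    counts = []
    for column in zip(*alignment):
        tally = Counter(column)
        if len(tally) == 2:
            counts.append(min(tally.values()))
    return counts


def theta_watterson(alignment):
    """Classical Watterson estimator: S / a_n, using all segregating sites."""
    n = len(alignment)
    s = len(minor_allele_counts(alignment))
    return s / harmonic(n)


def theta_watterson_no_singletons(alignment):
    """Watterson-type estimator that ignores folded singletons.

    Assumes E[number of folded singleton sites] = theta * n / (n - 1),
    so the remaining sites have expectation theta * (a_n - n / (n - 1)).
    """
    n = len(alignment)
    if n < 4:
        # For n <= 3 every segregating site is a folded singleton.
        raise ValueError("need at least 4 sequences")
    shared = sum(1 for c in minor_allele_counts(alignment) if c >= 2)
    return shared / (harmonic(n) - n / (n - 1.0))


if __name__ == "__main__":
    toy = ["AACA", "AACA", "ATGA", "ATGT"]  # hypothetical 4-sequence sample
    print(theta_watterson(toy), theta_watterson_no_singletons(toy))
```

In the toy alignment the last column is a singleton, so it counts towards the classical estimate but not towards the singleton-free one. If singletons are mostly sequencing errors, the second estimator should be less inflated, at the cost of discarding genuine rare variants.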
Affiliation(s)
- Michael M Miyamoto
- Department of Biology, Box 118525, University of Florida, Gainesville, Florida 32611-8525, USA
|
39
|
Downing T, Lynn DJ, Connell S, Lloyd AT, Bhuiyan AK, Silva P, Naqvi AN, Sanfo R, Sow RS, Podisi B, Hanotte O, O'Farrelly C, Bradley DG. Evidence of balanced diversity at the chicken interleukin 4 receptor alpha chain locus. BMC Evol Biol 2009; 9:136. [PMID: 19527513 PMCID: PMC3224688 DOI: 10.1186/1471-2148-9-136] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 02/19/2009] [Accepted: 06/15/2009] [Indexed: 01/30/2023] Open
Abstract
BACKGROUND The comparative analysis of genome sequences emerging for several avian species with the fully sequenced chicken genome enables the genome-wide investigation of selective processes in functionally important chicken genes. In particular, because of pathogenic challenges it is expected that genes involved in the chicken immune system are subject to particularly strong adaptive pressure. Signatures of selection detected by inter-species comparison may then be investigated at the population level in global chicken populations to highlight potentially relevant functional polymorphisms. RESULTS Comparative evolutionary analysis of chicken (Gallus gallus) and zebra finch (Taeniopygia guttata) genes identified interleukin 4 receptor alpha-chain (IL-4Ralpha), a key cytokine receptor as a candidate with a significant excess of substitutions at nonsynonymous sites, suggestive of adaptive evolution. Resequencing and detailed population genetic analysis of this gene in diverse village chickens from Asia and Africa, commercial broilers, and in outgroup species red jungle fowl (JF), grey JF, Ceylon JF, green JF, grey francolin and bamboo partridge, suggested elevated and balanced diversity across all populations at this gene, acting to preserve different high-frequency alleles at two nonsynonymous sites. CONCLUSION Haplotype networks indicate that red JF is the primary contributor of diversity at chicken IL-4Ralpha: the signature of variation observed here may be due to the effects of domestication, admixture and introgression, which produce high diversity. However, this gene is a key cytokine-binding receptor in the immune system, so balancing selection related to the host response to pathogens cannot be excluded.
Affiliation(s)
- Tim Downing
- Smurfit Institute of Genetics, Trinity College, University of Dublin, Dublin, Ireland.
|
40
|
Liu X, Maxwell TJ, Boerwinkle E, Fu YX. Inferring population mutation rate and sequencing error rate using the SNP frequency spectrum in a sample of DNA sequences. Mol Biol Evol 2009; 26:1479-90. [PMID: 19318520 DOI: 10.1093/molbev/msp059] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 12/25/2022] Open
Abstract
One challenge of analyzing samples of DNA sequences is to account for the nonnegligible polymorphisms produced by error when the sequencing error rate is high or the sample size is large. Specifically, those artificial sequence variations will bias the observed single nucleotide polymorphism (SNP) frequency spectrum, which in turn may further bias the estimators of the population mutation rate theta = 4N mu for diploids. In this paper, we propose a new approach based on the generalized least squares (GLS) method to estimate theta, given a SNP frequency spectrum in a random sample of DNA sequences from a population. With this approach, error rate epsilon can be either known or unknown. In the latter case, epsilon can be estimated given an estimation of theta. Using coalescent simulation, we compared our estimators with other estimators of theta. The results showed that the GLS estimators are more efficient than other theta estimators with error, and the estimation of epsilon is usable in practice when the theta per bp is small. We demonstrate the application of the estimators with 10-kb noncoding region sequence sampled from a human population and provide suggestions for choosing theta estimators with error.
Affiliation(s)
- Xiaoming Liu
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, TX, USA
|
41
|
Downing T, Lynn DJ, Connell S, Lloyd AT, Bhuiyan AKFH, Silva P, Naqvi AN, Sanfo R, Sow RS, Podisi B, O’Farrelly C, Hanotte O, Bradley DG. Contrasting evolution of diversity at two disease-associated chicken genes. Immunogenetics 2009; 61:303-14. [DOI: 10.1007/s00251-009-0359-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 10/17/2008] [Accepted: 01/30/2009] [Indexed: 11/28/2022]
|
42
|
Abstract
As random shotgun metagenomic projects proliferate and become the dominant source of publicly available sequence data, procedures for the best practices in their execution and analysis become increasingly important. Based on our experience at the Joint Genome Institute, we describe the chain of decisions accompanying a metagenomic project from the viewpoint of the bioinformatic analysis step by step. We guide the reader through a standard workflow for a metagenomic project beginning with presequencing considerations such as community composition and sequence data type that will greatly influence downstream analyses. We proceed with recommendations for sampling and data generation including sample and metadata collection, community profiling, construction of shotgun libraries, and sequencing strategies. We then discuss the application of generic sequence processing steps (read preprocessing, assembly, and gene prediction and annotation) to metagenomic data sets in contrast to genome projects. Different types of data analyses particular to metagenomes are then presented, including binning, dominant population analysis, and gene-centric analysis. Finally, data management issues are presented and discussed. We hope that this review will assist bioinformaticians and biologists in making better-informed decisions on their journey during a metagenomic project.
|
43
|
Guazzaroni ME, Beloqui A, Golyshin PN, Ferrer M. Metagenomics as a new technological tool to gain scientific knowledge. World J Microbiol Biotechnol 2009. [DOI: 10.1007/s11274-009-9971-z] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.4] [Indexed: 11/24/2022]
|
44
|
Wilmes P, Simmons SL, Denef VJ, Banfield JF. The dynamic genetic repertoire of microbial communities. FEMS Microbiol Rev 2008; 33:109-32. [PMID: 19054116 PMCID: PMC2704941 DOI: 10.1111/j.1574-6976.2008.00144.x] [Citation(s) in RCA: 84] [Impact Index Per Article: 4.9] [Indexed: 12/22/2022] Open
Abstract
Community genomic data have revealed multiple levels of variation between and within microbial consortia. This variation includes large-scale differences in gene content between ecosystems as well as within-population sequence heterogeneity. In the present review, we focus specifically on how fine-scale variation within microbial and viral populations is apparent from community genomic data. A major unresolved question is how much of the observed variation is due to neutral vs. adaptive processes. Limited experimental data hint that some of this fine-scale variation may be in part functionally relevant, whereas sequence-based and modeling analyses suggest that much of it may be neutral. While methods for interpreting population genomic data are still in their infancy, we discuss current interpretations of existing datasets in the light of evolutionary processes and models. Finally, we highlight the importance of virus–host dynamics in generating and shaping within-population diversity.
Affiliation(s)
- Paul Wilmes
- Department of Earth and Planetary Science, University of California at Berkeley, Berkeley, CA 94720, USA
|
45
|
Abstract
This article is concerned with statistical modeling of shotgun resequencing data and the use of such data for population genetic inference. We model data produced by sequencing-by-synthesis technologies such as the Solexa, 454, and polymerase colony (polony) systems, whose use is becoming increasingly widespread. We show how such data can be used to estimate evolutionary parameters (mutation and recombination rates), despite the fact that the data do not necessarily provide complete or aligned sequence information. We also present two refinements of our methods: one that is more robust to sequencing errors and another that can be used when no reference genome is available.
|
46
|
Simmons SL, DiBartolo G, Denef VJ, Goltsman DSA, Thelen MP, Banfield JF. Population genomic analysis of strain variation in Leptospirillum group II bacteria involved in acid mine drainage formation. PLoS Biol 2008; 6:e177. [PMID: 18651792 PMCID: PMC2475542 DOI: 10.1371/journal.pbio.0060177] [Citation(s) in RCA: 97] [Impact Index Per Article: 5.7] [Received: 02/19/2008] [Accepted: 06/12/2008] [Indexed: 12/20/2022] Open
Abstract
Deeply sampled community genomic (metagenomic) datasets enable comprehensive analysis of heterogeneity in natural microbial populations. In this study, we used sequence data obtained from the dominant member of a low-diversity natural chemoautotrophic microbial community to determine how coexisting closely related individuals differ from each other in terms of gene sequence and gene content, and to uncover evidence of evolutionary processes that occur over short timescales. DNA sequence obtained from an acid mine drainage biofilm was reconstructed, taking into account the effects of strain variation, to generate a nearly complete genome tiling path for a Leptospirillum group II species closely related to L. ferriphilum (sampling depth ∼20×). The population is dominated by one sequence type, yet we detected evidence for relatively abundant variants (>99.5% sequence identity to the dominant type) at multiple loci, and a few rare variants. Blocks of other Leptospirillum group II types (∼94% sequence identity) have recombined into one or more variants. Variant blocks of both types are more numerous near the origin of replication. Heterogeneity in genetic potential within the population arises from localized variation in gene content, typically focused in integrated plasmid/phage-like regions. Some laterally transferred gene blocks encode physiologically important genes, including quorum-sensing genes of the LuxIR system. Overall, results suggest inter- and intrapopulation genetic exchange involving distinct parental genome types and implicate gain and loss of phage and plasmid genes in recent evolution of this Leptospirillum group II population. Population genetic analyses of single nucleotide polymorphisms indicate variation between closely related strains is not maintained by positive selection, suggesting that these regions do not represent adaptive differences between strains. 
Thus, the most likely explanation for the observed patterns of polymorphism is divergence of ancestral strains due to geographic isolation, followed by mixing and subsequent recombination. Communities of microbes in nature consist of a large number of distinct individuals. The variation in DNA sequence between these individuals contains a record of the evolutionary processes that have shaped each community. In most environments, however, the high number of distinct species makes obtaining information about the nature of this variation difficult or impossible. We obtained large amounts of sequence data for a natural community in an acid mine drainage system consisting of only a few species. This enabled us to reconstruct the genome of the dominant bacterium (Leptospirillum group II) and obtain detailed information about sequence variation between individuals, including differences in both gene content and gene sequence. Our analysis shows extensive recombination between closely related populations, as well as fewer instances of recombination between more distantly related individuals. Additionally, viruses and plasmids account for high variability in gene content between individuals. We conclude that sequence-level variation in this population is maintained through neutral processes (migration, recombination, and genetic drift) rather than natural selection. This suggests that closely related strains of the Leptospirillum group II population may not be ecologically distinct. Deep sequencing of a low-complexity microbial community revealed extensive recombination as well as polymorphic and gene content variation between individuals of the dominant organism. We show that strains defined by linked polymorphisms are not maintained by positive selection; instead, they are predominantly maintained by the forces of migration and drift.
Affiliation(s)
- Sheri L Simmons
- Department of Earth and Planetary Science, University of California, Berkeley, Berkeley, California, United States of America
- Genevieve DiBartolo
- Department of Earth and Planetary Science, University of California, Berkeley, Berkeley, California, United States of America
- Vincent J Denef
- Department of Earth and Planetary Science, University of California, Berkeley, Berkeley, California, United States of America
- Daniela S. Aliaga Goltsman
- Department of Earth and Planetary Science, University of California, Berkeley, Berkeley, California, United States of America
- Michael P Thelen
- Chemistry Directorate, Lawrence Livermore National Laboratory, Livermore, California, United States of America
- Jillian F Banfield
- Department of Earth and Planetary Science, University of California, Berkeley, Berkeley, California, United States of America
|
47
|
van Elsas JD, Costa R, Jansson J, Sjöling S, Bailey M, Nalin R, Vogel TM, van Overbeek L. The metagenomics of disease-suppressive soils - experiences from the METACONTROL project. Trends Biotechnol 2008; 26:591-601. [PMID: 18774191 DOI: 10.1016/j.tibtech.2008.07.004] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.5] [Received: 03/30/2008] [Revised: 07/14/2008] [Accepted: 07/22/2008] [Indexed: 11/29/2022]
Abstract
Soil teems with microbial genetic information that can be exploited for biotechnological innovation. Because only a fraction of the soil microbiota is cultivable, our ability to unlock this genetic complement has been hampered. Recently developed molecular tools, which make it possible to utilize genomic DNA from soil, can bypass cultivation and provide information on the collective soil metagenome with the aim to explore genes that encode functions of key interest to biotechnology. The metagenome of disease-suppressive soils is of particular interest given the expected prevalence of antibiotic biosynthetic clusters. However, owing to the complexity of soil microbial communities, deciphering this key genetic information is challenging. Here, we examine crucial issues and challenges that so far have hindered the metagenomic exploration of soil by drawing on experience from a trans-European project on disease-suppressive soils denoted METACONTROL.
Affiliation(s)
- Jan Dirk van Elsas
- Department of Microbial Ecology, Centre for Ecological and Evolutionary Studies, University of Groningen, Kerklaan 30, 9750AA Haren, The Netherlands.
|
48
|
Four years of DNA barcoding: Current advances and prospects. Infection, Genetics and Evolution 2008; 8:727-36. [DOI: 10.1016/j.meegid.2008.05.005] [Citation(s) in RCA: 240] [Impact Index Per Article: 14.1] [Received: 01/09/2008] [Revised: 05/23/2008] [Accepted: 05/27/2008] [Indexed: 11/21/2022]
|
49
|
Hellmann I, Mang Y, Gu Z, Li P, de la Vega FM, Clark AG, Nielsen R. Population genetic analysis of shotgun assemblies of genomic sequences from multiple individuals. Genome Res 2008; 18:1020-9. [PMID: 18411405 PMCID: PMC2493391 DOI: 10.1101/gr.074187.107] [Citation(s) in RCA: 79] [Impact Index Per Article: 4.6] [Received: 11/08/2007] [Accepted: 04/07/2008] [Indexed: 01/25/2023]
Abstract
We introduce a simple, broadly applicable method for obtaining estimates of nucleotide diversity from genomic shotgun sequencing data. The method takes into account the special nature of these data: random sampling of genomic segments from one or more individuals and a relatively high error rate for individual reads. Applying this method to data from the Celera human genome sequencing and SNP discovery project, we obtain estimates of nucleotide diversity in windows spanning the human genome and show that the diversity to divergence ratio is reduced in regions of low recombination. Furthermore, we show that the elevated diversity in telomeric regions is mainly due to elevated mutation rates and not due to decreased levels of background selection. However, we find indications that telomeres as well as centromeres experience greater impact from natural selection than intrachromosomal regions. Finally, we identify a number of genomic regions with increased or reduced diversity compared with the local level of human-chimpanzee divergence and the local recombination rate.
Affiliation(s)
- Ines Hellmann
- Departments of Integrative Biology and Statistics, University of California, Berkeley, California 94720, USA.
|
50
|
Abstract
Supplementing metagenomic approaches to studying natural microbial communities with metatranscriptomics and metaproteomics should reap big dividends. Metagenomics, the application of random shotgun sequencing to environmental samples, is a powerful approach for characterizing microbial communities. However, this method only represents the cornerstone of what can be achieved using a range of complementary technologies such as transcriptomics, proteomics, cell sorting and microfluidics. Together, these approaches hold great promise for the study of microbial ecology and evolution.
Affiliation(s)
- Falk Warnecke
- Microbial Ecology Program, DOE Joint Genome Institute, Walnut Creek, CA 94598, USA
|