1
Xie S, Isaacs K, Becker G, Murdoch BM. A computational framework for improving genetic variants identification from 5,061 sheep sequencing data. J Anim Sci Biotechnol 2023; 14:127. [PMID: 37779189 PMCID: PMC10544426 DOI: 10.1186/s40104-023-00923-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/25/2023] [Accepted: 08/01/2023] [Indexed: 10/03/2023] Open
Abstract
BACKGROUND Pan-genomics is a recently emerging strategy that can provide a more comprehensive characterization of genetic variation. Joint calling is routinely used to combine identified variants across multiple related samples. However, the improvement in variant identification gained from the mutual support information of multiple samples remains quite limited for population-scale genotyping.
RESULTS In this study, we developed a computational framework for joint calling of genetic variants from 5,061 sheep by incorporating the sequencing error and optimizing the mutual support information from multiple samples' data. The variants were accurately identified from multiple samples in four steps: (1) probabilities of variants from two widely used algorithms, GATK and Freebayes, were calculated by a Poisson model incorporating the base sequencing error potential; (2) the variants with high mapping quality or consistently identified in at least two samples by GATK and Freebayes were used to construct the raw high-confidence identification (rHID) variants database; (3) the high-confidence variants identified in a single sample were ordered by probability value and controlled by false discovery rate (FDR) using the rHID database; (4) to avoid eliminating potentially true variants from the rHID database, the variants that failed FDR were reexamined to rescue potentially true variants and ensure highly accurate variant identification. The results indicated that the percentage of concordant SNPs and indels from Freebayes and GATK improved significantly, by 12%-32%, with our new method compared with raw variants. The method also advantageously found low-frequency variants in individual sheep involving several traits, including nipple number (GPC5), scrapie pathology (PAPSS2), seasonal reproduction and litter size (GRM1), coat color (RAB27A), and lentivirus susceptibility (TMEM154).
CONCLUSION The new method used a computational strategy to reduce the number of false positives while improving the identification of genetic variants. This strategy did not incur any extra cost from additional samples or sequencing data and advantageously identified rare variants, which can be important for practical applications in animal breeding.
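Step (3) of the abstract above, ordering single-sample calls by probability and controlling FDR against a high-confidence database, might be sketched roughly as follows. This is only an illustration and not the authors' implementation. It assumes that a smaller probability value means stronger support, that a call absent from the database counts as a putative false positive, and that the hypothetical names `filter_by_fdr`, `calls`, `high_confidence`, and `max_fdr` stand in for whatever the real pipeline uses.

```python
def filter_by_fdr(calls, high_confidence, max_fdr=0.05):
    """Keep the longest ranked prefix of calls whose estimated FDR stays within max_fdr.

    calls           -- list of (variant_id, p_value) pairs from one sample (hypothetical layout)
    high_confidence -- set of variant ids standing in for the rHID database
    max_fdr         -- assumed FDR threshold; the abstract does not state a value
    """
    # Assumption: a smaller p_value means stronger support, so sort ascending.
    ranked = sorted(calls, key=lambda call: call[1])
    keep = 0          # length of the longest prefix that satisfies the threshold
    unsupported = 0   # calls so far that are absent from the high-confidence set
    for rank, (variant, _p) in enumerate(ranked, start=1):
        if variant not in high_confidence:
            unsupported += 1
        # Estimated FDR at this rank: share of unsupported calls in the prefix.
        if unsupported / rank <= max_fdr:
            keep = rank
    return [variant for variant, _p in ranked[:keep]]


calls = [("v1", 0.001), ("v2", 0.002), ("v3", 0.01), ("v4", 0.2)]
accepted = filter_by_fdr(calls, high_confidence={"v1", "v2", "v3"}, max_fdr=0.2)
print(accepted)
```

Calls that fall outside the accepted prefix would, per step (4), be reexamined rather than discarded outright; that rescue step is not modeled here.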
Affiliation(s)
- Shangqian Xie
- Department of Animal, Veterinary & Food Sciences, University of Idaho, Moscow, ID, USA
- Gabrielle Becker
- Department of Animal, Veterinary & Food Sciences, University of Idaho, Moscow, ID, USA
- Brenda M Murdoch
- Department of Animal, Veterinary & Food Sciences, University of Idaho, Moscow, ID, USA.
2
Bagal UR, Gade L, Benedict K, Howell V, Christophe N, Gibbons-Burgener S, Hallyburton S, Ireland M, McCracken S, Metobo AK, Signs K, Warren KA, Litvintseva AP, Chow NA. A Phylogeographic Description of Histoplasma capsulatum in the United States. J Fungi (Basel) 2023; 9:884. [PMID: 37754992 PMCID: PMC10532573 DOI: 10.3390/jof9090884] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/02/2023] [Revised: 08/23/2023] [Accepted: 08/24/2023] [Indexed: 09/28/2023] Open
Abstract
Histoplasmosis is one of the most under-diagnosed and under-reported endemic mycoses in the United States. Histoplasma capsulatum is the causative agent of this disease. To date, molecular epidemiologic studies detailing the phylogeographic structure of H. capsulatum in the United States have been limited. We conducted genomic sequencing using isolates from histoplasmosis cases reported in the United States. We identified North American Clade 2 (NAm2) as the most prevalent clade in the country. Despite high intra-clade diversity, isolates from Minnesota and Michigan cases were predominately clustered by state. Future work incorporating environmental sampling and veterinary surveillance may further elucidate the molecular epidemiology of H. capsulatum in the United States and how genomic sequencing can be applied to the surveillance and outbreak investigation of histoplasmosis.
Affiliation(s)
- Ujwal R. Bagal
- Mycotic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- ASRT Inc., Atlanta, GA 30080, USA
- Lalitha Gade
- Mycotic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Kaitlin Benedict
- Mycotic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Victoria Howell
- Kentucky Department for Public Health, Frankfort, KY 40601, USA
- Malia Ireland
- Minnesota Department of Health, St. Paul, MN 55101, USA
- Kimberly Signs
- Michigan Department of Health and Human Services, Lansing, MI 48933, USA
- Nancy A. Chow
- Mycotic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
3
Alvarado-Cerón V, Muñiz-Castillo AI, León-Pech MG, Prada C, Arias-González JE. A decade of population genetics studies of scleractinian corals: A systematic review. Mar Environ Res 2023; 183:105781. [PMID: 36371949 DOI: 10.1016/j.marenvres.2022.105781] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/09/2022] [Revised: 10/11/2022] [Accepted: 10/12/2022] [Indexed: 06/16/2023]
Abstract
Coral reefs are the most diverse marine ecosystems. However, coral cover has decreased worldwide due to natural disturbances, climate change, and local anthropogenic drivers. In recent decades, various genetic methods and molecular markers have been developed to assess genetic diversity, structure, and connectivity in different coral species to determine the vulnerability of their populations. This review aims to identify population genetic studies of scleractinian corals in the last decade (2010-2020), and the techniques and molecular markers used. Bibliometric analysis was conducted to identify journals and authors working in this field. We then calculated the number of genetic studies by species and ecoregion based on data obtained from 178 studies found in Scopus and Web of Science. Coral Reefs and Molecular Ecology were the main journals publishing population genetics studies, and microsatellites were the most widely used molecular markers. The Caribbean, Australian Barrier Reef, and South Kuroshio in Japan are among the ecoregions with the most population genetics data. In contrast, we found limited information about the Coral Triangle, a region with the highest biodiversity and key to coral reef conservation. Notably, only 117 (out of 1500 described) scleractinian coral species have genetic studies. This review emphasizes which coral species have been studied and highlights remaining gaps and locations where such data are critical for coral conservation.
Affiliation(s)
- Viridiana Alvarado-Cerón
- Departamento de Recursos del Mar, Centro de Investigación y de Estudios Avanzados del I.P.N., Unidad Mérida. Km. 6 Antigua carretera a Progreso, Cordemex, 97310, Mérida, Yucatán, Mexico.
- Aarón Israel Muñiz-Castillo
- Departamento de Recursos del Mar, Centro de Investigación y de Estudios Avanzados del I.P.N., Unidad Mérida. Km. 6 Antigua carretera a Progreso, Cordemex, 97310, Mérida, Yucatán, Mexico.
- María Geovana León-Pech
- Department of Biological Science, University of Rhode Island, 120 Flag Road, Kingston, RI, 02881, USA.
- Carlos Prada
- Department of Biological Science, University of Rhode Island, 120 Flag Road, Kingston, RI, 02881, USA.
- Jesús Ernesto Arias-González
- Departamento de Recursos del Mar, Centro de Investigación y de Estudios Avanzados del I.P.N., Unidad Mérida. Km. 6 Antigua carretera a Progreso, Cordemex, 97310, Mérida, Yucatán, Mexico.
4
Lefouili M, Nam K. The evaluation of Bcftools mpileup and GATK HaplotypeCaller for variant calling in non-human species. Sci Rep 2022; 12:11331. [PMID: 35790846 PMCID: PMC9256665 DOI: 10.1038/s41598-022-15563-2] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 09/22/2021] [Accepted: 06/27/2022] [Indexed: 11/09/2022] Open
Abstract
Identification of genetic variations is a central part of population and quantitative genomics studies based on high-throughput sequencing data. Even though popular variant callers such as Bcftools mpileup and GATK HaplotypeCaller were developed nearly 10 years ago, their performance is still largely unknown for non-human species. Here, we showed by benchmark analyses with a simulated insect population that Bcftools mpileup performs better than GATK HaplotypeCaller in terms of recovery rate and accuracy regardless of mapping software. The vast majority of false positives were observed from repeats, especially for GATK HaplotypeCaller. Variant scores calculated by GATK did not clearly distinguish true positives from false positives in the vast majority of cases, implying that hard-filtering with GATK could be challenging. These results suggest that Bcftools mpileup may be the first choice for non-human studies and that variants within repeats might have to be excluded for downstream analyses.
Affiliation(s)
- Kiwoong Nam
- DGIMI, Univ Montpellier, INRAE, Montpellier, France.
5
Bagal UR, Phan J, Welsh RM, Misas E, Wagner D, Gade L, Litvintseva AP, Cuomo CA, Chow NA. MycoSNP: A Portable Workflow for Performing Whole-Genome Sequencing Analysis of Candida auris. Methods Mol Biol 2022; 2517:215-228. [PMID: 35674957 DOI: 10.1007/978-1-0716-2417-3_17] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 06/15/2023]
Abstract
Candida auris is an urgent public health threat characterized by high rates of drug resistance and rapid spread in healthcare settings worldwide. As part of the C. auris response, molecular surveillance has helped public health officials track the global spread and investigate local outbreaks. Here, we describe whole-genome sequencing analysis methods used for routine C. auris molecular surveillance in the United States; methods include reference selection, reference preparation, quality assessment and control of sequencing reads, read alignment, and single-nucleotide polymorphism calling and filtration. We also describe the newly developed pipeline MycoSNP, a portable workflow for performing whole-genome sequencing analysis of fungal organisms including C. auris.
Affiliation(s)
- Ujwal R Bagal
- Mycotic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA
- John Phan
- Centers for Disease Control and Prevention, Atlanta, GA, USA
- Rory M Welsh
- Mycotic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Elizabeth Misas
- Mycotic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Lalitha Gade
- Mycotic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Christina A Cuomo
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Nancy A Chow
- Mycotic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA.
6
Technological Improvements in the Genetic Diagnosis of Rett Syndrome Spectrum Disorders. Int J Mol Sci 2021; 22:ijms221910375. [PMID: 34638716 PMCID: PMC8508637 DOI: 10.3390/ijms221910375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2021] [Revised: 09/17/2021] [Accepted: 09/22/2021] [Indexed: 11/17/2022] Open
Abstract
Rett syndrome (RTT) is a severe neurodevelopmental disorder that constitutes the second most common cause of intellectual disability in females worldwide. In the past few years, the advancements in genetic diagnosis brought by next generation sequencing (NGS) have made it possible to identify more than 90 causative genes for RTT and significantly overlapping phenotypes (RTT spectrum disorders). Therefore, the clinical entity known as RTT is evolving towards a spectrum of overlapping phenotypes with great genetic heterogeneity. Hence, simultaneous multiple gene testing and thorough phenotypic characterization are mandatory to achieve a fast and accurate genetic diagnosis. In this review, we review the evolution of the diagnostic process of RTT spectrum disorders in the past decades, and we discuss the effectiveness of state-of-the-art genetic testing options, such as clinical exome sequencing and whole exome sequencing. Moreover, we introduce recent technological advancements that will very soon contribute to the increase in diagnostic yield in patients with RTT spectrum disorders. Techniques such as whole genome sequencing, integration of data from several “omics”, and mosaicism assessment will provide the tools for the detection and interpretation of genomic variants that will not only increase the diagnostic yield but also widen knowledge about the pathophysiology of these disorders.
7
Nisar H, Wajid B, Shahid S, Anwar F, Wajid I, Khatoon A, Sattar MU, Sadaf S. Whole-genome sequencing as a first-tier diagnostic framework for rare genetic diseases. Exp Biol Med (Maywood) 2021; 246:2610-2617. [PMID: 34521224 DOI: 10.1177/15353702211040046] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 11/17/2022] Open
Abstract
Rare diseases affect nearly 300 million people globally, with most patients aged five or younger. Traditional diagnostic approaches have provided much of the diagnosis; however, there are limitations. For instance, inadequate and untimely diagnosis adversely affects both patients and their families. This review advocates the use of whole genome sequencing in clinical settings for diagnosis of rare genetic diseases by showcasing five case studies. These examples specifically describe the utilization of whole genome sequencing, which helped in providing relief to patients via correct diagnosis followed by use of precision medicine.
Affiliation(s)
- Haseeb Nisar
- Office of Research, Innovation and Commercialization, University of Management and Technology, Lahore 54000, Pakistan; School of Biochemistry & Biotechnology, University of the Punjab, Lahore 54000, Pakistan
- Bilal Wajid
- Department of Electrical Engineering, University of Engineering and Technology, Lahore 54000, Pakistan; Ibn Sina Research & Development Division, Sabz-Qalam, Lahore 54000, Pakistan; Department of Computer Sciences, University of Management and Technology, Lahore 54000, Pakistan
- Samiah Shahid
- Institute of Molecular Biology and Biotechnology, The University of Lahore, Lahore 54000, Pakistan
- Faria Anwar
- Out Patient Department, Mayo Hospital, Lahore 54000, Pakistan
- Imran Wajid
- Ibn Sina Research & Development Division, Sabz-Qalam, Lahore 54000, Pakistan
- Asia Khatoon
- School of Biochemistry & Biotechnology, University of the Punjab, Lahore 54000, Pakistan
- Mian Usman Sattar
- Institute of Social Sciences, Istanbul Commerce University, Istanbul, Turkey
- Saima Sadaf
- School of Biochemistry & Biotechnology, University of the Punjab, Lahore 54000, Pakistan
8
Zanti M, Michailidou K, Loizidou MA, Machattou C, Pirpa P, Christodoulou K, Spyrou GM, Kyriacou K, Hadjisavvas A. Performance evaluation of pipelines for mapping, variant calling and interval padding, for the analysis of NGS germline panels. BMC Bioinformatics 2021; 22:218. [PMID: 33910496 PMCID: PMC8080428 DOI: 10.1186/s12859-021-04144-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/16/2021] [Accepted: 04/15/2021] [Indexed: 11/10/2022] Open
Abstract
Background Next-generation sequencing (NGS) represents a significant advancement in clinical genetics. However, its use creates several technical, data interpretation and management challenges. It is essential to follow a consistent data analysis pipeline to achieve the highest possible accuracy and avoid false variant calls. Herein, we aimed to compare the performance of twenty-eight combinations of NGS data analysis pipeline compartments, including short-read mapping (BWA-MEM, Bowtie2, Stampy), variant calling (GATK-HaplotypeCaller, GATK-UnifiedGenotyper, SAMtools) and interval padding (null, 50 bp, 100 bp) methods, along with a commercially available pipeline (BWA Enrichment, Illumina®). Fourteen germline DNA samples from breast cancer patients were sequenced using a targeted NGS panel approach and subjected to data analysis.
Results We highlight that interval padding is required for the accurate detection of intronic variants including spliceogenic pathogenic variants (PVs). In addition, using nearly default parameters, the BWA Enrichment algorithm failed to detect these spliceogenic PVs and a missense PV in the TP53 gene. We also recommend the BWA-MEM algorithm for sequence alignment, whereas variant calling should be performed using a combination of variant calling algorithms: GATK-HaplotypeCaller and SAMtools for the accurate detection of insertions/deletions, and GATK-UnifiedGenotyper for the efficient detection of single nucleotide variant calls.
Conclusions These findings have important implications for the identification of clinically actionable variants through panel testing in a clinical laboratory setting, where dedicated bioinformatics personnel might not always be available. The results also reveal the necessity of improving the existing tools and/or developing new pipelines to generate more reliable and more consistent data.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04144-1.
Affiliation(s)
- Maria Zanti
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Bioinformatics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyriaki Michailidou
- Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Biostatistics Unit, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Maria A Loizidou
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
- Christina Machattou
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Panagiota Pirpa
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyproula Christodoulou
- Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Neurogenetics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- George M Spyrou
- Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus; Bioinformatics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyriacos Kyriacou
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
- Andreas Hadjisavvas
- Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus; Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus.
9
Molina-Mora JA, Solano-Vargas M. Set-theory based benchmarking of three different variant callers for targeted sequencing. BMC Bioinformatics 2021; 22:20. [PMID: 33413082 PMCID: PMC7791862 DOI: 10.1186/s12859-020-03926-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2019] [Accepted: 12/09/2020] [Indexed: 12/05/2022] Open
Abstract
Background Next generation sequencing (NGS) technologies have improved the study of hereditary diseases. Since the evaluation of bioinformatics pipelines is not straightforward, NGS demands effective strategies to analyze data that is of paramount relevance for decision making under a clinical scenario. According to the benchmarking framework of the Global Alliance for Genomics and Health (GA4GH), we implemented a new simple and user-friendly set-theory based method to assess variant callers using a gold standard variant set and high confidence regions. As a model, we used TruSight Cardio kit sequencing data of the reference genome NA12878. This targeted sequencing kit is used to identify variants in key genes related to Inherited Cardiac Conditions (ICCs), a group of cardiovascular diseases with high rates of morbidity and mortality. Results We implemented and compared three variant calling pipelines (Isaac, Freebayes, and VarScan). Performance metrics using our set-theory approach showed high-resolution pipelines and revealed: (1) a perfect recall of 1.000 for all three pipelines, (2) very high precision values, i.e. 0.987 for Freebayes, 0.928 for VarScan, and 1.000 for Isaac, when compared with the reference material, and (3) a ROC curve analysis with AUC > 0.94 for all cases. Moreover, significant differences were obtained between the three pipelines. In general, results indicate that the three pipelines were able to recognize the expected variants in the gold standard data set. Conclusions Our set-theory approach to calculate metrics was able to identify the expected ICCs related variants by the three selected pipelines, but results were completely dependent on the algorithms. We emphasize the importance of assessing pipelines using gold standard materials to achieve the most reliable results for clinical application.
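The set-theory idea in this abstract can be illustrated with a minimal sketch, assuming each variant is represented as a `(chrom, pos, ref, alt)` tuple and that both sets have already been restricted to the high confidence regions. The function name `benchmark_calls` and the example variants are hypothetical; this is not the authors' code.

```python
def benchmark_calls(called, truth):
    """Compare a caller's variant set with a gold standard set using set operations."""
    true_pos = called & truth    # in both sets
    false_pos = called - truth   # called but not in the gold standard
    false_neg = truth - called   # in the gold standard but missed
    precision = len(true_pos) / len(called) if called else 0.0
    recall = len(true_pos) / len(truth) if truth else 0.0
    return {
        "tp": len(true_pos),
        "fp": len(false_pos),
        "fn": len(false_neg),
        "precision": precision,
        "recall": recall,
    }


# Hypothetical example: the caller recovers both gold standard variants plus one extra call.
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
called = truth | {("chr2", 50, "G", "A")}
metrics = benchmark_calls(called, truth)
print(metrics)
```

Representing variants as hashable tuples keeps intersection and difference exact; real benchmarking tools additionally normalize variant representation before comparison, which this sketch leaves out.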
Affiliation(s)
- Jose Arturo Molina-Mora
- Centro de Investigación en Enfermedades Tropicales (CIET) and Facultad de Microbiología, Universidad de Costa Rica (UCR), San José, Costa Rica; Centro de Investigaciones en Hematología y Transtornos Afines (CIHATA), Universidad de Costa Rica (UCR), San José, Costa Rica.
- Mariela Solano-Vargas
- Centro de Investigaciones en Hematología y Transtornos Afines (CIHATA), Universidad de Costa Rica (UCR), San José, Costa Rica
10
Alosaimi S, van Biljon N, Awany D, Thami PK, Defo J, Mugo JW, Bope CD, Mazandu GK, Mulder NJ, Chimusa ER. Simulation of African and non-African low and high coverage whole genome sequence data to assess variant calling approaches. Brief Bioinform 2020; 22:6042242. [PMID: 33341897 DOI: 10.1093/bib/bbaa366] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/25/2020] [Revised: 11/14/2020] [Accepted: 01/08/2020] [Indexed: 12/15/2022] Open
Abstract
Current variant calling (VC) approaches have been designed to leverage populations of long-range haplotypes and were benchmarked using populations of European descent, whereas most genetic diversity is found in non-European populations, such as African populations. Working with these genetically diverse populations, VC tools may produce false positive and false negative results, which may produce misleading conclusions in prioritization of mutations, clinical relevancy and actionability of genes. The most prominent question is which tool or pipeline has a high rate of sensitivity and precision when analysing African data with either low or high sequence coverage, given the high genetic diversity and heterogeneity of this data. Here, a total of 100 synthetic Whole Genome Sequencing (WGS) samples, mimicking the genetic profile of African and European subjects for different specific coverage levels (high/low), have been generated to assess the performance of nine different VC tools on these contrasting datasets. The performances of these tools were assessed in false positive and false negative call rates by comparing the simulated golden variants to the variants identified by each VC tool. Combining our results on sensitivity and positive predictive value (PPV), VarDict [PPV = 0.999 and Matthews correlation coefficient (MCC) = 0.832] and BCFtools (PPV = 0.999 and MCC = 0.813) perform best when using African population data on high and low coverage data. Overall, current VC tools produce high false positive and false negative rates when analysing African compared with European data. This highlights the need for development of VC approaches with high sensitivity and precision tailored for populations characterized by high genetic variations and low linkage disequilibrium.
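For readers unfamiliar with the two metrics quoted in this abstract, the following sketch computes PPV and MCC from confusion-matrix counts using their standard definitions. The counts are made up for illustration and are not taken from the study; the function names `ppv` and `mcc` are hypothetical.

```python
import math


def ppv(tp, fp):
    """Positive predictive value (precision): TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up counts for illustration only.
tp, fp, tn, fn = 90, 10, 890, 10
example_ppv = ppv(tp, fp)
example_mcc = mcc(tp, fp, tn, fn)
print(round(example_ppv, 3), round(example_mcc, 3))
```

Unlike PPV, MCC uses all four counts, so a caller that misses many true variants can score a PPV near 1 and still have a noticeably lower MCC, which is the pattern in the values reported above.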
Affiliation(s)
- Shatha Alosaimi
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
- Noëlle van Biljon
- Department of Statistical Sciences, University of Cape Town, Cape Town, South Africa
- Denis Awany
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
- Prisca K Thami
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
- Joel Defo
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa
- Jacquiline W Mugo
- Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa
- Christian D Bope
- Faculty of Sciences, Department of Mathematics and Computer Science, University of Kinshasa, Kinshasa, DRC
- Gaston K Mazandu
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa; Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa
- Nicola J Mulder
- Faculty of Health Sciences, Division of Computational Biology, Department of Biomedical Sciences, University of Cape Town, Cape Town, South Africa; Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Anzio Road, Observatory, Cape Town 7925, South Africa
- Emile R Chimusa
- Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa; Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Anzio Road, Observatory, Cape Town 7925, South Africa
11
Brandies PA, Wright BR, Hogg CJ, Grueber CE, Belov K. Characterization of reproductive gene diversity in the endangered Tasmanian devil. Mol Ecol Resour 2020; 21:721-732. [PMID: 33188658 DOI: 10.1111/1755-0998.13295] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/25/2020] [Revised: 10/25/2020] [Accepted: 11/05/2020] [Indexed: 01/11/2023]
Abstract
Interindividual variation at genes known to play a role in reproduction may impact reproductive fitness. The Tasmanian devil is an endangered Australian marsupial with low genetic diversity. Recent work has shown concerning declines in productivity in both wild and captive populations over time. Understanding whether functional diversity exists at reproductive genes in the Tasmanian devil is a key first step in identifying genes that may influence productivity. We characterized single nucleotide polymorphisms (SNPs) at 214 genes involved in reproduction in 37 Tasmanian devils. Twenty genes contained nonsynonymous substitutions, with genes involved in embryogenesis, fertilization and hormonal regulation of reproduction displaying greater numbers of nonsynonymous SNPs than synonymous SNPs. Two genes, ADAMTS9 and NANOG, showed putative signatures of balancing selection indicating that natural selection is maintaining diversity at these genes despite the species exhibiting low overall levels of genetic diversity. We will use this information in future to examine the interplay between reproductive gene variation and reproductive fitness in Tasmanian devil populations.
Affiliation(s)
- Parice A Brandies
- School of Life and Environmental Sciences, Faculty of Science, University of Sydney, Sydney, NSW, Australia
- Belinda R Wright
- School of Life and Environmental Sciences, Faculty of Science, University of Sydney, Sydney, NSW, Australia
- Carolyn J Hogg
- School of Life and Environmental Sciences, Faculty of Science, University of Sydney, Sydney, NSW, Australia
- Catherine E Grueber
- School of Life and Environmental Sciences, Faculty of Science, University of Sydney, Sydney, NSW, Australia; San Diego Zoo Global, San Diego, CA, USA
- Katherine Belov
- School of Life and Environmental Sciences, Faculty of Science, University of Sydney, Sydney, NSW, Australia
12
Li X, Yang J, Shen M, Xie XL, Liu GJ, Xu YX, Lv FH, Yang H, Yang YL, Liu CB, Zhou P, Wan PC, Zhang YS, Gao L, Yang JQ, Pi WH, Ren YL, Shen ZQ, Wang F, Deng J, Xu SS, Salehian-Dehkordi H, Hehua E, Esmailizadeh A, Dehghani-Qanatqestani M, Štěpánek O, Weimann C, Erhardt G, Amane A, Mwacharo JM, Han JL, Hanotte O, Lenstra JA, Kantanen J, Coltman DW, Kijas JW, Bruford MW, Periasamy K, Wang XH, Li MH. Whole-genome resequencing of wild and domestic sheep identifies genes associated with morphological and agronomic traits. Nat Commun 2020; 11:2815. [PMID: 32499537 PMCID: PMC7272655 DOI: 10.1038/s41467-020-16485-1] [Citation(s) in RCA: 178] [Impact Index Per Article: 35.6] [Received: 11/06/2019] [Accepted: 05/04/2020] [Indexed: 01/15/2023] Open
Abstract
Understanding the genetic changes underlying phenotypic variation in sheep (Ovis aries) may facilitate our efforts towards further improvement. Here, we report the deep resequencing of 248 sheep including the wild ancestor (O. orientalis), landraces, and improved breeds. We explored the sheep variome and selection signatures. We detected genomic regions harboring genes associated with distinct morphological and agronomic traits, which may be past and potential future targets of domestication, breeding, and selection. Furthermore, we found non-synonymous mutations in a set of plausible candidate genes and significant differences in their allele frequency distributions across breeds. We identified PDGFD as a likely causal gene for fat deposition in the tails of sheep through transcriptome, RT-PCR, qPCR, and Western blot analyses. Our results provide insights into the demographic history of sheep and a valuable genomic resource for future genetic studies and improved genome-assisted breeding of sheep and other domestic animals.
Affiliation(s)
- Xin Li
- CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences (CAS), Beijing, 100101, China
- University of Chinese Academy of Sciences (UCAS), Beijing, 100049, China
- Ji Yang
- CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences (CAS), Beijing, 100101, China
- College of Animal Science and Technology, China Agricultural University, Beijing, 100193, China
- Min Shen
- Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- Xing-Long Xie
- CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences (CAS), Beijing, 100101, China
- University of Chinese Academy of Sciences (UCAS), Beijing, 100049, China
- Guang-Jian Liu
- Novogene Bioinformatics Institute, Beijing, 100083, China
- Ya-Xi Xu
- CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences (CAS), Beijing, 100101, China
- College of Animal Science and Technology, China Agricultural University, Beijing, 100193, China
- Feng-Hua Lv
- CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences (CAS), Beijing, 100101, China
- College of Animal Science and Technology, China Agricultural University, Beijing, 100193, China
- Hua Yang
- Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- Yong-Lin Yang
- Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- Chang-Bin Liu
- Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- Ping Zhou
- Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- Peng-Cheng Wan
- Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- Yun-Sheng Zhang
- Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- Lei Gao
- Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- Jing-Quan Yang
- Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- Wen-Hui Pi
- Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China
- Yan-Ling Ren
- Shandong Binzhou Academy of Animal Science and Veterinary Medicine, Binzhou, 256600, China
- Zhi-Qiang Shen
- Shandong Binzhou Academy of Animal Science and Veterinary Medicine, Binzhou, 256600, China
- Feng Wang
- Institute of Sheep and Goat Science, Nanjing Agricultural University, Nanjing, 210095, China
- Juan Deng
- CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences (CAS), Beijing, 100101, China
- College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, 611130, China
- Song-Song Xu
- CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences (CAS), Beijing, 100101, China
- University of Chinese Academy of Sciences (UCAS), Beijing, 100049, China
- Hosein Salehian-Dehkordi
- CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences (CAS), Beijing, 100101, China
- University of Chinese Academy of Sciences (UCAS), Beijing, 100049, China
- Eer Hehua
- Grass-Feeding Livestock Engineering Technology Research Center, Ningxia Academy of Agriculture and Forestry Sciences, Yinchuan, China
- Ali Esmailizadeh
- Department of Animal Science, Faculty of Agriculture, Shahid Bahonar University of Kerman, Kerman, Iran
- Ondřej Štěpánek
- Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague 4, Czech Republic
- Christina Weimann
- Institute of Animal Breeding and Genetics, Justus Liebig University, Giessen, Germany
- Georg Erhardt
- Institute of Animal Breeding and Genetics, Justus Liebig University, Giessen, Germany
- Agraw Amane
- Department of Microbial, Cellular and Molecular Biology, Addis Ababa University, Addis Ababa, Ethiopia
- LiveGene Program, International Livestock Research Institute, Addis Ababa, Ethiopia
- Joram M Mwacharo
- Small Ruminant Genomics, International Centre for Agricultural Research in the Dry Areas (ICARDA), Addis Ababa, Ethiopia
- Jian-Lin Han
- CAAS-ILRI Joint Laboratory on Livestock and Forage Genetic Resources, Institute of Animal Science, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
- Livestock Genetics Program, International Livestock Research Institute (ILRI), Nairobi, Kenya
- Olivier Hanotte
- LiveGene Program, International Livestock Research Institute, Addis Ababa, Ethiopia
- School of Life Sciences, University of Nottingham, University Park, Nottingham, NG7 2RD, UK
- Center for Tropical Livestock Genetics and Health (CTLGH), the Roslin Institute, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, Scotland, UK
- Johannes A Lenstra
- Faculty of Veterinary Medicine, Utrecht University, Utrecht, the Netherlands
- Juha Kantanen
- Production Systems, Natural Resources Institute Finland (Luke), FI-31600, Jokioinen, Finland
- David W Coltman
- Department of Biological Sciences, University of Alberta, Edmonton, Alberta, T6G 2E9, Canada
- James W Kijas
- CSIRO Livestock Industries, St Lucia, Brisbane, QLD, Australia
- Michael W Bruford
- School of Biosciences, Cardiff University, Cathays Park, Cardiff, CF10 3AX, Wales, UK
- Sustainable Places Research Institute, Cardiff University, CF10 3BA, Cardiff, Wales, UK
- Kathiravan Periasamy
- Animal Production and Health Laboratory, Joint FAO/IAEA Division of Nuclear Techniques in Food and Agriculture, International Atomic Energy Agency, Vienna, Austria
- Xin-Hua Wang
- Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China.
- State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Sciences, Shihezi, 832000, China.
- Meng-Hua Li
- CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences (CAS), Beijing, 100101, China.
- College of Animal Science and Technology, China Agricultural University, Beijing, 100193, China.
13
Bewicke-Copley F, Arjun Kumar E, Palladino G, Korfi K, Wang J. Applications and analysis of targeted genomic sequencing in cancer studies. Comput Struct Biotechnol J 2019; 17:1348-1359. [PMID: 31762958 PMCID: PMC6861594 DOI: 10.1016/j.csbj.2019.10.004] [Citation(s) in RCA: 90] [Impact Index Per Article: 15.0] [Received: 03/31/2019] [Revised: 10/18/2019] [Accepted: 10/22/2019] [Indexed: 12/31/2022]
Abstract
Next Generation Sequencing (NGS) has dramatically improved the flexibility and outcomes of cancer research and clinical trials, providing highly sensitive and accurate high-throughput platforms for large-scale genomic testing. In contrast to whole-genome (WGS) or whole-exome sequencing (WES), targeted genomic sequencing (TS) focuses on a panel of genes or targets known to have strong associations with pathogenesis of disease and/or clinical relevance, offering greater sequencing depth with reduced costs and data burden. This allows targeted sequencing to identify low frequency variants in targeted regions with high confidence, thus suitable for profiling low-quality and fragmented clinical DNA samples. As a result, TS has been widely used in clinical research and trials for patient stratification and the development of targeted therapeutics. However, its transition to routine clinical use has been slow. Many technical and analytical obstacles still remain and need to be discussed and addressed before large-scale and cross-centre implementation. Gold-standard and state-of-the-art procedures and pipelines are urgently needed to accelerate this transition. In this review we first present how TS is conducted in cancer research, including various target enrichment platforms, the construction of target panels, and selected research and clinical studies utilising TS to profile clinical samples. We then present a generalised analytical workflow for TS data discussing important parameters and filters in detail, aiming to provide the best practices of TS usage and analyses.
Key Words
- BAM, Binary Alignment Map
- BWA, Burrows-Wheeler Aligner
- Background error
- CLL, Chronic Lymphocytic Leukaemia
- COSMIC, Catalogue of Somatic Mutations in Cancer
- Cancer genomics
- Clinical samples
- ESP, Exome Sequencing Project
- FF, Fresh Frozen
- FFPE, Formalin Fixed Paraffin Embedded
- FL, Follicular Lymphoma
- GATK, Genome Analysis Toolkit
- ICGC, International Cancer Genome Consortium
- MBC, Molecular Barcode
- NCCN, the National Comprehensive Cancer Network®
- NGS, Next Generation Sequencing
- NHL, Non-Hodgkin Lymphoma
- NSCLC, Non-Small Cell Lung Carcinoma
- PCR duplicates
- QC, Quality Control
- SAM, Sequence Alignment Map
- TCGA, The Cancer Genome Atlas
- TS, Targeted Sequencing
- Targeted sequencing
- UMI, Unique Molecular Identifiers
- VAF, Variant Allele Frequency
- Variant calling
- WES, Whole Exome Sequencing
- WGS, Whole Genome Sequencing
- tFL, Transformed Follicular Lymphoma
Affiliation(s)
- Findlay Bewicke-Copley
- Centre for Cancer Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK
- Emil Arjun Kumar
- Centre for Cancer Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK
- Centre for Haemato-Oncology, Barts Cancer Institute, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK
- Giuseppe Palladino
- Centre for Haemato-Oncology, Barts Cancer Institute, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK
- Koorosh Korfi
- Centre for Cancer Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK
- Jun Wang
- Centre for Cancer Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK
14
Brandies P, Peel E, Hogg CJ, Belov K. The Value of Reference Genomes in the Conservation of Threatened Species. Genes (Basel) 2019; 10:E846. [PMID: 31717707 PMCID: PMC6895880 DOI: 10.3390/genes10110846] [Citation(s) in RCA: 82] [Impact Index Per Article: 13.7] [Received: 10/18/2019] [Revised: 10/18/2019] [Accepted: 10/23/2019] [Indexed: 12/17/2022]
Abstract
Conservation initiatives are now more crucial than ever: over a million plant and animal species are at risk of extinction over the coming decades. The genetic management of threatened species held in insurance programs is recommended; however, few are taking advantage of the full range of genomic technologies available today. Less than 1% of the 13505 species currently listed as threatened by the International Union for Conservation of Nature (IUCN) have a published genome. While there has been much discussion in the literature about the importance of genomics for conservation, there are limited examples of how having a reference genome has changed conservation management practice. The Tasmanian devil (Sarcophilus harrisii) is an endangered Australian marsupial, threatened by an infectious clonal cancer, devil facial tumor disease (DFTD). Populations have declined by 80% since the disease was first recorded in 1996. A reference genome for this species was published in 2012 and has been crucial for understanding DFTD and the management of the species in the wild. Here we use the Tasmanian devil as an example of how a reference genome has influenced management actions in the conservation of a species.
Affiliation(s)
- Katherine Belov
- School of Life & Environmental Sciences, The University of Sydney, Sydney 2006, Australia; (P.B.); (E.P.); (C.J.H.)
15
Wu X, Heffelfinger C, Zhao H, Dellaporta SL. Benchmarking variant identification tools for plant diversity discovery. BMC Genomics 2019; 20:701. [PMID: 31500583 PMCID: PMC6734213 DOI: 10.1186/s12864-019-6057-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 04/29/2019] [Accepted: 08/22/2019] [Indexed: 11/10/2022]
Abstract
BACKGROUND The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions as to their efficacy for plant research. A detailed evaluation of the entire variant calling pipeline, including alignment, variant calling, variant filtering, and imputation, was performed on different programs using both simulated and real plant genomic datasets. RESULTS A comparison of SOAP2, Bowtie2, and BWA-MEM found that BWA-MEM was consistently able to align the most reads with high accuracy, whereas Bowtie2 had the highest overall accuracy. Comparative results of GATK HaplotypeCaller versus SAMtools mpileup indicated that the choice of variant caller affected precision and recall differentially depending on the levels of diversity, sequence coverage and genome complexity. A cross-reference experiment of S. lycopersicum and S. pennellii reference genomes revealed the inadequacy of a single reference genome for variant discovery that includes distantly related plant individuals. A machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff strategy, resulting in a higher number of true positive variants and fewer false positive variants. A 2-step imputation method, which utilized a set of high-confidence SNPs as the reference panel, showed up to 60% higher accuracy than direct LD-based imputation. CONCLUSIONS Programs in the variant discovery pipeline perform differently on plant genomic datasets. The choice of programs depends on the goal of the study and available resources. This study provides important guidance for plant biologists utilizing next-generation sequencing data for diversity characterization and crop improvement.
Affiliation(s)
- Xing Wu
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT, 06520-8104, USA
- Christopher Heffelfinger
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT, 06520-8104, USA
- Hongyu Zhao
- Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, 06520-8034, USA
- Stephen L Dellaporta
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT, 06520-8104, USA.
16
Malmberg MM, Spangenberg GC, Daetwyler HD, Cogan NOI. Assessment of low-coverage nanopore long read sequencing for SNP genotyping in doubled haploid canola (Brassica napus L.). Sci Rep 2019; 9:8688. [PMID: 31213642 PMCID: PMC6582154 DOI: 10.1038/s41598-019-45131-0] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 01/31/2019] [Accepted: 05/28/2019] [Indexed: 11/16/2022]
Abstract
Despite the high accuracy of short read sequencing (SRS), there are still issues with attaining accurate single nucleotide polymorphism (SNP) genotypes at low sequencing coverage and in highly duplicated genomes due to misalignment. Long read sequencing (LRS) systems, including the Oxford Nanopore Technologies (ONT) MinION, have become popular options for de novo genome assembly and structural variant characterisation. The current high error rate often requires substantial post-sequencing correction and would appear to prevent the adoption of this system for SNP genotyping, but nanopore sequencing errors are largely random. The use of low-coverage ONT MinION sequencing for genotyping of pre-validated SNP loci was examined in 9 canola doubled haploids. The MinION genotypes were compared to the Illumina sequences to determine the extent and nature of genotype discrepancies between the two systems. The significant increase in read length improved alignment to the genome, and the absence of classical SRS biases resulted in a more even representation of the genome. Sequencing errors are present, primarily in the form of heterozygous genotypes, which can be removed in completely homozygous backgrounds but require more advanced bioinformatics in heterozygous genomes. Developments in this technology are promising for routine genotyping in the future.
Affiliation(s)
- M M Malmberg
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, 5 Ring Road, Bundoora, Victoria, 3083, Australia; School of Applied Systems Biology, La Trobe University, Bundoora, Victoria, 3086, Australia
- G C Spangenberg
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, 5 Ring Road, Bundoora, Victoria, 3083, Australia; School of Applied Systems Biology, La Trobe University, Bundoora, Victoria, 3086, Australia
- H D Daetwyler
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, 5 Ring Road, Bundoora, Victoria, 3083, Australia; School of Applied Systems Biology, La Trobe University, Bundoora, Victoria, 3086, Australia
- N O I Cogan
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, 5 Ring Road, Bundoora, Victoria, 3083, Australia; School of Applied Systems Biology, La Trobe University, Bundoora, Victoria, 3086, Australia.
17
Wang Y, Li X, Osmundson T, Shi L, Yan H. Comparative Genomic Analysis of a Multidrug-Resistant Listeria monocytogenes ST477 Isolate. Foodborne Pathog Dis 2019; 16:604-615. [PMID: 31094569 DOI: 10.1089/fpd.2018.2611] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 12/26/2022]
Abstract
Listeria monocytogenes is an opportunistic human foodborne pathogen that causes severe infections with high hospitalization and fatality rates. Clonal complex 9 (CC9) contains a large number of sequence types (STs) and is one of the predominant clones distributed worldwide. However, genetic characteristics of ST477 isolates, which also belong to CC9, have never been examined, and little is known about the detailed genomic traits of this food-associated clone. In this study, we sequenced and constructed the whole-genome sequence of an ST477 isolate from a frozen food sample in China and compared it with 58 previously sequenced genomes of 25 human-associated, 5 animal, and 27 food isolates consisting of 6 CC9 and 52 other clones. Phylogenetic analysis revealed that the ST477 isolate clustered with three Canadian ST9 isolates. The phylogenetic analyses also revealed that CC9 isolates involved in this study consistently possessed the invasion-related gene vip. Mobile genetic elements (MGEs), resistance genes, and the clustered regularly interspaced short palindromic repeats (CRISPR)/Cas system were elucidated among CC9 isolates. Our ST477 isolate contained a Tn554-like transposon, carrying five arsenical-resistance genes (arsA-arsD, arsR), which was exclusively identified in the CC9 background. Compared with the ST477 genome, the three Canadian ST9 isolates shared nonsynonymous nucleotide substitutions in the condensin complex gene smc and cell surface protein genes ftsA and essC. Our findings preliminarily indicate that the extraordinary success of the CC9 clone in colonization of different geographical regions is likely due to conserved features harboring MGEs, functional virulence and resistance genes. The ST477 and three ST9 genomes are closely related, and the distinct differences between them consist primarily of changes in genes involved in multiplication and invasion, which may contribute to the prevalence of ST9 isolates in food and food-processing environments.
Affiliation(s)
- Yage Wang
- School of Food Science and Engineering, South China University of Technology, Guangzhou, China
- Xinhui Li
- Department of Microbiology, University of Wisconsin-La Crosse, La Crosse, Wisconsin
- Todd Osmundson
- Department of Biology, University of Wisconsin-La Crosse, La Crosse, Wisconsin
- Lei Shi
- Institute of Food Safety and Nutrition, Jinan University, Guangzhou, China; State Key Laboratory of Food Safety Technology for Meat Products, Xiamen, China
- He Yan
- School of Food Science and Engineering, South China University of Technology, Guangzhou, China
18
Crysnanto D, Wurmser C, Pausch H. Accurate sequence variant genotyping in cattle using variation-aware genome graphs. Genet Sel Evol 2019; 51:21. [PMID: 31092189 PMCID: PMC6521551 DOI: 10.1186/s12711-019-0462-x] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 11/02/2018] [Accepted: 05/03/2019] [Indexed: 12/22/2022]
Abstract
BACKGROUND Genotyping of sequence variants typically involves, as a first step, the alignment of sequencing reads to a linear reference genome. Because a linear reference genome represents only a small fraction of all the DNA sequence variation within a species, reference allele bias may occur at highly polymorphic or divergent regions of the genome. Graph-based methods facilitate the comparison of sequencing reads to a variation-aware genome graph, which incorporates a collection of non-redundant DNA sequences that segregate within a species. We compared the accuracy and sensitivity of graph-based sequence variant genotyping using the Graphtyper software to two widely-used methods, i.e., GATK and SAMtools, which rely on linear reference genomes using whole-genome sequencing data from 49 Original Braunvieh cattle. RESULTS We discovered 21,140,196, 20,262,913, and 20,668,459 polymorphic sites using GATK, Graphtyper, and SAMtools, respectively. Comparisons between sequence variant genotypes and microarray-derived genotypes showed that Graphtyper outperformed both GATK and SAMtools in terms of genotype concordance, non-reference sensitivity, and non-reference discrepancy. The sequence variant genotypes that were obtained using Graphtyper had the smallest number of Mendelian inconsistencies between sequence-derived single nucleotide polymorphisms and indels in nine sire-son pairs. Genotype phasing and imputation using the Beagle software improved the quality of the sequence variant genotypes for all the tools evaluated, particularly for animals that were sequenced at low coverage. Following imputation, the concordance between sequence- and microarray-derived genotypes was almost identical for the three methods evaluated, i.e., 99.32, 99.46, and 99.24% for GATK, Graphtyper, and SAMtools, respectively. Variant filtration based on commonly used criteria improved genotype concordance slightly but it also decreased sensitivity. Graphtyper required considerably more computing resources than SAMtools but less than GATK. CONCLUSIONS Sequence variant genotyping using Graphtyper is accurate, sensitive and computationally feasible in cattle. Graph-based methods enable sequence variant genotyping from variation-aware reference genomes that may incorporate cohort-specific sequence variants, which is not possible with the current implementation of state-of-the-art methods that rely on linear reference genomes.
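Illustrative note: the abstract above names three comparison metrics (genotype concordance, non-reference sensitivity, non-reference discrepancy) without giving formulas. The sketch below shows one common convention for computing them from paired array and sequence genotypes; it is an assumption for illustration, not necessarily the exact definitions used by the authors. The function name `concordance_metrics` and the encoding of genotypes as alternate-allele counts (0, 1, 2, with `None` for missing) are hypothetical.

```python
from collections import Counter


def concordance_metrics(truth, called):
    """Compare array-derived (truth) and sequence-derived (called) genotypes.

    Both inputs are equal-length sequences of alternate-allele counts
    (0 = hom-ref, 1 = het, 2 = hom-alt); None marks a missing genotype.
    Assumes at least one non-reference truth genotype is present.
    """
    # Keep only sites genotyped by both platforms.
    pairs = [(t, c) for t, c in zip(truth, called)
             if t is not None and c is not None]
    table = Counter(pairs)
    total = sum(table.values())

    matches = sum(n for (t, c), n in table.items() if t == c)
    hom_ref_both = table[(0, 0)]
    truth_nonref = sum(n for (t, c), n in table.items() if t > 0)
    both_nonref = sum(n for (t, c), n in table.items() if t > 0 and c > 0)

    return {
        # Fraction of sites with identical genotypes.
        "concordance": matches / total,
        # Fraction of non-reference truth sites also called non-reference.
        "nonref_sensitivity": both_nonref / truth_nonref,
        # Mismatch rate after excluding sites that are hom-ref in both sets.
        "nonref_discrepancy": (total - matches) / (total - hom_ref_both),
    }
```

Excluding sites that are homozygous reference on both platforms keeps the discrepancy rate from being diluted by the many trivially matching reference genotypes, which is why it is usually reported alongside plain concordance.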
19
Ali H, Al-Mulla F, Hussain N, Naim M, Asbeutah AM, AlSahow A, Abu-Farha M, Abubaker J, Al Madhoun A, Ahmad S, Harris PC. PKD1 Duplicated regions limit clinical Utility of Whole Exome Sequencing for Genetic Diagnosis of Autosomal Dominant Polycystic Kidney Disease. Sci Rep 2019; 9:4141. [PMID: 30858458 PMCID: PMC6412018 DOI: 10.1038/s41598-019-40761-w] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 04/26/2018] [Accepted: 02/21/2019] [Indexed: 12/18/2022]
Abstract
Autosomal dominant polycystic kidney disease (ADPKD) is an inherited monogenic renal disease characterised by the accumulation of clusters of fluid-filled cysts in the kidneys and is caused by mutations in the PKD1 or PKD2 genes. ADPKD genetic diagnosis is complicated by PKD1 pseudogenes located proximal to the original gene with a high degree of homology. Next-generation sequencing (NGS) technology, including whole exome sequencing (WES) and whole genome sequencing (WGS), is becoming more affordable, and its use in the detection of ADPKD mutations for diagnostic and research purposes is becoming more widespread. However, how well NGS technology compares with the gold standard (Sanger sequencing) in the detection of ADPKD mutations remains to be answered. We have evaluated the efficacy of WES, WGS and targeted enrichment methodologies in detecting ADPKD mutations in the PKD1 and PKD2 genes in patients who were clinically evaluated by ultrasonography and renal function tests. Our results showed that WES detected PKD1 mutations in ADPKD patients with 50% sensitivity, as the reading depth and sequencing quality were low in the duplicated regions of PKD1 (exons 1-32) compared with those of WGS and target enrichment arrays. Our investigation highlights major limitations of WES in ADPKD genetic diagnosis. Enhancing reading depth, quality and sensitivity of WES in the PKD1 duplicated regions (exons 1-32) is crucial for its potential diagnostic or research applications.
Affiliation(s)
- Hamad Ali
- Department of Medical Laboratory Sciences, Faculty of Allied Health Sciences, Health Sciences Center, Kuwait University, Jabriya, Kuwait.
- Department of Genetics and Bioinformatics, Dasman Diabetes Institute (DDI), Dasman, Kuwait.
- Division of Nephrology, Mubarak Al-Kabeer Hospital, Ministry of Health, Jabriya, Kuwait.
- Fahd Al-Mulla
- Department of Genetics and Bioinformatics, Dasman Diabetes Institute (DDI), Dasman, Kuwait.
- Naser Hussain
- Division of Nephrology, Mubarak Al-Kabeer Hospital, Ministry of Health, Jabriya, Kuwait
- Medhat Naim
- Division of Nephrology, Mubarak Al-Kabeer Hospital, Ministry of Health, Jabriya, Kuwait
- Akram M Asbeutah
- Department of Radiological Sciences, Faculty of Allied Health Sciences, Health Sciences Center, Kuwait University, Jabriya, Kuwait
- Ali AlSahow
- Division of Nephrology, Al-Jahra Hospital, Ministry of Health, Al-Jahra, Kuwait
- Mohamed Abu-Farha
- Department of Biochemistry and Molecular Biology, Dasman Diabetes Institute (DDI), Dasman, Kuwait
- Jehad Abubaker
- Department of Biochemistry and Molecular Biology, Dasman Diabetes Institute (DDI), Dasman, Kuwait
- Ashraf Al Madhoun
- Department of Genetics and Bioinformatics, Dasman Diabetes Institute (DDI), Dasman, Kuwait
- Sajjad Ahmad
- Department of Cornea and External Diseases, Moorfields Eye Hospital-NHS Foundation Trust, London, United Kingdom
- Institute of Ophthalmology, University College London (UCL), London, United Kingdom
- Peter C Harris
- Division of Nephrology and Hypertension, Mayo Clinic, Rochester, USA
20
Böhne A, Weber AAT, Rajkov J, Rechsteiner M, Riss A, Egger B, Salzburger W. Repeated Evolution Versus Common Ancestry: Sex Chromosome Evolution in the Haplochromine Cichlid Pseudocrenilabrus philander. Genome Biol Evol 2019; 11:439-458. [PMID: 30649313 PMCID: PMC6375353 DOI: 10.1093/gbe/evz003] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Accepted: 01/07/2019] [Indexed: 12/15/2022]
Abstract
Why sex chromosomes turn over and remain undifferentiated in some taxa, whereas they degenerate in others, is still an area of ongoing research. The recurrent occurrence of homologous and homomorphic sex chromosomes in distantly related taxa suggests their independent evolution or continued recombination since their first emergence. Fishes display a great diversity of sex-determining systems. Here, we focus on sex chromosome evolution in haplochromines, the most species-rich lineage of cichlid fishes. We investigate sex-specific signatures in the Pseudocrenilabrus philander species complex, which belongs to a haplochromine genus found in many river systems and ichthyogeographic regions in northern, eastern, central, and southern Africa. Using whole-genome sequencing and population genetic, phylogenetic, and read-coverage analyses, we show that one population of P. philander has an XX-XY sex-determining system on LG7 with a large region of suppressed recombination. However, in a second bottlenecked population, we did not find any sign of a sex chromosome. Interestingly, LG7 also carries an XX-XY system in the phylogenetically more derived Lake Malawi haplochromine cichlids. Although the genomic regions determining sex are the same in Lake Malawi cichlids and P. philander, we did not find evidence for shared ancestry, suggesting that LG7 evolved as sex chromosome at least twice in haplochromine cichlids. Hence, our work provides further evidence for the labile nature of sex determination in fishes and supports the hypothesis that the same genomic regions can repeatedly and rapidly be recruited as sex chromosomes in more distantly related lineages.
Affiliation(s)
- Astrid Böhne
- Department of Environmental Sciences, Zoological Institute, University of Basel, Switzerland
| | - Alexandra Anh-Thu Weber
- Department of Environmental Sciences, Zoological Institute, University of Basel, Switzerland
- Museums Victoria, Melbourne, Victoria, Australia
| | - Jelena Rajkov
- Department of Environmental Sciences, Zoological Institute, University of Basel, Switzerland
| | - Michael Rechsteiner
- Department of Environmental Sciences, Zoological Institute, University of Basel, Switzerland
| | - Andrin Riss
- Department of Environmental Sciences, Zoological Institute, University of Basel, Switzerland
| | - Bernd Egger
- Department of Environmental Sciences, Zoological Institute, University of Basel, Switzerland
- Program Man Society Environment, University of Basel, Switzerland
| | - Walter Salzburger
- Department of Environmental Sciences, Zoological Institute, University of Basel, Switzerland
| |
|
21
|
Zhou W, Li X, Osmundson T, Shi L, Ren J, Yan H. WGS analysis of ST9-MRSA-XII isolates from live pigs in China provides insights into transmission among porcine, human and bovine hosts. J Antimicrob Chemother 2018; 73:2652-2661. [PMID: 29986036 DOI: 10.1093/jac/dky245] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 01/24/2018] [Accepted: 05/27/2018] [Indexed: 01/21/2025] Open
Abstract
OBJECTIVES To elucidate the phylogenetic relationships among ST9-MRSA-XII isolates from different sources and their genetic features in colonization of different hosts. METHODS We obtained whole-genome sequences of two ST9-MRSA-XII isolates from nasal swabs associated with live pigs in China, and compared them with 135 previously sequenced genomes of 78 human-associated, 39 bovine and 18 porcine Staphylococcus aureus consisting of 11 MRSA of SCCmecXII, 62 MRSA of other SCCmec types and 62 MSSA. The distribution of diverse mobile genetic elements (MGEs), resistance genes and virulence determinants was investigated in relation to isolate phylogeny. Comparisons of SNPs and small insertion/deletions (indels) were conducted to examine genome-level variation between porcine and bovine ST9-MRSA-XII. RESULTS Phylogenetic analysis revealed that both of our porcine ST9-MRSA-XII isolates clustered with porcine, bovine and human-associated ST9-MRSA-XII. All of these isolates possessed a novel type V pathogenicity island, νSaα, carrying the von Willebrand-binding protein gene vwb, the immune evasion complex gene scn, the aminoglycoside resistance gene aadE, staphylococcal superantigen-like genes (ssl1-ssl11) and lpl tandem genes. Compared with bovine ST9-MRSA-XII BA01611, our porcine isolates contain non-synonymous nucleotide substitutions in genes encoding adhesins and an indel located in a phosphonate ABC transporter pseudogene. CONCLUSIONS The data suggest transmission of ST9-MRSA-XII among swine, cattle and humans. The extraordinary success of the ST9-MRSA-XII group in colonization of various hosts is likely due to acquisition of many MGEs harbouring functional antimicrobial resistance and virulence genes. Transmission of ST9-MRSA-XII between porcine and bovine hosts was accompanied by changes in binding profile and function in genes involved in metabolism.
Affiliation(s)
- Wenyuan Zhou
- School of Food Science and Engineering, South China University of Technology, Guangzhou, China
| | - Xinhui Li
- Department of Microbiology, University of Wisconsin-La Crosse, 1725 State Street, La Crosse, WI, USA
| | - Todd Osmundson
- Department of Biology, University of Wisconsin-La Crosse, 1725 State Street, La Crosse, WI, USA
| | - Lei Shi
- Institute of Food Safety and Nutrition, Jinan University, Guangzhou, China
- State Key Laboratory of Food Safety Technology for Meat Products, Xiamen, Fujian, China
| | - Jiaoyan Ren
- School of Food Science and Engineering, South China University of Technology, Guangzhou, China
| | - He Yan
- School of Food Science and Engineering, South China University of Technology, Guangzhou, China
- State Key Laboratory of Food Safety Technology for Meat Products, Xiamen, Fujian, China
| |
|
22
|
Zhu X, Yu H, Xiao Q, Ke J, Li H, Chen Z, Ding H, Leng S, Huang Y, Zhan J, Lei J, Fan W, Luo H. Genetic variations in chromodomain helicase DNA-binding protein 5, gene-environment interactions and risk of sporadic Alzheimer's disease in Chinese population. Oncotarget 2018; 9:24872-24881. [PMID: 29861839 PMCID: PMC5982770 DOI: 10.18632/oncotarget.23791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2017] [Accepted: 12/05/2017] [Indexed: 11/25/2022] Open
Abstract
CHD5 is an essential factor for neuronal differentiation and neurodegenerative diseases. Here, targeted next-generation sequencing and TaqMan genotyping of the CHD5 gene were carried out in a two-stage case-control study in a Chinese population. Genetic statistics and gene-environment interactions were analyzed to identify risk factors for Alzheimer's disease. We found that the intronic variant rs11121295 was associated with the risk of Alzheimer's disease at both stages and in the combined cohorts. In the stratified analysis, this risk effect was consistently significant in the alcohol-consuming subgroups at all stages. The gene-environment interaction analysis further supported these findings. Our study highlights the potential role of CHD5 variants in conferring susceptibility to sporadic Alzheimer's disease, with the risk modified by alcohol intake.
Affiliation(s)
- Xiao Zhu
- Key Laboratory of Medical Molecular Diagnosis, Dongguan Scientific Research Center, Guangdong Medical University, Dongguan, China.,Institute of Bioinformatics, University of Georgia, Athens, GA, USA
| | - Haibing Yu
- Key Laboratory of Medical Molecular Diagnosis, Dongguan Scientific Research Center, Guangdong Medical University, Dongguan, China
| | - Qin Xiao
- Department of Blood Transfusion, Peking University Shenzhen Hospital, Shenzhen, China
| | - Jianhao Ke
- Tropical Crops Department, Guangdong AIB Polytechnic, Guangzhou, China
| | - Hongmei Li
- Key Laboratory of Medical Molecular Diagnosis, Dongguan Scientific Research Center, Guangdong Medical University, Dongguan, China
| | - Zhihong Chen
- Key Laboratory of Medical Molecular Diagnosis, Dongguan Scientific Research Center, Guangdong Medical University, Dongguan, China
| | - Hongrong Ding
- Key Laboratory of Medical Molecular Diagnosis, Dongguan Scientific Research Center, Guangdong Medical University, Dongguan, China
| | - Shuilong Leng
- Department of Human Anatomy, Guangzhou Medical University, Guangzhou, China
| | - Yongmei Huang
- Institute of Marine Medicine Research, Guangdong Medical University, Zhanjiang, China
| | - Jingting Zhan
- Institute of Marine Medicine Research, Guangdong Medical University, Zhanjiang, China
| | - Jinli Lei
- Institute of Marine Medicine Research, Guangdong Medical University, Zhanjiang, China
| | - Wenguo Fan
- Department of Anatomy and Physiology, Guanghua School of Stomatology, Sun Yat-sen University, Guangzhou, China
| | - Hui Luo
- Key Laboratory of Medical Molecular Diagnosis, Dongguan Scientific Research Center, Guangdong Medical University, Dongguan, China.,Institute of Marine Medicine Research, Guangdong Medical University, Zhanjiang, China
| |
|
23
|
Shringarpure SS, Mathias RA, Hernandez RD, O'Connor TD, Szpiech ZA, Torres R, De La Vega FM, Bustamante CD, Barnes KC, Taub MA. Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data. Bioinformatics 2018; 33:1147-1153. [PMID: 28035032 PMCID: PMC5408850 DOI: 10.1093/bioinformatics/btw786] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/15/2016] [Accepted: 12/07/2016] [Indexed: 12/30/2022] Open
Abstract
Motivation Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X). Results We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a False Positive Rate of 5%, our method determines true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina's single-sample caller CASAVA, the Real Time Genomics multisample variant caller, and the GATK UnifiedGenotyper, respectively. Since NGS data may be accompanied by genotype data for the same samples, either collected concurrently with sequencing or from a previous study, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g. a different set of criteria to determine quality for rare versus common variants) and thereby provides insight into sequencing characteristics that indicate call quality for variants of different frequencies. Availability and Implementation Code is available on GitHub at: https://github.com/suyashss/variant_validation. Contacts suyashs@stanford.edu or mtaub@jhsph.edu. Supplementary information Supplementary data are available at Bioinformatics online.
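The "true positive rate at a False Positive Rate of 5%" reported above can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `tpr_at_fpr`, the toy scores, and the thresholding rule are assumptions made for illustration, and any classifier that outputs a per-call score could supply the inputs.

```python
def tpr_at_fpr(scores, labels, target_fpr=0.05):
    """True positive rate at a score threshold that keeps FPR <= target_fpr.

    scores: classifier scores, higher meaning "more likely a true variant".
    labels: booleans, True where the call agrees with the array genotype
        (assumed ground truth for this sketch).
    """
    negatives = sorted((s for s, y in zip(scores, labels) if not y), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y]
    if not negatives or not positives:
        raise ValueError("need at least one true and one false call")

    # Number of false calls we may accept without exceeding target_fpr.
    allowed = int(target_fpr * len(negatives))
    # Calls scoring strictly above the threshold are accepted, so at most
    # `allowed` false calls pass.
    threshold = negatives[allowed] if allowed < len(negatives) else float("-inf")
    return sum(s > threshold for s in positives) / len(positives)


# Toy data (hypothetical): 20 false calls scored 1..20 and 4 true calls.
scores = list(range(1, 21)) + [25, 22, 20, 18]
labels = [False] * 20 + [True] * 4
rate = tpr_at_fpr(scores, labels)
```

Fixing the false positive rate first and then reading off the true positive rate makes call sets from different callers comparable at the same error budget.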
Affiliation(s)
- Suyash S Shringarpure
- Departments of Genetics and Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
| | - Rasika A Mathias
- 23 and Me Inc, Mountain View, CA, USA.,Department of Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Ryan D Hernandez
- Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, MD, USA.,Department of Bioengineering and Therapeutic Sciences.,Institute for Human Genetics
| | - Timothy D O'Connor
- Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA.,Institute for Genome Sciences.,Program in Personalized and Genomic Medicine
| | - Zachary A Szpiech
- Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, MD, USA
| | - Raul Torres
- Department of Medicine, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Francisco M De La Vega
- Departments of Genetics and Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
| | - Carlos D Bustamante
- Departments of Genetics and Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
| | - Kathleen C Barnes
- 23 and Me Inc, Mountain View, CA, USA.,Department of Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Margaret A Taub
- Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, CA, USA
| | | |
|
24
|
Whole-Genome Sequence Accuracy Is Improved by Replication in a Population of Mutagenized Sorghum. G3-GENES GENOMES GENETICS 2018; 8:1079-1094. [PMID: 29378822 PMCID: PMC5844295 DOI: 10.1534/g3.117.300301] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 12/30/2022]
Abstract
The accurate detection of induced mutations is critical for both forward and reverse genetics studies. Experimental chemical mutagenesis induces relatively few single base changes per individual. In a complex eukaryotic genome, false positive detection of mutations can occur at or above this mutagenesis rate. We demonstrate here, using a population of ethyl methanesulfonate (EMS)-treated Sorghum bicolor BTx623 individuals, that using replication to detect false positive-induced variants in next-generation sequencing (NGS) data permits higher throughput variant detection with greater accuracy. We used a lower sequence coverage depth (average of 7×) from 586 independently mutagenized individuals and detected 5,399,493 homozygous single nucleotide polymorphisms (SNPs). Of these, 76% originated from only 57,872 genomic positions prone to false positive variant calling. These positions are characterized by high copy number paralogs where the error-prone SNP positions are at copies containing a variant at the SNP position. The ability of short stretches of homology to generate these error-prone positions suggests that incompletely assembled or poorly mapped repeated sequences are one driver of these error-prone positions. Removal of these false positives left 1,275,872 homozygous and 477,531 heterozygous EMS-induced SNPs, which, congruent with the mutagenic mechanism of EMS, were >98% G:C to A:T transitions. Through this analysis, we generated a collection of sequence indexed mutants of sorghum. This collection contains 4035 high-impact homozygous mutations in 3637 genes and 56,514 homozygous missense mutations in 23,227 genes. Each line contains, on average, 2177 annotated homozygous SNPs per genome, including seven likely gene knockouts and 96 missense mutations. The number of mutations in a transcript was linearly correlated with the transcript length and also the G+C count, but not with the GC/AT ratio. 
Analysis of the detected mutagenized positions identified CG-rich patches, and flanking sequences strongly influenced EMS-induced mutation rates. This method for detecting false positive-induced mutations is generally applicable to any organism, is independent of the choice of in silico variant-calling algorithm, and is most valuable when the true mutation rate is likely to be low, such as in laboratory-induced mutations or somatic mutation detection in medicine.
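The replication idea in this abstract can be illustrated with a minimal Python sketch. It assumes, as a simplification not taken from the paper, that each individual's calls are available as a set of (chromosome, position, alternate allele) tuples and that any variant seen in more than `max_carriers` independently mutagenized individuals is treated as error-prone. The function name and the threshold are hypothetical.

```python
from collections import Counter


def split_by_replication(calls_by_individual, max_carriers=1):
    """Separate variants into retained and recurrent (suspect) sets.

    calls_by_individual: dict mapping individual ID -> set of
        (chrom, pos, alt) tuples called in that individual.
    max_carriers: hypothetical cutoff; a variant carried by more
        individuals than this is flagged as a likely false positive,
        since independent mutagenesis rarely hits the same site twice.
    """
    carriers = Counter()
    for calls in calls_by_individual.values():
        carriers.update(set(calls))  # count each individual at most once

    suspect = {variant for variant, n in carriers.items() if n > max_carriers}
    retained = {
        individual: set(calls) - suspect
        for individual, calls in calls_by_individual.items()
    }
    return retained, suspect


# Toy example: one position recurs in all three lines and is flagged.
calls = {
    "line_A": {("chr1", 100, "A"), ("chr2", 555, "T")},
    "line_B": {("chr1", 200, "A"), ("chr2", 555, "T")},
    "line_C": {("chr3", 300, "T"), ("chr2", 555, "T")},
}
retained, suspect = split_by_replication(calls)
```

Because the filter only counts recurrence across individuals, it works on the output of any variant caller, which matches the abstract's claim of independence from the calling algorithm.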
|
25
|
Last AR, Pickering H, Roberts CH, Coll F, Phelan J, Burr SE, Cassama E, Nabicassa M, Seth-Smith HMB, Hadfield J, Cutcliffe LT, Clarke IN, Mabey DCW, Bailey RL, Clark TG, Thomson NR, Holland MJ. Population-based analysis of ocular Chlamydia trachomatis in trachoma-endemic West African communities identifies genomic markers of disease severity. Genome Med 2018; 10:15. [PMID: 29482619 PMCID: PMC5828069 DOI: 10.1186/s13073-018-0521-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 10/14/2017] [Accepted: 02/13/2018] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Chlamydia trachomatis (Ct) is the most common infectious cause of blindness and bacterial sexually transmitted infection worldwide. Ct strain-specific differences in clinical trachoma suggest that genetic polymorphisms in Ct may contribute to the observed variability in severity of clinical disease. METHODS Using Ct whole genome sequences obtained directly from conjunctival swabs, we studied Ct genomic diversity and associations between Ct genetic polymorphisms with ocular localization and disease severity in a treatment-naïve trachoma-endemic population in Guinea-Bissau, West Africa. RESULTS All Ct sequences fall within the T2 ocular clade phylogenetically. This is consistent with the presence of the characteristic deletion in trpA resulting in a truncated non-functional protein and the ocular tyrosine repeat regions present in tarP associated with ocular tissue localization. We have identified 21 Ct non-synonymous single nucleotide polymorphisms (SNPs) associated with ocular localization, including SNPs within pmpD (odds ratio, OR = 4.07, p* = 0.001) and tarP (OR = 0.34, p* = 0.009). Eight synonymous SNPs associated with disease severity were found in yjfH (rlmB) (OR = 0.13, p* = 0.037), CTA0273 (OR = 0.12, p* = 0.027), trmD (OR = 0.12, p* = 0.032), CTA0744 (OR = 0.12, p* = 0.041), glgA (OR = 0.10, p* = 0.026), alaS (OR = 0.10, p* = 0.032), pmpE (OR = 0.08, p* = 0.001) and the intergenic region CTA0744-CTA0745 (OR = 0.13, p* = 0.043). CONCLUSIONS This study demonstrates the extent of genomic diversity within a naturally circulating population of ocular Ct and is the first to describe novel genomic associations with disease severity. These findings direct investigation of host-pathogen interactions that may be important in ocular Ct pathogenesis and disease transmission.
Affiliation(s)
- A. R. Last
- Clinical Research Department, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
| | - H. Pickering
- Clinical Research Department, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
| | - C. H. Roberts
- Clinical Research Department, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
| | - F. Coll
- Department of Pathogen Molecular Biology, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
| | - J. Phelan
- Department of Pathogen Molecular Biology, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
| | - S. E. Burr
- Clinical Research Department, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
- Disease Control and Elimination Theme, Medical Research Council Unit The Gambia, Fajara, Gambia
| | - E. Cassama
- Programa Nacional de Saúde de Visão, Ministério de Saúde Publica, Bissau, Guinea-Bissau
| | - M. Nabicassa
- Programa Nacional de Saúde de Visão, Ministério de Saúde Publica, Bissau, Guinea-Bissau
| | - H. M. B. Seth-Smith
- Pathogen Genomics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK
- Clinical Microbiology, Universitätsspital Basel, Basel, Switzerland
- Applied Microbiology Research, Department of Biomedicine, University of Basel, Basel, Switzerland
| | - J. Hadfield
- Pathogen Genomics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK
| | - L. T. Cutcliffe
- Molecular Microbiology Group, University of Southampton Medical School, Southampton, UK
| | - I. N. Clarke
- Molecular Microbiology Group, University of Southampton Medical School, Southampton, UK
| | - D. C. W. Mabey
- Clinical Research Department, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
| | - R. L. Bailey
- Clinical Research Department, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
| | - T. G. Clark
- Department of Pathogen Molecular Biology, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
- Department of Infectious Diseases Epidemiology, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
| | - N. R. Thomson
- Department of Pathogen Molecular Biology, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
- Pathogen Genomics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK
| | - M. J. Holland
- Clinical Research Department, London School of Hygiene and Tropical Medicine, Keppel Street, London, UK
| |
|
26
|
Abstract
The majority of rare diseases affect children, most of whom have an underlying genetic cause for their condition. However, making a molecular diagnosis with current technologies and knowledge is often still a challenge. Paediatric genomics is an immature but rapidly evolving field that tackles this issue by incorporating next-generation sequencing technologies, especially whole-exome sequencing and whole-genome sequencing, into research and clinical workflows. This complex multidisciplinary approach, coupled with the increasing availability of population genetic variation data, has already resulted in an increased discovery rate of causative genes and in improved diagnosis of rare paediatric disease. Importantly, for affected families, a better understanding of the genetic basis of rare disease translates to more accurate prognosis, management, surveillance and genetic advice; stimulates research into new therapies; and enables provision of better support.
|
27
|
Kobayashi M, Ohyanagi H, Takanashi H, Asano S, Kudo T, Kajiya-Kanegae H, Nagano AJ, Tainaka H, Tokunaga T, Sazuka T, Iwata H, Tsutsumi N, Yano K. Heap: a highly sensitive and accurate SNP detection tool for low-coverage high-throughput sequencing data. DNA Res 2017; 24:397-405. [PMID: 28498906 PMCID: PMC5737671 DOI: 10.1093/dnares/dsx012] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 06/15/2016] [Accepted: 04/20/2017] [Indexed: 12/30/2022] Open
Abstract
The recent availability of large-scale genomic resources enables so-called genome-wide association studies (GWAS) and genomic prediction (GP) studies, particularly with next-generation sequencing (NGS) data. The effectiveness of GWAS and GP depends not only on their mathematical models but also on the quality and quantity of the variants employed in the analysis. In NGS single nucleotide polymorphism (SNP) calling, conventional tools ideally require more reads for higher SNP sensitivity and accuracy. In this study, we aimed to develop a tool, Heap, that enables robustly sensitive and accurate calling of SNPs, particularly with low-coverage NGS data, which must be aligned to the reference genome sequences in advance. To reduce false positive SNPs, Heap determines genotypes and calls SNPs at each site except for sites at both ends of reads or sites containing a minor allele supported by only one read. Performance comparison with existing tools showed that Heap achieved the highest F-scores with low-coverage (7X) restriction-site associated DNA sequencing reads of sorghum and rice individuals. This will facilitate cost-effective GWAS and GP studies in this NGS era. Code and documentation of Heap are freely available from https://github.com/meiji-bioinf/heap (29 March 2017, date last accessed) and our web site (http://bioinf.mind.meiji.ac.jp/lab/en/tools.html (29 March 2017, date last accessed)).
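The two site-level exclusions described in this abstract can be sketched in Python as follows. This is not Heap's implementation: the function name, the input layout, and the parameters `end_margin` and `min_minor_reads` are assumptions chosen for illustration, and "ends of reads" is interpreted here as a fixed number of bases at each end.

```python
from collections import Counter


def usable_allele_counts(observations, end_margin=1, min_minor_reads=2):
    """Return allele counts for one site, or None if the site is skipped.

    observations: list of (base, offset, read_length) tuples, one per read
        covering the site; offset is the 0-based position within the read.
    end_margin: hypothetical number of bases at each read end to ignore.
    min_minor_reads: minimum reads that must support the least-supported
        allele for a multi-allelic site to be kept.
    """
    counts = Counter(
        base
        for base, offset, read_length in observations
        if end_margin <= offset < read_length - end_margin
    )
    if not counts:
        return None  # no usable bases left at this site
    if len(counts) > 1 and min(counts.values()) < min_minor_reads:
        return None  # minor allele is supported by too few reads
    return dict(counts)


# Toy sites: the first keeps both alleles after dropping a read-end base,
# the second is skipped because "G" is seen in a single read.
site = [("A", 10, 50), ("A", 20, 50), ("G", 0, 50), ("G", 30, 50), ("G", 25, 50)]
singleton = [("A", 10, 50), ("A", 20, 50), ("G", 30, 50)]
kept = usable_allele_counts(site)
skipped = usable_allele_counts(singleton)
```

Genotype determination would then operate only on the counts that survive these filters.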
Affiliation(s)
- Masaaki Kobayashi
- Bioinformatics Laboratory, Department of Life Sciences, School of Agriculture, Meiji University, Kanagawa 214-8571, Japan
| | - Hajime Ohyanagi
- Bioinformatics Laboratory, Department of Life Sciences, School of Agriculture, Meiji University, Kanagawa 214-8571, Japan.,King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Hideki Takanashi
- Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan
| | - Satomi Asano
- Bioinformatics Laboratory, Department of Life Sciences, School of Agriculture, Meiji University, Kanagawa 214-8571, Japan
| | - Toru Kudo
- Bioinformatics Laboratory, Department of Life Sciences, School of Agriculture, Meiji University, Kanagawa 214-8571, Japan
| | - Hiromi Kajiya-Kanegae
- Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan
| | - Atsushi J Nagano
- Faculty of Agriculture, Ryukoku University, Shiga 520-2194, Japan.,PRESTO, Japan Science and Technology Agency, Japan.,Center for Ecological Research, Kyoto University, Shiga 520-2113, Japan
| | - Hitoshi Tainaka
- Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan
| | | | - Takashi Sazuka
- Bioscience and Biotechnology Center, Nagoya University, Aichi 464-8601, Japan
| | - Hiroyoshi Iwata
- Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan
| | - Nobuhiro Tsutsumi
- Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan
| | - Kentaro Yano
- Bioinformatics Laboratory, Department of Life Sciences, School of Agriculture, Meiji University, Kanagawa 214-8571, Japan
| |
|
28
|
Abstract
Relatively little is known about the evolutionary history of the African green monkey (genus Chlorocebus) due to the lack of sampled polymorphism data from wild populations. Yet, this characterization of genetic diversity is not only critical for a better understanding of their own history, but also for human biomedical research given that they are one of the most widely used primate models. Here, I analyze the demographic and selective history of the African green monkey, utilizing one of the most comprehensive catalogs of wild genetic diversity to date, consisting of 1,795,643 autosomal single nucleotide polymorphisms in 25 individuals, representing all five major populations: C. a. aethiops, C. a. cynosurus, C. a. pygerythrus, C. a. sabaeus, and C. a. tantalus. Assuming a mutation rate of 5.9 × 10⁻⁹ per base pair per generation and a generation time of 8.5 years, divergence time estimates range from 523 to 621 kya for the basal split of C. a. aethiops from the other four populations. Importantly, the resulting tree characterizing the relationship and split-times between these populations differs significantly from that presented in the original genome paper, owing to their neglect of within-population variation when calculating between-population divergence. In addition, I find that the demographic history of all five populations is well explained by a model of population fragmentation and isolation, rather than novel colonization events. Finally, utilizing these demographic models as a null, I investigate the selective history of the populations, identifying candidate regions potentially related to adaptation in response to pathogen exposure.
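The stated mutation rate and generation time can be related to the reported split times with a short Python calculation. The two constants come from the abstract; the relation d = 2μT is a textbook two-lineage approximation assumed here for illustration, and it deliberately ignores ancestral within-population variation, which the abstract identifies as important. The function names are hypothetical.

```python
MUTATION_RATE = 5.9e-9   # per base pair per generation (from the abstract)
GENERATION_TIME = 8.5    # years per generation (from the abstract)


def years_to_generations(years, generation_time=GENERATION_TIME):
    """Convert a split time in years to a number of generations."""
    return years / generation_time


def expected_divergence(years, mu=MUTATION_RATE, generation_time=GENERATION_TIME):
    """Per-site divergence 2 * mu * T between two lineages.

    Assumes a simple model with no ancestral polymorphism; this is an
    illustrative simplification, not the method used in the paper.
    """
    return 2.0 * mu * years_to_generations(years, generation_time)


for kya in (523, 621):
    years = kya * 1_000
    generations = round(years_to_generations(years))
    print(f"{kya} kya: {generations} generations, "
          f"d = {expected_divergence(years):.2e} per site")
```

Under these assumptions the 523 to 621 kya range corresponds to roughly 61,500 to 73,100 generations.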
Affiliation(s)
- Susanne P Pfeifer
- School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.,Swiss Institute of Bioinformatics, Lausanne, Switzerland.,School of Life Sciences, Arizona State University, Tempe, AZ
| |
|
29
|
Navarro J, Nevado B, Hernández P, Vera G, Ramos-Onsins SE. Optimized Next-Generation Sequencing Genotype-Haplotype Calling for Genome Variability Analysis. Evol Bioinform Online 2017; 13:1176934317723884. [PMID: 28894353 PMCID: PMC5582667 DOI: 10.1177/1176934317723884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2016] [Accepted: 05/23/2017] [Indexed: 11/17/2022] Open
Abstract
The accurate estimation of nucleotide variability using next-generation sequencing data is challenged by the high number of sequencing errors produced by new sequencing technologies, especially for nonmodel species, where reference sequences may not be available and the read depth may be low due to limited budgets. The most popular single-nucleotide polymorphism (SNP) callers are designed to obtain a high SNP recovery and low false discovery rate but are not designed to account appropriately for the frequency of the variants. In contrast, algorithms designed to account for the frequency of SNPs give precise results for estimating the levels and the patterns of variability. These algorithms are focused on the unbiased estimation of the variability and not on the high recovery of SNPs. Here, we implemented a fast and optimized parallel algorithm that includes the method developed by Roesti et al and Lynch, which estimates the genotype of each individual at each site, considering the possibility of calling both bases of the genotype, a single one or none. This algorithm does not consider the reference and therefore is independent of biases related to the reference nucleotide specified. The pipeline starts from a BAM file converted to pileup or mpileup format and the software outputs a FASTA file. The new program not only reduces the running times but also, given the improved use of resources, allows its usage with both smaller computers and large parallel computers, expanding its benefits to a wider range of researchers. The output file can be analyzed using software for population genetics analysis, such as the R library PopGenome, the software VariScan, and the program mstatspop for analysis considering positions with missing data.
Affiliation(s)
- Javier Navarro
- Computer Architecture and Operating Systems Department, Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Bruno Nevado
- Department of Plant Sciences, University of Oxford, Oxford, UK
| | - Porfidio Hernández
- Computer Architecture and Operating Systems Department, Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Gonzalo Vera
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, Barcelona, Spain
| | | |
|
30
|
|
31
|
Abstract
In order to leverage novel sequencing techniques for cloning genes in eukaryotic organisms with complex genomes, the false positive rate of variant discovery must be controlled for by experimental design and informatics. We sequenced five lines from three pedigrees of ethyl methanesulfonate (EMS)-mutagenized Sorghum bicolor, including a pedigree segregating a recessive dwarf mutant. Comparing the sequences of the lines, we were able to identify and eliminate error-prone positions. One genomic region contained EMS mutant alleles in dwarfs that were homozygous reference sequences in wild-type siblings and heterozygous in segregating families. This region contained a single nonsynonymous change that cosegregated with dwarfism in a validation population and caused a premature stop codon in the Sorghum ortholog encoding the gibberellic acid (GA) biosynthetic enzyme ent-kaurene oxidase. Application of exogenous GA rescued the mutant phenotype. Our method for mapping did not require outcrossing and introduced no segregation variance. This enables work when line crossing is complicated by life history, permitting gene discovery outside of genetic models. This inverts the historical approach of first using recombination to define a locus and then sequencing genes. Our formally identical approach first sequences all the genes and then seeks cosegregation with the trait. Mutagenized lines lacking obvious phenotypic alterations are available for an extension of this approach: mapping with a known marker set in a line that is phenotypically identical to starting material for EMS mutant generation.
|
32
|
High-Throughput Resequencing of Maize Landraces at Genomic Regions Associated with Flowering Time. PLoS One 2017; 12:e0168910. [PMID: 28045987 PMCID: PMC5207663 DOI: 10.1371/journal.pone.0168910] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 05/10/2016] [Accepted: 12/08/2016] [Indexed: 12/17/2022] Open
Abstract
Despite the reduction in the price of sequencing, it remains expensive to sequence and assemble whole, complex genomes of multiple samples for population studies, particularly for large genomes like those of many crop species. Enrichment of target genome regions coupled with next generation sequencing is a cost-effective strategy to obtain sequence information for loci of interest across many individuals, providing a less expensive approach to evaluating sequence variation at the population scale. Here we evaluate amplicon-based enrichment coupled with semiconductor sequencing on a validation set consisting of three maize inbred lines, two hybrids and 19 landrace accessions. We report the use of a multiplexed panel of 319 PCR assays that target 20 candidate loci associated with photoperiod sensitivity in maize while requiring 25 ng or less of starting DNA per sample. Enriched regions had an average on-target sequence read depth of 105 with 98% of the sequence data mapping to the maize ‘B73’ reference and 80% of the reads mapping to the target interval. Sequence reads were aligned to B73 and 1,486 and 1,244 variants were called using SAMtools and GATK, respectively. Of the variants called by both SAMtools and GATK, 30% were not previously reported in maize. Due to the high sequence read depth, heterozygote genotypes could be called with at least 92.5% accuracy in hybrid materials using GATK. The genetic data are congruent with previous reports of high total genetic diversity and substantial population differentiation among maize landraces. In conclusion, semiconductor sequencing of highly multiplexed PCR reactions is a cost-effective strategy for resequencing targeted genomic loci in diverse maize materials.
|
33
|
From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016; 118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022] Open
Abstract
Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the produced data contain subtle but complex types of errors, biases and uncertainties that impose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.
|
34
|
Trends and Challenges in Pesticide Resistance Detection. Trends in Plant Science 2016; 21:834-853. [PMID: 27475253 DOI: 10.1016/j.tplants.2016.06.006] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 04/12/2016] [Revised: 06/15/2016] [Accepted: 06/18/2016] [Indexed: 06/06/2023]
Abstract
Pesticide resistance is a crucial factor to be considered when developing strategies for the minimal use of pesticides while maintaining pesticide efficacy. This goal requires monitoring the emergence and development of resistance to pesticides in crop pests. To this end, various methods for resistance diagnosis have been developed for different groups of pests. This review provides an overview of biological, biochemical, and molecular methods that are currently used to detect and quantify pesticide resistance. The agronomic, technical, and economic advantages and drawbacks of each method are considered. Emerging technologies are also described, with their associated challenges and their potential for the detection of resistance mechanisms likely to be selected by current and future plant protection methods.
|
35
|
Tian S, Yan H, Neuhauser C, Slager SL. An analytical workflow for accurate variant discovery in highly divergent regions. BMC Genomics 2016; 17:703. [PMID: 27590916 PMCID: PMC5010666 DOI: 10.1186/s12864-016-3045-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 05/18/2016] [Accepted: 08/25/2016] [Indexed: 02/07/2023] Open
Abstract
Background Current variant discovery methods often start with the mapping of short reads to a reference genome; yet, their performance deteriorates in genomic regions where the reads are highly divergent from the reference sequence. This is particularly problematic for the human leukocyte antigen (HLA) region on chromosome 6p21.3. This region is associated with over 100 diseases, but variant calling is hindered by the extreme divergence across different haplotypes. Results We simulated reads from chromosome 6 exonic regions over a wide range of sequence divergence and coverage depth. We systematically assessed combinations between five mappers and five callers for their performance on simulated data and exome-seq data from NA12878, a well-studied individual in which multiple public call sets have been generated. Among those combinations, the number of known SNPs differed by about 5 % in the non-HLA regions of chromosome 6 but over 20 % in the HLA region. Notably, GSNAP mapping combined with GATK UnifiedGenotyper calling identified about 20 % more known SNPs than most existing methods without a noticeable loss of specificity, with 100 % sensitivity in three highly polymorphic HLA genes examined. Much larger differences were observed among these combinations in INDEL calling from both non-HLA and HLA regions. We obtained similar results with our internal exome-seq data from a cohort of chronic lymphocytic leukemia patients. Conclusions We have established a workflow enabling variant detection, with high sensitivity and specificity, over the full spectrum of divergence seen in the human genome. Comparing to public call sets from NA12878 has highlighted the overall superiority of GATK UnifiedGenotyper, followed by GATK HaplotypeCaller and SAMtools, in SNP calling, and of GATK HaplotypeCaller and Platypus in INDEL calling, particularly in regions of high sequence divergence such as the HLA region. 
GSNAP and Novoalign are the ideal mappers in combination with the above callers. We expect that the proposed workflow should be applicable to variant discovery in other highly divergent regions.
Affiliation(s)
- Shulan Tian
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
- Huihuang Yan
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
- Claudia Neuhauser
  - Informatics Institute, University of Minnesota, Minneapolis, MN, 55455, USA
- Susan L Slager
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
|
36
|
Heaton MP, Smith TPL, Carnahan JK, Basnayake V, Qiu J, Simpson B, Kalbfleisch TS. Using diverse U.S. beef cattle genomes to identify missense mutations in EPAS1, a gene associated with pulmonary hypertension. F1000Res 2016; 5:2003. [PMID: 27746904 PMCID: PMC5040160 DOI: 10.12688/f1000research.9254.2] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Accepted: 10/04/2016] [Indexed: 01/08/2023] Open
Abstract
The availability of whole genome sequence (WGS) data has made it possible to discover protein variants in silico. However, existing bovine WGS databases do not show data in a form conducive to protein variant analysis, and tend to under represent the breadth of genetic diversity in global beef cattle. Thus, our first aim was to use 96 beef sires, sharing minimal pedigree relationships, to create a searchable and publicly viewable set of mapped genomes relevant for 19 popular breeds of U.S. cattle. Our second aim was to identify protein variants encoded by the bovine endothelial PAS domain-containing protein 1 gene (EPAS1), a gene associated with pulmonary hypertension in Angus cattle. The identity and quality of genomic sequences were verified by comparing WGS genotypes to those derived from other methods. The average read depth, genotype scoring rate, and genotype accuracy exceeded 14, 99%, and 99%, respectively. The 96 genomes were used to discover four amino acid variants encoded by EPAS1 (E270Q, P362L, A671G, and L701F) and confirm two variants previously associated with disease (A606T and G610S). The six EPAS1 missense mutations were verified with matrix-assisted laser desorption/ionization time-of-flight mass spectrometry assays, and their frequencies were estimated in a separate collection of 1154 U.S. cattle representing 46 breeds. A rooted phylogenetic tree of eight polypeptide sequences provided a framework for evaluating the likely order of mutations and potential impact of EPAS1 alleles on the adaptive response to chronic hypoxia in U.S. cattle. This public, whole genome resource facilitates in silico identification of protein variants in diverse types of U.S. beef cattle, and provides a means of translating WGS data into a practical biological and evolutionary context for generating and testing hypotheses.
Affiliation(s)
- Theodore S Kalbfleisch
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Louisville, Louisville, USA
|
37
|
Heaton MP, Smith TP, Carnahan JK, Basnayake V, Qiu J, Simpson B, Kalbfleisch TS. Using diverse U.S. beef cattle genomes to identify missense mutations in EPAS1, a gene associated with pulmonary hypertension. F1000Res 2016; 5:2003. [PMID: 27746904 PMCID: PMC5040160 DOI: 10.12688/f1000research.9254.1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Accepted: 10/04/2016] [Indexed: 08/24/2023] Open
Abstract
The availability of whole genome sequence (WGS) data has made it possible to discover protein variants in silico. However, existing bovine WGS databases do not show data in a form conducive to protein variant analysis, and tend to under represent the breadth of genetic diversity in global beef cattle. Thus, our first aim was to use 96 beef sires, sharing minimal pedigree relationships, to create a searchable and publicly viewable set of mapped genomes relevant for 19 popular breeds of U.S. cattle. Our second aim was to identify protein variants encoded by the bovine endothelial PAS domain-containing protein 1 gene (EPAS1), a gene associated with pulmonary hypertension in Angus cattle. The identity and quality of genomic sequences were verified by comparing WGS genotypes to those derived from other methods. The average read depth, genotype scoring rate, and genotype accuracy exceeded 14, 99%, and 99%, respectively. The 96 genomes were used to discover four amino acid variants encoded by EPAS1 (E270Q, P362L, A671G, and L701F) and confirm two variants previously associated with disease (A606T and G610S). The six EPAS1 missense mutations were verified with matrix-assisted laser desorption/ionization time-of-flight mass spectrometry assays, and their frequencies were estimated in a separate collection of 1154 U.S. cattle representing 46 breeds. A rooted phylogenetic tree of eight polypeptide sequences provided a framework for evaluating the likely order of mutations and potential impact of EPAS1 alleles on the adaptive response to chronic hypoxia in U.S. cattle. This public, whole genome resource facilitates in silico identification of protein variants in diverse types of U.S. beef cattle, and provides a means of translating WGS data into a practical biological and evolutionary context for generating and testing hypotheses.
Affiliation(s)
- Theodore S. Kalbfleisch
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Louisville, Louisville, USA
|
38
|
Guggisberg AM, Sundararaman SA, Lanaspa M, Moraleda C, González R, Mayor A, Cisteró P, Hutchinson D, Kremsner PG, Hahn BH, Bassat Q, Odom AR. Whole-Genome Sequencing to Evaluate the Resistance Landscape Following Antimalarial Treatment Failure With Fosmidomycin-Clindamycin. J Infect Dis 2016; 214:1085-91. [PMID: 27443612 DOI: 10.1093/infdis/jiw304] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 05/11/2016] [Accepted: 07/14/2016] [Indexed: 11/12/2022] Open
Abstract
Novel antimalarial therapies are needed in the face of emerging resistance to artemisinin combination therapies. A previous study found a high cure rate in Mozambican children with uncomplicated Plasmodium falciparum malaria 7 days after combination treatment with fosmidomycin-clindamycin. However, 28-day cure rates were low (45.9%), owing to parasite recrudescence. We sought to identify any genetic changes underlying parasite recrudescence. To this end, we used a selective whole-genome amplification method to amplify parasite genomes from blood spot DNA samples. Parasite genomes from pretreatment and postrecrudescence samples were subjected to whole-genome sequencing to identify nucleotide variants. Our data did not support the existence of a genetic change responsible for recrudescence following fosmidomycin-clindamycin treatment. Additionally, we found that previously described resistance alleles for these drugs do not represent biomarkers of recrudescence. Future studies should continue to optimize fosmidomycin combinations for use as antimalarial therapies.
Affiliation(s)
- Sesh A Sundararaman
  - Department of Medicine and Department of Microbiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia
- Miguel Lanaspa
  - Centro de Investigação em Saúde de Manhiça, Mozambique
  - Barcelona Institute for Global Health, Barcelona Center for International Health Research, Hospital Clínic-Universitat de Barcelona, Spain
- Cinta Moraleda
  - Centro de Investigação em Saúde de Manhiça, Mozambique
  - Barcelona Institute for Global Health, Barcelona Center for International Health Research, Hospital Clínic-Universitat de Barcelona, Spain
- Raquel González
  - Centro de Investigação em Saúde de Manhiça, Mozambique
  - Barcelona Institute for Global Health, Barcelona Center for International Health Research, Hospital Clínic-Universitat de Barcelona, Spain
- Alfredo Mayor
  - Centro de Investigação em Saúde de Manhiça, Mozambique
  - Barcelona Institute for Global Health, Barcelona Center for International Health Research, Hospital Clínic-Universitat de Barcelona, Spain
- Pau Cisteró
  - Barcelona Institute for Global Health, Barcelona Center for International Health Research, Hospital Clínic-Universitat de Barcelona, Spain
- Peter G Kremsner
  - Institut für Tropenmedizin, University of Tübingen, Germany
  - Centre de Recherches Médicales de Lambaréné, Gabon
- Beatrice H Hahn
  - Department of Medicine and Department of Microbiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia
- Quique Bassat
  - Centro de Investigação em Saúde de Manhiça, Mozambique
- Audrey R Odom
  - Department of Pediatrics and Department of Molecular Microbiology, Washington University School of Medicine, St. Louis, Missouri
|
39
|
Reply to Wang et al.: Sequencing datasets do not refute Central Asian domestication origin of dogs. Proc Natl Acad Sci U S A 2016; 113:E2556-7. [PMID: 27099288 DOI: 10.1073/pnas.1600618113] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022] Open
|
40
|
Dickson DJ, Pfeifer JD. Real-world data in the molecular era-finding the reality in the real world. Clin Pharmacol Ther 2016; 99:186-97. [PMID: 26565654 DOI: 10.1002/cpt.300] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 10/13/2015] [Accepted: 11/10/2015] [Indexed: 01/06/2023]
Abstract
Real-world data (RWD) promises to provide a pivotal element to the understanding of personalized medicine. However, without true representation (or the reality) of the patient-disease biosystem and its molecular contributors, RWD may hamper rather than help this advancement. In this review article, we discuss RWD vs. clinical reality and the disconnects that exist currently (emphasizing molecular medicine), and methods of closing the gaps between RWD and reality.
Affiliation(s)
- D J Dickson
  - Molecular Evidence Development Consortium, Rexburg, Idaho, USA
- J D Pfeifer
  - Department of Pathology, Washington University School of Medicine, St. Louis, Missouri, USA
|
41
|
Gézsi A, Bolgár B, Marx P, Sarkozy P, Szalai C, Antal P. VariantMetaCaller: automated fusion of variant calling pipelines for quantitative, precision-based filtering. BMC Genomics 2015; 16:875. [PMID: 26510841 PMCID: PMC4625715 DOI: 10.1186/s12864-015-2050-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 06/24/2015] [Accepted: 10/06/2015] [Indexed: 11/10/2022] Open
Abstract
Background The low concordance between different variant calling methods still poses a challenge for the wide-spread application of next-generation sequencing in research and clinical practice. A wide range of variant annotations can be used for filtering call sets in order to improve the precision of the variant calls, but the choice of the appropriate filtering thresholds is not straightforward. Variant quality score recalibration provides an alternative solution to hard filtering, but it requires large-scale, genomic data. Results We evaluated germline variant calling pipelines based on BWA and Bowtie 2 aligners in combination with GATK UnifiedGenotyper, GATK HaplotypeCaller, FreeBayes and SAMtools variant callers, using simulated and real benchmark sequencing data (NA12878 with Illumina Platinum Genomes). We argue that these pipelines are not merely discordant, but they extract complementary useful information. We introduce VariantMetaCaller to test the hypothesis that the automated fusion of measurement related information allows better performance than the recommended hard-filtering settings or recalibration and the fusion of the individual call sets without using annotations. VariantMetaCaller uses Support Vector Machines to combine multiple information sources generated by variant calling pipelines and estimates probabilities of variants. This novel method had significantly higher sensitivity and precision than the individual variant callers in all target region sizes, ranging from a few hundred kilobases to whole exomes. We also demonstrated that VariantMetaCaller supports a quantitative, precision based filtering of variants under wider conditions. Specifically, the computed probabilities of the variants can be used to order the variants, and for a given threshold, probabilities can be used to estimate precision. 
Precision then can be directly translated to the number of true called variants, or equivalently, to the number of false calls, which allows finding a problem-specific balance between sensitivity and precision. Conclusions VariantMetaCaller can be applied to small target regions and whole exomes as well, and it can be used in cases of organisms for which highly accurate variant call sets are not yet available; therefore, it can be a viable alternative to hard filtering in cases where variant quality score recalibration cannot be used. VariantMetaCaller is freely available at http://bioinformatics.mit.bme.hu/VariantMetaCaller.
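The precision-based filtering idea in this abstract can be illustrated with a minimal Python sketch. It is not VariantMetaCaller's code: the function name `summarize_threshold`, the `FilterSummary` type, and the example probabilities are all hypothetical, and the sketch assumes that each per-variant probability is a calibrated estimate that the call is true, so that the mean probability of the kept calls approximates their precision.

```python
from typing import NamedTuple, Sequence


class FilterSummary(NamedTuple):
    n_called: int              # variants kept at this threshold
    expected_precision: float  # mean probability of the kept calls
    expected_true: float       # expected number of true calls
    expected_false: float      # expected number of false calls


def summarize_threshold(probs: Sequence[float], threshold: float) -> FilterSummary:
    """Summarize a call set filtered at `threshold`.

    Assumption (not taken from the paper): probabilities are calibrated,
    so the expected precision is the mean probability of the kept calls.
    """
    # Order the kept variants by probability, highest first.
    kept = sorted((p for p in probs if p >= threshold), reverse=True)
    n = len(kept)
    if n == 0:
        return FilterSummary(0, float("nan"), 0.0, 0.0)
    expected_true = sum(kept)
    return FilterSummary(n, expected_true / n, expected_true, n - expected_true)


# Hypothetical per-variant probabilities, for illustration only.
example_probs = [0.99, 0.95, 0.80, 0.40, 0.10]
summary = summarize_threshold(example_probs, threshold=0.5)
print(summary)
```

Raising the threshold keeps fewer calls with a higher expected precision, which is the sensitivity and precision trade-off the abstract describes.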
Affiliation(s)
- András Gézsi
  - Department of Genetics, Cell- and Immunobiology, Semmelweis University, Nagyvárad tér 4, Budapest, H-1089, Hungary.
  - Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary.
- Bence Bolgár
  - Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary.
- Péter Marx
  - Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary.
- Peter Sarkozy
  - Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary.
- Csaba Szalai
  - Department of Genetics, Cell- and Immunobiology, Semmelweis University, Nagyvárad tér 4, Budapest, H-1089, Hungary.
- Péter Antal
  - Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary.
|
42
|
Ni G, Strom TM, Pausch H, Reimer C, Preisinger R, Simianer H, Erbe M. Comparison among three variant callers and assessment of the accuracy of imputation from SNP array data to whole-genome sequence level in chicken. BMC Genomics 2015; 16:824. [PMID: 26486989 PMCID: PMC4618161 DOI: 10.1186/s12864-015-2059-2] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 06/09/2015] [Accepted: 10/09/2015] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND The technical progress in the last decade has made it possible to sequence millions of DNA reads in a relatively short time frame. Several variant callers based on different algorithms have emerged and have made it possible to extract single nucleotide polymorphisms (SNPs) out of the whole-genome sequence. Often, only a few individuals of a population are sequenced completely and imputation is used to obtain genotypes for all sequence-based SNP loci for other individuals, which have been genotyped for a subset of SNPs using a genotyping array. METHODS First, we compared the sets of variants detected with different variant callers, namely GATK, freebayes and SAMtools, and checked the quality of genotypes of the called variants in a set of 50 fully sequenced white and brown layers. Second, we assessed the imputation accuracy (measured as the correlation between imputed and true genotype per SNP and per individual, and genotype conflict between father-progeny pairs) when imputing from high density SNP array data to whole-genome sequence using data from around 1000 individuals from six different generations. Three different imputation programs (Minimac, FImpute and IMPUTE2) were checked in different validation scenarios. RESULTS There were 1,741,573 SNPs detected by all three callers on the studied chromosomes 3, 6, and 28, which was 71.6 % (81.6 %, 88.0 %) of SNPs detected by GATK (SAMtools, freebayes) in total. Genotype concordance (GC) defined as the proportion of individuals whose array-derived genotypes are the same as the sequence-derived genotypes over all non-missing SNPs on the array were 0.98 (GATK), 0.97 (freebayes) and 0.98 (SAMtools). Furthermore, the percentage of variants that had high values (>0.9) for another three measures (non-reference sensitivity, non-reference genotype concordance and precision) were 90 (88, 75) for GATK (SAMtools, freebayes). 
With all imputation programs, correlation between original and imputed genotypes was >0.95 on average with randomly masked 1000 SNPs from the SNP array and >0.85 for a leave-one-out cross-validation within sequenced individuals. CONCLUSIONS Performance of all variant callers studied was very good in general, particularly for GATK and SAMtools. FImpute performed slightly worse than Minimac and IMPUTE2 in terms of genotype correlation, especially for SNPs with low minor allele frequency, while it had lowest numbers in Mendelian conflicts in available father-progeny pairs. Correlations of real and imputed genotypes remained constantly high even if individuals to be imputed were several generations away from the sequenced individuals.
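Two quantities in this abstract lend themselves to a short worked sketch in Python. The function names and the 0/1/2 genotype coding below are assumptions made for illustration, not the authors' code. The concordance function takes a simplified per-individual reading of the definition: the fraction of SNPs, among those non-missing in both sources, at which the array and sequence genotypes agree. The second function only rearranges the reported percentages to recover each caller's implied total.

```python
from typing import Optional, Sequence


def genotype_concordance(array_gt: Sequence[Optional[int]],
                         seq_gt: Sequence[Optional[int]]) -> float:
    """Fraction of SNPs with identical genotypes, ignoring missing calls.

    Genotypes are assumed to be coded 0/1/2 (alternate allele count),
    with None marking a missing call in either source.
    """
    if len(array_gt) != len(seq_gt):
        raise ValueError("genotype vectors must have the same length")
    pairs = [(a, s) for a, s in zip(array_gt, seq_gt)
             if a is not None and s is not None]
    if not pairs:
        raise ValueError("no SNPs with both genotypes present")
    return sum(a == s for a, s in pairs) / len(pairs)


def implied_total(shared: int, percent_shared: float) -> float:
    """Total SNPs for one caller, given the shared count and its share in percent."""
    return shared / (percent_shared / 100.0)


# Hypothetical genotypes for one individual at five array SNPs.
print(genotype_concordance([0, 1, 2, None, 1], [0, 1, 1, 2, 1]))

# Shared SNP count and per-caller percentages as reported in the abstract.
for caller, pct in [("GATK", 71.6), ("SAMtools", 81.6), ("freebayes", 88.0)]:
    print(caller, round(implied_total(1_741_573, pct)))
```

Under this arithmetic, the 1,741,573 shared SNPs imply roughly 2.43 million SNPs called by GATK in total on the studied chromosomes.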
Affiliation(s)
- Guiyan Ni
  - Animal Breeding and Genetics Group, Georg-August-Universität, Göttingen, Germany.
- Tim M Strom
  - Institute of Human Genetics, Helmholtz Zentrum München, Neuherberg, Germany.
- Hubert Pausch
  - Chair of Animal Breeding, Technische Universität München, Freising, Germany.
- Christian Reimer
  - Animal Breeding and Genetics Group, Georg-August-Universität, Göttingen, Germany.
- Henner Simianer
  - Animal Breeding and Genetics Group, Georg-August-Universität, Göttingen, Germany.
- Malena Erbe
  - Animal Breeding and Genetics Group, Georg-August-Universität, Göttingen, Germany.
  - Institute for Animal Breeding, Bavarian State Research Centre for Agriculture, Grub, Germany.
|
43
|
Krasnov GS, Dmitriev AA, Kudryavtseva AV, Shargunov AV, Karpov DS, Uroshlev LA, Melnikova NV, Blinov VM, Poverennaya EV, Archakov AI, Lisitsa AV, Ponomarenko EA. PPLine: An Automated Pipeline for SNP, SAP, and Splice Variant Detection in the Context of Proteogenomics. J Proteome Res 2015; 14:3729-37. [DOI: 10.1021/acs.jproteome.5b00490] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Indexed: 11/29/2022]
Affiliation(s)
- George Sergeevich Krasnov
  - Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, 111991 Russia
  - Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, 119121 Russia
  - Mechnikov Research Institute of Vaccines and Sera, Moscow, 105064 Russia
- Anna Viktorovna Kudryavtseva
  - Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, 111991 Russia
  - Herzen Moscow Cancer Research Institute, Ministry of Healthcare of the Russian Federation, Moscow, 125284 Russia
- Alexander Valerievich Shargunov
  - Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, 119121 Russia
  - Mechnikov Research Institute of Vaccines and Sera, Moscow, 105064 Russia
- Dmitry Sergeevich Karpov
  - Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, 111991 Russia
  - Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, 119121 Russia
- Vladimir Mikhailovich Blinov
  - Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, 119121 Russia
  - Mechnikov Research Institute of Vaccines and Sera, Moscow, 105064 Russia
- Andrey Valerievich Lisitsa
  - Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, 119121 Russia
|
44
|
Muraya MM, Schmutzer T, Ulpinnis C, Scholz U, Altmann T. Targeted Sequencing Reveals Large-Scale Sequence Polymorphism in Maize Candidate Genes for Biomass Production and Composition. PLoS One 2015; 10:e0132120. [PMID: 26151830 PMCID: PMC4495061 DOI: 10.1371/journal.pone.0132120] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 01/20/2015] [Accepted: 06/10/2015] [Indexed: 12/30/2022] Open
Abstract
A major goal of maize genomic research is to identify sequence polymorphisms responsible for phenotypic variation in traits of economic importance. Large-scale detection of sequence variation is critical for linking genes, or genomic regions, to phenotypes. However, due to its size and complexity, it remains expensive to generate whole genome sequences of sufficient coverage for divergent maize lines, even with access to next generation sequencing (NGS) technology. Because methods involving reduction of genome complexity, such as genotyping-by-sequencing (GBS), assess only a limited fraction of sequence variation, targeted sequencing of selected genomic loci offers an attractive alternative. We therefore designed a sequence capture assay to target 29 Mb genomic regions and surveyed a total of 4,648 genes possibly affecting biomass production in 21 diverse inbred maize lines (7 flints, 14 dents). Captured and enriched genomic DNA was sequenced using the 454 NGS platform to 19.6-fold average depth coverage, and a broad evaluation of read alignment and variant calling methods was performed to select optimal procedures for variant discovery. Sequence alignment with the B73 reference and de novo assembly identified 383,145 putative single nucleotide polymorphisms (SNPs), of which 42,685 were non-synonymous alterations and 7,139 caused frameshifts. Presence/absence variation (PAV) of genes was also detected. We found that substantial sequence variation exists among genomic regions targeted in this study, which was particularly evident within coding regions. This diversification has the potential to broaden functional diversity and generate phenotypic variation that may lead to new adaptations and the modification of important agronomic traits. Further, annotated SNPs identified here will serve as useful genetic tools and as candidates in searches for phenotype-altering DNA variation. 
In summary, we demonstrated that sequencing of captured DNA is a powerful approach for variant discovery in maize genes.
Affiliation(s)
- Moses M. Muraya
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstraße 3, D-06466, Stadt Seeland, Germany
  - Department of Plant Science, Chuka University, P.O. Box, 109–60400, Chuka, Kenya
- Thomas Schmutzer
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstraße 3, D-06466, Stadt Seeland, Germany
- Chris Ulpinnis
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstraße 3, D-06466, Stadt Seeland, Germany
- Uwe Scholz
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstraße 3, D-06466, Stadt Seeland, Germany
- Thomas Altmann
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstraße 3, D-06466, Stadt Seeland, Germany
|
45
|
Recessive mutations in POLR1C cause a leukodystrophy by impairing biogenesis of RNA polymerase III. Nat Commun 2015; 6:7623. [PMID: 26151409 PMCID: PMC4506509 DOI: 10.1038/ncomms8623] [Citation(s) in RCA: 128] [Impact Index Per Article: 12.8] [Received: 11/17/2014] [Accepted: 05/26/2015] [Indexed: 12/20/2022] Open
Abstract
A small proportion of 4H (Hypomyelination, Hypodontia and Hypogonadotropic Hypogonadism) or RNA polymerase III (POLR3)-related leukodystrophy cases are negative for mutations in the previously identified causative genes POLR3A and POLR3B. Here we report eight of these cases carrying recessive mutations in POLR1C, a gene encoding a shared POLR1 and POLR3 subunit, also mutated in some Treacher Collins syndrome (TCS) cases. Using shotgun proteomics and ChIP sequencing, we demonstrate that leukodystrophy-causative mutations, but not TCS mutations, in POLR1C impair assembly and nuclear import of POLR3, but not POLR1, leading to decreased binding to POLR3 target genes. This study is the first to show that distinct mutations in a gene coding for a shared subunit of two RNA polymerases lead to selective modification of the enzymes' availability leading to two different clinical conditions and to shed some light on the pathophysiological mechanism of one of the most common hypomyelinating leukodystrophies, POLR3-related leukodystrophy.
|
46
|
Willet CE, Haase B, Charleston MA, Wade CM. Simple, rapid and accurate genotyping-by-sequencing from aligned whole genomes with ArrayMaker. Bioinformatics 2015; 31:599-601. [PMID: 25336502 DOI: 10.1093/bioinformatics/btu691] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/14/2022] Open
Abstract
SUMMARY Whole-genome sequencing has revolutionized the study of genetics. Genotyping-by-sequencing is now a viable method of genotyping, yet the bioinformatics involved can be daunting if not prohibitive for some laboratories. Here we present ArrayMaker, a user-friendly tool that extracts accurate single nucleotide polymorphism genotypes at pre-defined loci from whole-genome alignments and presents them in a standard genotyping format compatible with association analysis software and datasets genotyped on commercial array platforms. Using this tool, geneticists with only basic computing ability can genotype samples at any desired list of markers, facilitating genome-wide association analysis, fine mapping, candidate variant assessment, data sharing and compatibility of data sourced from multiple technologies.
AVAILABILITY AND IMPLEMENTATION ArrayMaker is licensed under The MIT License and can be freely obtained at https://github.com/cw2014/ArrayMaker/. The program is implemented in Perl and runs on Linux operating systems.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
CONTACT cali.willet@sydney.edu.au.
Affiliation(s)
- Cali E Willet
- Faculty of Veterinary Science and School of Information Technologies, University of Sydney, Sydney, New South Wales 2006, Australia
- Bianca Haase
- Faculty of Veterinary Science and School of Information Technologies, University of Sydney, Sydney, New South Wales 2006, Australia
- Michael A Charleston
- Faculty of Veterinary Science and School of Information Technologies, University of Sydney, Sydney, New South Wales 2006, Australia
- Claire M Wade
- Faculty of Veterinary Science and School of Information Technologies, University of Sydney, Sydney, New South Wales 2006, Australia
47
Roux PF, Marthey S, Djari A, Moroldo M, Esquerré D, Estellé J, Klopp C, Lagarrigue S, Demeure O. Comparison of whole-genome (13X) and capture (87X) resequencing methods for SNP and genotype callings. Anim Genet 2014; 46:82-6. [PMID: 25515399 DOI: 10.1111/age.12248] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Accepted: 10/03/2014] [Indexed: 12/30/2022]
Abstract
The number of polymorphisms identified with next-generation sequencing approaches depends directly on the sequencing depth and therefore on the experimental cost. Although higher levels of depth ensure more sensitive and more specific SNP calls, economic constraints limit the increase of depth for whole-genome resequencing (WGS). For this reason, capture resequencing is used for studies focusing on only some specific regions of the genome. However, several biases in capture resequencing are known to have a negative impact on the sensitivity of SNP detection. Within this framework, the aim of this study was to compare the accuracy of WGS and capture resequencing on SNP detection and genotype calling, which differ in terms of both sequencing depth and biases. Indeed, we have evaluated the SNP calling and genotyping accuracy in a WGS dataset (13X) and in a capture resequencing dataset (87X) performed on 11 individuals. The percentage of SNPs not identified due to a sevenfold sequencing depth decrease was estimated at 7.8% using a down-sampling procedure on the capture sequencing dataset. A comparison of the 87X capture sequencing dataset with the WGS dataset revealed that capture-related biases led to the loss of 5.2% of SNPs detected with WGS. Nevertheless, when considering the SNPs detected by both approaches, capture sequencing appears to achieve far better SNP genotyping, with about 4.4% of the WGS genotypes that can be considered as erroneous and even 10% focusing on heterozygous genotypes. In conclusion, WGS and capture deep sequencing can be considered equivalent strategies for SNP detection, as the rate of SNPs not identified because of a low sequencing depth in the former is quite similar to SNPs missed because of method biases of the latter. On the other hand, capture deep sequencing clearly appears more adapted for studies requiring great accuracy in genotyping.
Affiliation(s)
- P F Roux
- INRA, UMR1348 PEGASE, Saint-Gilles, F-35590, France; Agrocampus Ouest, UMR1348 PEGASE, Rennes, F-35000, France; Université Européenne de Bretagne, Rennes, France
48
Krug K, Popic S, Carpy A, Taumer C, Macek B. Construction and assessment of individualized proteogenomic databases for large-scale analysis of nonsynonymous single nucleotide variants. Proteomics 2014; 14:2699-708. [DOI: 10.1002/pmic.201400219] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 05/16/2014] [Revised: 08/02/2014] [Accepted: 09/19/2014] [Indexed: 01/08/2023]
Affiliation(s)
- Karsten Krug
- Proteome Center Tuebingen; University of Tuebingen; Germany
- Sasa Popic
- Proteome Center Tuebingen; University of Tuebingen; Germany
- Boris Macek
- Proteome Center Tuebingen; University of Tuebingen; Germany
49
Baes CF, Dolezal MA, Koltes JE, Bapst B, Fritz-Waters E, Jansen S, Flury C, Signer-Hasler H, Stricker C, Fernando R, Fries R, Moll J, Garrick DJ, Reecy JM, Gredler B. Evaluation of variant identification methods for whole genome sequencing data in dairy cattle. BMC Genomics 2014; 15:948. [PMID: 25361890 PMCID: PMC4289218 DOI: 10.1186/1471-2164-15-948] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Received: 06/02/2014] [Accepted: 10/14/2014] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Advances in human genomics have allowed unprecedented productivity in terms of algorithms, software, and literature available for translating raw next-generation sequence data into high-quality information. The challenges of variant identification in organisms with lower quality reference genomes are less well documented. We explored the consequences of commonly recommended preparatory steps and the effects of single and multi sample variant identification methods using four publicly available software applications (Platypus, HaplotypeCaller, Samtools and UnifiedGenotyper) on whole genome sequence data of 65 key ancestors of Swiss dairy cattle populations. Accuracy of calling next-generation sequence variants was assessed by comparison to the same loci from medium and high-density single nucleotide variant (SNV) arrays.
RESULTS The total number of SNVs identified varied by software and method, with single (multi) sample results ranging from 17.7 to 22.0 (16.9 to 22.0) million variants. Computing time varied considerably between software. Preparatory realignment of insertions and deletions and subsequent base quality score recalibration had only minor effects on the number and quality of SNVs identified by different software, but increased computing time considerably. Average concordance for single (multi) sample results with high-density chip data was 58.3% (87.0%) and average genotype concordance in correctly identified SNVs was 99.2% (99.2%) across software. The average quality of SNVs identified, measured as the ratio of transitions to transversions, was higher using single sample methods than multi sample methods. A consensus approach using results of different software generally provided the highest variant quality in terms of transition/transversion ratio.
CONCLUSIONS Our findings serve as a reference for variant identification pipeline development in non-human organisms and help assess the implication of preparatory steps in next-generation sequencing pipelines for organisms with incomplete reference genomes (pipeline code is included). Benchmarking this information should prove particularly useful in processing next-generation sequencing data for use in genome-wide association studies and genomic selection.
Affiliation(s)
- Christine F Baes
- Bern University of Applied Sciences, School of Agricultural, Forest and Food Sciences HAFL, Länggasse 85, CH-3052 Zollikofen, Switzerland.
50
Pettengill JB, Luo Y, Davis S, Chen Y, Gonzalez-Escalona N, Ottesen A, Rand H, Allard MW, Strain E. An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella. PeerJ 2014; 2:e620. [PMID: 25332847 PMCID: PMC4201946 DOI: 10.7717/peerj.620] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Received: 04/16/2014] [Accepted: 09/23/2014] [Indexed: 11/20/2022] Open
Abstract
Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism) matrix (reference-based and reference-free), and (3) phylogenetic inference method (FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole genome sequences representing 107 unique Salmonella enterica subsp. enterica ser. Montevideo strains. Reference-based approaches for identifying SNPs produced trees that were significantly more similar to one another than those produced under the reference-free approach. Topologies inferred using a core matrix (i.e., no missing data) were significantly more discordant than those inferred using a non-core matrix that allows for some missing data. However, allowing for too much missing data likely results in a high false discovery rate of SNPs. When analyzing the same SNP matrix, we observed that the more thorough inference methods implemented in GARLI and RAxML produced more similar topologies than FastTreeMP. Our results also confirm that reproducibility varies among NGS platforms where the MiSeq had the lowest number of pairwise differences among replicate runs. Our investigation into the robustness of clustering patterns illustrates the importance of carefully considering how data from different platforms are combined and analyzed. We found clear differences in the topologies inferred, and certain methods performed significantly better than others for discriminating between the highly clonal organisms investigated here. 
The methods supported by our results represent a preliminary set of guidelines and a step towards developing validated standards for clustering based on whole genome sequence data.
Affiliation(s)
- James B Pettengill
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Yan Luo
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Steven Davis
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Yi Chen
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Narjol Gonzalez-Escalona
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Andrea Ottesen
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Hugh Rand
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Marc W Allard
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA
- Errol Strain
- Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA