1. Betschart RO, Riccio C, Aguilera-Garcia D, Blankenberg S, Guo L, Moch H, Seidl D, Solleder H, Thalén F, Thiéry A, Twerenbold R, Zeller T, Zoche M, Ziegler A. Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control. Biom J 2024; 66:e202300278. [PMID: 38988195 DOI: 10.1002/bimj.202300278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 03/21/2024] [Accepted: 05/14/2024] [Indexed: 07/12/2024]
Abstract
Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before performing association analysis between phenotypes and genotypes, preprocessing and quality control (QC) of the raw sequence data need to be performed. Because many biostatisticians have not yet worked with WGS data, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics, which are applied to WGS data: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with the data from the GENEtic SequencIng Study Hamburg-Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR-free protocol. For QC, one Genome in a Bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN original read archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time was linear in genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.
Affiliation(s)
- Domingo Aguilera-Garcia
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Stefan Blankenberg
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Linlin Guo
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Holger Moch
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Dagmar Seidl
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Hugo Solleder
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Felix Thalén
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Raphael Twerenbold
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- German Center for Cardiovascular Research (DZHK), partner site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Tanja Zeller
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- German Center for Cardiovascular Research (DZHK), partner site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Martin Zoche
- Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Andreas Ziegler
- Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa
2. Frontanilla TS, Valle-Silva G, Ayala J, Mendes-Junior CT. Open-Access Worldwide Population STR Database Constructed Using High-Coverage Massively Parallel Sequencing Data Obtained from the 1000 Genomes Project. Genes (Basel) 2022; 13:2205. [PMID: 36553472 PMCID: PMC9778533 DOI: 10.3390/genes13122205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2022] [Revised: 11/13/2022] [Accepted: 11/21/2022] [Indexed: 11/27/2022] [Open Access]
Abstract
Achieving accurate STR genotyping by using next-generation sequencing data has been challenging. To provide the forensic genetics community with a reliable open-access STR database, we conducted a comprehensive genotyping analysis of a set of STRs of broad forensic interest obtained from 1000 Genomes Project populations. We analyzed 22 STR markers using files of the high-coverage dataset of Phase 3 of the 1000 Genomes Project. We used HipSTR to call genotypes from 2504 samples obtained from 26 populations. We were not able to detect the D21S11 marker. The Hardy-Weinberg equilibrium analysis coupled with a comprehensive analysis of allele frequencies revealed that HipSTR was not able to identify longer alleles, which resulted in heterozygote deficiency. Nevertheless, AMOVA, a clustering analysis that uses STRUCTURE, and a Principal Coordinates Analysis showed a clear-cut separation between the four major ancestries sampled by the 1000 Genomes Consortium. Except for larger Penta D and Penta E alleles, and two very small Penta D alleles (2.2 and 3.2) usually observed in African populations, our analyses revealed that allele frequencies and genotypes offered as an open-access database are consistent and reliable.
Affiliation(s)
- Tamara Soledad Frontanilla
- Departamento de Genética, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto 14049-900, SP, Brazil
- Guilherme Valle-Silva
- Departamento de Química, Laboratório de Pesquisas Forenses e Genômicas, Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto 14040-901, SP, Brazil
- Jesus Ayala
- Facultad de Ingeniería Informática, Universidad de la Integración de las Americas, Asunción 00120-6, Paraguay
- Celso Teixeira Mendes-Junior
- Departamento de Química, Laboratório de Pesquisas Forenses e Genômicas, Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto 14040-901, SP, Brazil
3. Huttener R, Thorrez L, Veld TI, Granvik M, Van Lommel L, Waelkens E, Derua R, Lemaire K, Goyvaerts L, De Coster S, Buyse J, Schuit F. Sequencing refractory regions in bird genomes are hotspots for accelerated protein evolution. BMC Ecol Evol 2021; 21:176. [PMID: 34537008 PMCID: PMC8449477 DOI: 10.1186/s12862-021-01905-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/16/2020] [Accepted: 08/31/2021] [Indexed: 11/29/2022] [Open Access]
Abstract
Background: Approximately 1000 protein encoding genes common for vertebrates are still unannotated in avian genomes. Are these genes evolutionarily lost, or have they not yet been found for technical reasons? Using genome landscapes as a tool to visualize large-scale regional effects of genome evolution, we reexamined this question. Results: On the basis of gene annotation in non-avian vertebrate genomes, we established a list of 15,135 common vertebrate genes. Of these, 1026 were not found in any of eight examined bird genomes. Visualizing regional genome effects by our sliding window approach showed that the majority of these "missing" genes can be clustered to 14 regions of the human reference genome. In these clusters, an additional 1517 genes (often gene fragments) were underrepresented in bird genomes. The clusters of "missing" genes coincided with regions of very high GC content, particularly in avian genomes, making them "hidden" because of incomplete sequencing. Moreover, proteins encoded by genes in these sequencing refractory regions showed signs of accelerated protein evolution. As a proof of principle for this idea, we experimentally characterized the mRNA and protein products of four "hidden" bird genes that are crucial for energy homeostasis in skeletal muscle: ALDOA, ENO3, PYGM and SLC2A4. Conclusions: At least part of the "missing" genes in bird genomes can be attributed to an artifact caused by the difficulty of sequencing regions with extreme GC% ("hidden" genes). Biologically, these "hidden" genes are of interest as they encode proteins that evolve more rapidly than the genome-wide average. Finally, we show that four of these "hidden" genes encode key proteins for energy metabolism in flight muscle. Supplementary Information: The online version contains supplementary material available at 10.1186/s12862-021-01905-7.
Affiliation(s)
- R Huttener
- Gene Expression Unit, Department of Cellular and Molecular Medicine, KU Leuven, Herestraat 49, O&N1, bus 901, 3000, Leuven, Belgium
- L Thorrez
- Gene Expression Unit, Department of Cellular and Molecular Medicine, KU Leuven, Herestraat 49, O&N1, bus 901, 3000, Leuven, Belgium
- Tissue Engineering Laboratory, Department of Development and Regeneration, KU Leuven Campus Kulak, Kortrijk, Belgium
- T In't Veld
- Gene Expression Unit, Department of Cellular and Molecular Medicine, KU Leuven, Herestraat 49, O&N1, bus 901, 3000, Leuven, Belgium
- M Granvik
- Gene Expression Unit, Department of Cellular and Molecular Medicine, KU Leuven, Herestraat 49, O&N1, bus 901, 3000, Leuven, Belgium
- L Van Lommel
- Gene Expression Unit, Department of Cellular and Molecular Medicine, KU Leuven, Herestraat 49, O&N1, bus 901, 3000, Leuven, Belgium
- E Waelkens
- Laboratory of Protein Phosphorylation and Proteomics, KU Leuven, Leuven, Belgium
- R Derua
- Laboratory of Protein Phosphorylation and Proteomics, KU Leuven, Leuven, Belgium
- K Lemaire
- Gene Expression Unit, Department of Cellular and Molecular Medicine, KU Leuven, Herestraat 49, O&N1, bus 901, 3000, Leuven, Belgium
- L Goyvaerts
- Gene Expression Unit, Department of Cellular and Molecular Medicine, KU Leuven, Herestraat 49, O&N1, bus 901, 3000, Leuven, Belgium
- S De Coster
- Gene Expression Unit, Department of Cellular and Molecular Medicine, KU Leuven, Herestraat 49, O&N1, bus 901, 3000, Leuven, Belgium
- J Buyse
- Laboratory of Livestock Physiology, Department of Biosystems, KU Leuven, Leuven, Belgium
- F Schuit
- Gene Expression Unit, Department of Cellular and Molecular Medicine, KU Leuven, Herestraat 49, O&N1, bus 901, 3000, Leuven, Belgium.
4. Marina H, Chitneedi P, Pelayo R, Suárez-Vega A, Esteban-Blanco C, Gutiérrez-Gil B, Arranz JJ. Study on the concordance between different SNP-genotyping platforms in sheep. Anim Genet 2021; 52:868-880. [PMID: 34515357 DOI: 10.1111/age.13139] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Accepted: 08/28/2021] [Indexed: 12/12/2022]
Abstract
Different SNP genotyping technologies are commonly used in multiple studies to perform QTL detection, genotype imputation, and genomic predictions. Therefore, genotyping errors cannot be ignored, as they can reduce the accuracy of procedures applied in genomic selection, such as genotype imputation and genomic prediction, and can produce false-positive results in genome-wide association studies. Currently, whole-genome resequencing (WGR) also offers the potential for variant calling analysis and high-throughput genotyping. WGR might overshadow array-based genotyping technologies due to the larger amount and precision of the genomic information provided; however, its comparatively higher price per individual still limits its use in larger populations. Thus, the objective of this work was to evaluate the accuracy of the two most popular SNP-chip technologies, namely, Affymetrix and Illumina, for high-throughput genotyping in sheep, using high-coverage WGR datasets as references. Analyses were performed using two reference sheep genome assemblies, the popular Oar_v3.1 reference genome and the latest available version, Oar_rambouillet_v1.0. Our results suggest that the genotypes from both platforms have high concordance rates with the genotypes determined from reference WGR datasets (96.59% and 99.51% for Affymetrix and Illumina technologies, respectively). The concordance results provided in the current study can pinpoint markers with low reproducibility across the multiple platforms used for sheep genotyping. Comparing results using two reference genome assemblies also shows how genome assembly quality can influence genotype concordance rates among different genotyping platforms. Moreover, we describe an efficient pipeline to test the reliability of markers included in sheep SNP-chip panels against WGR datasets available in public databases. This pipeline may be helpful for discarding low-reliability markers before exploiting genomic information for gene mapping analyses or genomic prediction.
Affiliation(s)
- H Marina
- Departamento de Producción Animal, Facultad de Veterinaria, Universidad de León, Campus de Vegazana s/n, León, 24071, Spain
- P Chitneedi
- Departamento de Producción Animal, Facultad de Veterinaria, Universidad de León, Campus de Vegazana s/n, León, 24071, Spain
- R Pelayo
- Departamento de Producción Animal, Facultad de Veterinaria, Universidad de León, Campus de Vegazana s/n, León, 24071, Spain
- A Suárez-Vega
- Departamento de Producción Animal, Facultad de Veterinaria, Universidad de León, Campus de Vegazana s/n, León, 24071, Spain
- C Esteban-Blanco
- Departamento de Producción Animal, Facultad de Veterinaria, Universidad de León, Campus de Vegazana s/n, León, 24071, Spain
- B Gutiérrez-Gil
- Departamento de Producción Animal, Facultad de Veterinaria, Universidad de León, Campus de Vegazana s/n, León, 24071, Spain
- J J Arranz
- Departamento de Producción Animal, Facultad de Veterinaria, Universidad de León, Campus de Vegazana s/n, León, 24071, Spain
5.
Abstract
Tumour formation involves random mutagenic events and positive evolutionary selection acting on a subset of such events, referred to as driver mutations. A decade of careful surveying of tumour DNA using exome-based analyses has revealed a multitude of protein-coding somatic driver mutations, some of which are clinically actionable. Today, a transition towards whole-genome analysis is well under way, technically enabling the discovery of potential driver mutations occurring outside protein-coding sequences. Mutations are abundant in this vast non-coding space, which is more than 50 times larger than the coding exome, but reliable identification of selection signals in non-coding DNA remains a challenge. In this Review, we discuss recent findings in the field, where the emerging landscape is one in which non-coding driver mutations appear to be relatively infrequent. Nevertheless, we highlight several notable discoveries. We consider possible reasons for the relative absence of non-coding driver events, as well as the difficulties associated with detecting signals of positive selection in non-coding DNA.
Affiliation(s)
- Kerryn Elliott
- Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, Sahlgrenska Academy at University of Gothenburg, Gothenburg, Sweden
- Erik Larsson
- Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, Sahlgrenska Academy at University of Gothenburg, Gothenburg, Sweden.
6. Mukhtar M, Sargazi S, Barani M, Madry H, Rahdar A, Cucchiarini M. Application of Nanotechnology for Sensitive Detection of Low-Abundance Single-Nucleotide Variations in Genomic DNA: A Review. Nanomaterials (Basel, Switzerland) 2021; 11:1384. [PMID: 34073904 PMCID: PMC8225127 DOI: 10.3390/nano11061384] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 04/27/2021] [Revised: 05/20/2021] [Accepted: 05/21/2021] [Indexed: 01/02/2023]
Abstract
Single-nucleotide polymorphisms (SNPs) are the simplest and most common type of DNA variation in the human genome. This class of attractive genetic markers, along with point mutations, has been associated with the risk of developing a wide range of diseases, including cancer, cardiovascular diseases, autoimmune diseases, and neurodegenerative diseases. Several existing methods to detect SNPs and mutations in body fluids have faced limitations. Therefore, there is a need to focus on developing noninvasive future polymerase chain reaction (PCR)-free tools to detect low-abundance SNPs in such specimens. The detection of small concentrations of SNPs in the presence of a large background of wild-type genes is the biggest hurdle. Hence, the screening and detection of SNPs need efficient and straightforward strategies. Suitable amplification methods are being explored to avoid high-throughput settings and laborious efforts. Therefore, currently, DNA sensing methods are being explored for the ultrasensitive detection of SNPs based on the concept of nanotechnology. Owing to their small size and improved surface area, nanomaterials hold the extensive capacity to be used as biosensors in the genotyping and highly sensitive recognition of single-base mismatch in the presence of incomparable wild-type DNA fragments. Different nanomaterials have been combined with imaging and sensing techniques and amplification methods to facilitate the less time-consuming and easy detection of SNPs in different diseases. This review aims to highlight some of the most recent findings on the aspects of nanotechnology-based SNP sensing methods used for the specific and ultrasensitive detection of low-concentration SNPs and rare mutations.
Affiliation(s)
- Mahwash Mukhtar
- Faculty of Pharmacy, Institute of Pharmaceutical Technology and Regulatory Affairs, University of Szeged, 6720 Szeged, Hungary
- Saman Sargazi
- Cellular and Molecular Research Center, Resistant Tuberculosis Institute, Zahedan University of Medical Sciences, Zahedan 98167-43463, Iran
- Mahmood Barani
- Department of Chemistry, Shahid Bahonar University of Kerman, Kerman 76169-14111, Iran
- Henning Madry
- Center of Experimental Orthopaedics, Saarland University Medical Center, D-66421 Homburg/Saar, Germany
- Abbas Rahdar
- Department of Physics, Faculty of Science, University of Zabol, Zabol 538-98615, Iran
- Magali Cucchiarini
- Center of Experimental Orthopaedics, Saarland University Medical Center, D-66421 Homburg/Saar, Germany
7. Upton BA, Díaz NM, Gordon SA, Van Gelder RN, Buhr ED, Lang RA. Evolutionary Constraint on Visual and Nonvisual Mammalian Opsins. J Biol Rhythms 2021; 36:109-126. [PMID: 33765865 PMCID: PMC8058843 DOI: 10.1177/0748730421999870] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Indexed: 12/12/2022]
Abstract
Animals have evolved light-sensitive G protein-coupled receptors, known as opsins, to detect coherent and ambient light for visual and nonvisual functions. These opsins have evolved to satisfy the particular lighting niches of the organisms that express them. While many unique patterns of evolution have been identified in mammals for rod and cone opsins, far less is known about the atypical mammalian opsins. Using genomic data from over 400 mammalian species from 22 orders, unique patterns of evolution for each mammalian opsin were identified, including photoisomerases, RGR-opsin (RGR) and peropsin (RRH), as well as atypical opsins, encephalopsin (OPN3), melanopsin (OPN4), and neuropsin (OPN5). The results demonstrate that OPN5 and rhodopsin show extreme conservation across all mammalian lineages. The cone opsins, SWS1 and LWS, and the nonvisual opsins, OPN3 and RRH, demonstrate a moderate degree of sequence conservation relative to other opsins, with some instances of lineage-specific gene loss. Finally, the photoisomerase, RGR, and the best-studied atypical opsin, OPN4, have high sequence diversity within mammals. These conservation patterns are maintained in human populations. Importantly, all mammalian opsins retain key amino acid residues important for conjugation to retinal-based chromophores, permitting light sensitivity. These patterns of evolution are discussed along with known functions of each atypical opsin, such as in circadian or metabolic physiology, to provide insight into the observed patterns of evolutionary constraint.
Affiliation(s)
- Brian A. Upton
- Visual Systems Group, Abrahamson Pediatric Eye Institute, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
- Center for Chronobiology, Division of Pediatric Ophthalmology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
- Molecular & Developmental Biology Graduate Program, University of Cincinnati College of Medicine, Cincinnati, Ohio
- Medical Scientist Training Program, University of Cincinnati College of Medicine, Cincinnati, Ohio
- Nicolás M. Díaz
- Department of Ophthalmology, University of Washington School of Medicine, Seattle, Washington
- Shannon A. Gordon
- Department of Ophthalmology, University of Washington School of Medicine, Seattle, Washington
- Russell N. Van Gelder
- Department of Ophthalmology, University of Washington School of Medicine, Seattle, Washington
- Departments of Biological Structure and Laboratory Medicine and Pathology, University of Washington School of Medicine, Seattle, Washington
- Ethan D. Buhr
- Department of Ophthalmology, University of Washington School of Medicine, Seattle, Washington
- Richard A. Lang
- Visual Systems Group, Abrahamson Pediatric Eye Institute, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
- Center for Chronobiology, Division of Pediatric Ophthalmology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
- Division of Developmental Biology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
- Department of Ophthalmology, University of Cincinnati College of Medicine, Cincinnati, Ohio
8. Wylezich C, Caccio SM, Walochnik J, Beer M, Höper D. Untargeted metagenomics shows a reliable performance for synchronous detection of parasites. Parasitol Res 2020; 119:2623-2629. [PMID: 32591865 PMCID: PMC7366571 DOI: 10.1007/s00436-020-06754-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 02/19/2020] [Accepted: 06/03/2020] [Indexed: 12/17/2022]
Abstract
Shotgun metagenomics with high-throughput sequencing (HTS) techniques is increasingly used for pathogen identification and characterization. While many studies apply targeted amplicon sequencing, here we used untargeted metagenomics to simultaneously identify protists and helminths in pre-diagnosed faecal and tissue samples. The approach starts from RNA and operates without an amplification step, therefore allowing the detection of all eukaryotes, including pathogens, since it circumvents the bias typically observed in amplicon-based HTS approaches. The generated metagenomics datasets were analysed using the RIEMS tool for initial taxonomic read assignment. Mapping analyses against ribosomal reference sequences were subsequently applied to extract 18S rRNA sequences abundantly present in the sequence datasets. The original diagnosis, which was based on microscopy and/or PCR, could be confirmed in nearly all cases using ribosomal RNA metagenomics. In addition to the pre-diagnosed taxa, we detected other intestinal eukaryotic parasites of uncertain pathogenicity (of the genera Dientamoeba, Entamoeba, Endolimax, Hymenolepis) that are often excluded from routine diagnostic protocols. The study clearly demonstrates the applicability of untargeted RNA metagenomics for the parallel detection of parasites.
Affiliation(s)
- Claudia Wylezich
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Südufer 10, 17493, Greifswald-Insel Riems, Germany.
- Simone M Caccio
- Department of Infectious Diseases, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161, Rome, Italy
- Julia Walochnik
- Molecular Parasitology, Institute for Specific Prophylaxis and Tropical Medicine, Medical University of Vienna, Kinderspitalgasse 15, 1090, Vienna, Austria
- Martin Beer
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Südufer 10, 17493, Greifswald-Insel Riems, Germany
- Dirk Höper
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Südufer 10, 17493, Greifswald-Insel Riems, Germany
9. Ghaoui R, Needham M. Investigation of hereditary muscle disorders in the genomic era. Advances in Clinical Neuroscience & Rehabilitation 2020. [DOI: 10.47795/ayyz8676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] [Open Access]
10. Li J, Jew B, Zhan L, Hwang S, Coppola G, Freimer NB, Sul JH. ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest. PLoS Comput Biol 2019; 15:e1007556. [PMID: 31851693 PMCID: PMC6938691 DOI: 10.1371/journal.pcbi.1007556] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 06/29/2019] [Revised: 01/01/2020] [Accepted: 11/21/2019] [Indexed: 12/30/2022] [Open Access]
Abstract
Next-generation sequencing technology (NGS) enables the discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in NGS or variant callers. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present ForestQC, a statistical tool for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our software uses the information on sequencing quality, such as sequencing depth, genotyping quality, and GC contents, to predict whether a particular variant is likely to be false-positive. To evaluate ForestQC, we applied it to two whole-genome sequencing datasets where one dataset consists of related individuals from families while the other consists of unrelated individuals. Results indicate that ForestQC outperforms widely used methods for performing quality control on variants such as VQSR of GATK by considerably improving the quality of variants to be included in the analysis. ForestQC is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is a practical approach to perform quality control on genetic variants from sequencing data.
Affiliation(s)
- Jiajin Li
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, United States of America
- Brandon Jew
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, CA, United States of America
- Lingyu Zhan
- Molecular Biology Institute, University of California, Los Angeles, Los Angeles, CA, United States of America
- Sungoo Hwang
- Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, Los Angeles, CA, United States of America
- Giovanni Coppola
- Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, Los Angeles, CA, United States of America
- Nelson B. Freimer
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, United States of America
- Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, Los Angeles, CA, United States of America
- Jae Hoon Sul
- Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, Los Angeles, CA, United States of America
11. Hu T, Kruszka P, Martinez AF, Ming JE, Shabason EK, Raam MS, Shaikh TH, Pineda-Alvarez DE, Muenke M. Cytogenetics and holoprosencephaly: A chromosomal microarray study of 222 individuals with holoprosencephaly. American Journal of Medical Genetics Part C: Seminars in Medical Genetics 2019; 178:175-186. [PMID: 30182442 DOI: 10.1002/ajmg.c.31622] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 04/06/2018] [Revised: 04/18/2018] [Accepted: 04/20/2018] [Indexed: 11/08/2022]
Abstract
Holoprosencephaly (HPE), a common developmental forebrain malformation, is characterized by failure of the cerebrum to completely divide into left and right hemispheres. The etiology of HPE is heterogeneous and a number of environmental and genetic factors have been identified. Cytogenetically visible alterations occur in 25% to 45% of HPE patients and cytogenetic techniques have long been used to study copy number variants (CNVs) in this disorder. The karyotype approach initially demonstrated several recurrent chromosomal anomalies, which led to the identification of HPE-specific loci and, eventually, several major HPE genes. More recently, higher-resolution cytogenetic techniques such as subtelomeric multiplex ligation-dependent probe amplification and chromosomal microarray have been used to analyze chromosomal anomalies. By using chromosomal microarray, we sought to identify submicroscopic chromosomal deletions and duplications in patients with HPE. In an analysis of 222 individuals with HPE, a deletion or duplication was detected in 107 individuals. Of these 107 individuals, 23 (21%) had variants that were classified as pathogenic or likely pathogenic by board-certified medical geneticists. We identified multiple patients with deletions in established HPE loci as well as three patients with deletions encompassed by 6q12-q14.3, a CNV previously reported by Bendavid et al. In addition, we identified a new locus, 16p13.2 that warrants further investigation for HPE association. Incidentally, we also found a case of Potocki-Lupski syndrome, a case of Phelan-McDermid syndrome, and multiple cases of 22q11.2 deletion syndrome within our cohort. These data confirm the genetically heterogeneous nature of HPE, and also demonstrate clinical utility of chromosomal microarray in diagnosing patients affected by HPE.
Affiliation(s)
- Tommy Hu: Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
- Paul Kruszka: Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
- Ariel F Martinez: Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
- Jeffrey E Ming: Division of Human Genetics, The Children's Hospital of Philadelphia, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania
- Emily K Shabason: Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland; Division of Developmental and Behavioral Pediatrics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania
- Manu S Raam: Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland; General Pediatrics Services Shriners for Children Medical Center, Pasadena, California; General Pediatrics Services Children's Hospital Los Angeles, Los Angeles, California
- Tamim H Shaikh: Department of Pediatrics, University of Colorado School of Medicine, Aurora, Colorado; Invitae Corporation, San Francisco, California
- Daniel E Pineda-Alvarez: Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland; Division of Developmental and Behavioral Pediatrics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania
- Maximilian Muenke: Medical Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
|
12
|
Zwane AA, Schnabel RD, Hoff J, Choudhury A, Makgahlela ML, Maiwashe A, Van Marle-Koster E, Taylor JF. Genome-Wide SNP Discovery in Indigenous Cattle Breeds of South Africa. Front Genet 2019; 10:273. [PMID: 30988672 PMCID: PMC6452414 DOI: 10.3389/fgene.2019.00273] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 08/17/2018] [Accepted: 03/12/2019] [Indexed: 01/30/2023]
Abstract
Single nucleotide polymorphism arrays have created new possibilities for performing genome-wide studies to detect genomic regions harboring sequence variants that affect complex traits. However, the majority of validated SNPs for which allele frequencies have been estimated are limited primarily to European breeds. The objective of this study was to perform SNP discovery in three South African indigenous breeds (Afrikaner, Drakensberger, and Nguni) using whole genome sequencing. DNA was extracted from blood and hair samples, quantified and prepared at 50 ng/μl concentration for sequencing at the Agricultural Research Council Biotechnology Platform using an Illumina HiSeq 2500. The fastq files were used to call the variants using the Genome Analysis Toolkit. A total of 1,678,360 SNPs were identified as novel using Run 6 of the 1000 Bull Genomes Project. Annotation of the identified variants classified them into functional categories. Within the coding regions, about 30% of the SNPs were non-synonymous substitutions that encode alternate amino acids. Analysis of the distribution of SNPs across the genome identified regions showing notable differences in SNP density among the breeds and highlighted many regions of functional significance. Gene ontology terms identified genes such as MLANA, SYT10, and CDC42EP5 that have been associated with coat color in mice, and the ADAMS3, DNAJC3, and PAG5 genes, which have been associated with fertility in cattle. Further analysis of the variants detected 688 candidate selective sweeps (ZHp Z-scores ≤ -4) across all three breeds, of which 223 regions were assigned as being putative selective sweeps (ZHp Z-scores ≤ -5). We also identified 96 regions with extremely low ZHp Z-scores (≤ -6) in Afrikaner and Nguni. Genes such as KIT and MITF, which have been associated with skin pigmentation in cattle, and CACNA1C, which has been associated with bipolar disorder in humans, were identified in these regions.
This study provides the first analysis of sequence data to discover SNPs in indigenous South African cattle breeds. The information will play an important role in our efforts to understand the genetic history of our cattle and in designing appropriate breed improvement programmes.
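The sweep thresholds quoted in the abstract above (ZHp Z-scores ≤ -4, ≤ -5, ≤ -6) can be illustrated with a small sketch. The abstract does not state how ZHp was computed, so the code below assumes the commonly used pooled-heterozygosity definition, Hp = 2·ΣnMAJ·ΣnMIN / (ΣnMAJ + ΣnMIN)², Z-transformed across windows. The function names, the use of the population standard deviation, and the toy data are all illustrative assumptions, not the authors' pipeline.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# (major-allele count, minor-allele count) observed at one SNP; hypothetical input layout
AlleleCounts = Tuple[int, int]


def pooled_heterozygosity(window: Sequence[AlleleCounts]) -> float:
    """Hp for one window: 2 * sum(n_maj) * sum(n_min) / (sum(n_maj) + sum(n_min))^2."""
    n_maj = sum(maj for maj, _ in window)
    n_min = sum(mnr for _, mnr in window)
    total = n_maj + n_min
    if total == 0:
        raise ValueError("window contains no allele observations")
    return 2.0 * n_maj * n_min / (total ** 2)


def zhp_scores(windows: Sequence[Sequence[AlleleCounts]]) -> List[float]:
    """Z-transform Hp across all windows (population standard deviation assumed)."""
    hp = [pooled_heterozygosity(w) for w in windows]
    mu, sigma = mean(hp), pstdev(hp)
    if sigma == 0:
        raise ValueError("all windows have identical Hp; Z-scores are undefined")
    return [(h - mu) / sigma for h in hp]


def candidate_sweeps(z: Sequence[float], threshold: float = -4.0) -> List[int]:
    """Indices of windows whose ZHp is at or below the threshold."""
    return [i for i, score in enumerate(z) if score <= threshold]


# Toy data: four balanced windows and one nearly fixed window.
toy_windows = [[(50, 50)]] * 4 + [[(99, 1)]]
toy_z = zhp_scores(toy_windows)
# With only five windows, |Z| cannot reach 4, so a toy threshold is used here.
toy_hits = candidate_sweeps(toy_z, threshold=-1.5)
```

On genome-wide data, the default threshold of -4 would correspond to the abstract's least stringent cut-off; stricter values (-5, -6) simply shrink the returned list.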
Affiliation(s)
- Avhashoni A. Zwane: Department of Animal Breeding and Genetics, Agricultural Research Council-Animal Production, Irene, South Africa; Department of Animal and Wildlife Sciences, University of Pretoria, Pretoria, South Africa
- Robert D. Schnabel: Division of Animal Sciences, University of Missouri, Columbia, MO, United States; Informatics Institute, University of Missouri, Columbia, MO, United States
- Jesse Hoff: Division of Animal Sciences, University of Missouri, Columbia, MO, United States
- Ananyo Choudhury: Sydney Brenner Institute of Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa
- Mahlako Linah Makgahlela: Department of Animal Breeding and Genetics, Agricultural Research Council-Animal Production, Irene, South Africa; Department of Animal, Wildlife and Grassland Sciences, University of the Free State, Bloemfontein, South Africa
- Azwihangwisi Maiwashe: Department of Animal Breeding and Genetics, Agricultural Research Council-Animal Production, Irene, South Africa; Department of Animal, Wildlife and Grassland Sciences, University of the Free State, Bloemfontein, South Africa
- Este Van Marle-Koster: Department of Animal and Wildlife Sciences, University of Pretoria, Pretoria, South Africa
- Jeremy F. Taylor: Division of Animal Sciences, University of Missouri, Columbia, MO, United States
|
13
|
Pan B, Kusko R, Xiao W, Zheng Y, Liu Z, Xiao C, Sakkiah S, Guo W, Gong P, Zhang C, Ge W, Shi L, Tong W, Hong H. Similarities and differences between variants called with human reference genome HG19 or HG38. BMC Bioinformatics 2019; 20:101. [PMID: 30871461 PMCID: PMC6419332 DOI: 10.1186/s12859-019-2620-0] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Indexed: 12/30/2022]
Abstract
Background: Reference genome selection is a prerequisite for successful analysis of next generation sequencing (NGS) data. Current practice employs one of the two most recent human reference genome versions: HG19 or HG38. To date, the impact of genome version on SNV identification has not been rigorously assessed. Methods: We conducted an analysis comparing the SNVs identified based on HG19 vs HG38, leveraging whole genome sequencing (WGS) data from the genome-in-a-bottle (GIAB) project. First, SNVs were called using 26 different bioinformatics pipelines with either HG19 or HG38. Next, two tools were used to convert the called SNVs between HG19 and HG38. Lastly, we calculated conversion rates, analyzed discordant rates between SNVs called with HG19 or HG38, and characterized the discordant SNVs. Results: The conversion rates from HG38 to HG19 (average 95%) were lower than the conversion rates from HG19 to HG38 (average 99%). The conversion rates varied slightly among the various calling pipelines. Around 1.5% of SNVs were discordantly converted between HG19 and HG38. The conversions from HG38 to HG19 had more SNVs that failed conversion and more discordant SNVs than the opposite conversion (HG19 to HG38). Most of the discordant SNVs had low read depth, were low-confidence SNVs as defined by GIAB, and/or were predominated by G/C alleles (52% observed versus 42% expected). Conclusion: A significant number of SNVs could not be converted between HG19 and HG38. Based on careful review of our comparisons, we recommend HG38 (the newer version) for NGS SNV analysis. To summarize, our findings suggest caution when translating identified SNVs between different versions of the human reference genome.
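The two quantities reported in this abstract, conversion rate and discordant rate, can be made concrete with a short sketch. The abstract does not give exact definitions, so the code assumes that the conversion rate is the fraction of source-build SNVs that a liftover tool maps to the target build, and that a converted SNV is discordant when it does not match any SNV called directly on the target build. The data layout, function name, and toy coordinates are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

# (chromosome, position, reference allele, alternate allele); hypothetical layout
Snv = Tuple[str, int, str, str]


def conversion_summary(
    source_calls: Set[Snv],
    lifted: Dict[Snv, Optional[Snv]],
    target_calls: Set[Snv],
) -> Dict[str, float]:
    """Summarize a build-to-build conversion.

    `lifted` maps each source SNV to its converted form, or None if the
    conversion failed. Both rate definitions are assumptions (see lead-in).
    """
    total = len(source_calls)
    converted = {
        snv: lifted[snv] for snv in source_calls if lifted.get(snv) is not None
    }
    discordant = [snv for snv, tgt in converted.items() if tgt not in target_calls]
    return {
        "conversion_rate": len(converted) / total if total else 0.0,
        "discordant_rate": len(discordant) / len(converted) if converted else 0.0,
    }


# Toy example with made-up coordinates: four SNVs called on the source build.
source = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 300, "G", "A"),
    ("chr2", 400, "T", "C"),
}
liftover_result = {
    ("chr1", 100, "A", "G"): ("chr1", 90, "A", "G"),
    ("chr1", 200, "C", "T"): ("chr1", 190, "C", "T"),
    ("chr2", 300, "G", "A"): ("chr2", 310, "G", "A"),  # converts, but has no matching target call
    ("chr2", 400, "T", "C"): None,                      # fails conversion
}
target = {("chr1", 90, "A", "G"), ("chr1", 190, "C", "T")}

summary = conversion_summary(source, liftover_result, target)
```

Running the same function in both directions (HG38 to HG19 and HG19 to HG38) would give the kind of asymmetric comparison the abstract describes.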
Affiliation(s)
- Bohu Pan: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Wenming Xiao: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Yuanting Zheng: Center for Pharmacogenomics, Fudan University, Shanghai, China
- Zhichao Liu: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Chunlin Xiao: National Center for Biotechnological Information, National Institutes of Health, Bethesda, MD, 20894, USA
- Sugunadevi Sakkiah: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Wenjing Guo: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Ping Gong: Environmental Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA
- Chaoyang Zhang: School of Computing, The University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Weigong Ge: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Leming Shi: Center for Pharmacogenomics, Fudan University, Shanghai, China
- Weida Tong: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Huixiao Hong: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
|
14
|
Cheng H, Yang X, Si H, Saleh AD, Xiao W, Coupar J, Gollin SM, Ferris RL, Issaeva N, Yarbrough WG, Prince ME, Carey TE, Van Waes C, Chen Z. Genomic and Transcriptomic Characterization Links Cell Lines with Aggressive Head and Neck Cancers. Cell Rep 2018; 25:1332-1345.e5. [PMID: 30380422 PMCID: PMC6280671 DOI: 10.1016/j.celrep.2018.10.007] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Received: 03/06/2018] [Revised: 08/28/2018] [Accepted: 09/28/2018] [Indexed: 12/12/2022]
Abstract
Cell lines are important tools for biological and preclinical investigation, and establishing their relationship to genomic alterations in tumors could accelerate functional and therapeutic discoveries. We conducted integrated analyses of genomic and transcriptomic profiles of 15 human papillomavirus (HPV)-negative and 11 HPV-positive head and neck squamous cell carcinoma (HNSCC) lines to compare with 279 tumors from The Cancer Genome Atlas (TCGA). We identified recurrent amplifications on chromosomes 3q22-29, 5p15, 11q13/22, and 8p11 that drive increased expression of more than 100 genes in cell lines and tumors. These alterations, together with loss or mutations of tumor suppressor genes, converge on important signaling pathways, recapitulating the genomic landscape of aggressive HNSCCs. Among these, concurrent 3q26.3 amplification and TP53 mutation in most HPV(-) cell lines reflect tumors with worse survival. Our findings elucidate and validate genomic alterations underpinning numerous discoveries made with HNSCC lines and provide valuable models for future studies.
Affiliation(s)
- Hui Cheng: Tumor Biology Section, Head and Neck Surgery Branch, National Institute on Deafness and Other Communication Disorders, NIH, Bethesda, MD 20892, USA
- Xinping Yang: Tumor Biology Section, Head and Neck Surgery Branch, National Institute on Deafness and Other Communication Disorders, NIH, Bethesda, MD 20892, USA
- Han Si: Translational Bioinformatics, MedImmune, Gaithersburg, MD 20878, USA
- Anthony D Saleh: Tumor Biology Section, Head and Neck Surgery Branch, National Institute on Deafness and Other Communication Disorders, NIH, Bethesda, MD 20892, USA
- Wenming Xiao: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA
- Jamie Coupar: Tumor Biology Section, Head and Neck Surgery Branch, National Institute on Deafness and Other Communication Disorders, NIH, Bethesda, MD 20892, USA
- Susanne M Gollin: Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Robert L Ferris: Division of Head and Neck Surgery, Departments of Otolaryngology, Radiation Oncology, and Immunology, University of Pittsburgh Cancer Institute, Pittsburgh, PA 15232, USA
- Natalia Issaeva: Department of Surgery, Division of Otolaryngology, Molecular Virology Research Program, Smilow Cancer Hospital, Yale Cancer Center, Yale University Medical School, New Haven, CT 06520, USA
- Wendell G Yarbrough: Department of Surgery, Division of Otolaryngology, Molecular Virology Research Program, Smilow Cancer Hospital, Yale Cancer Center, Yale University Medical School, New Haven, CT 06520, USA
- Mark E Prince: Cancer Biology Program, Program in the Biomedical Sciences, Rackham Graduate School, and the Department of Otolaryngology-Head and Neck Surgery, University of Michigan, Ann Arbor, MI 48109, USA
- Thomas E Carey: Cancer Biology Program, Program in the Biomedical Sciences, Rackham Graduate School, and the Department of Otolaryngology-Head and Neck Surgery, University of Michigan, Ann Arbor, MI 48109, USA
- Carter Van Waes: Tumor Biology Section, Head and Neck Surgery Branch, National Institute on Deafness and Other Communication Disorders, NIH, Bethesda, MD 20892, USA
- Zhong Chen: Tumor Biology Section, Head and Neck Surgery Branch, National Institute on Deafness and Other Communication Disorders, NIH, Bethesda, MD 20892, USA
|
15
|
Murgai AA, Jog MS. Can heterozygotes of autosomal recessive disorders have clinical manifestations? Mov Disord 2018; 33:1368-1369. [DOI: 10.1002/mds.27394] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/03/2018] [Revised: 03/07/2018] [Accepted: 03/11/2018] [Indexed: 11/10/2022]
Affiliation(s)
- Aditya A. Murgai: Department of Clinical Neurological Sciences, Western University, London, Ontario, Canada
- Mandar S. Jog: Department of Clinical Neurological Sciences, Western University, London, Ontario, Canada
|
16
|
Zhang G, Zhang T, Liu J, Zhang J, He C. Comprehensive analysis of differentially expressed genes reveals the molecular response to elevated CO2 levels in two sea buckthorn cultivars. Gene 2018; 660:120-127. [DOI: 10.1016/j.gene.2018.03.057] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 12/24/2017] [Revised: 03/05/2018] [Accepted: 03/16/2018] [Indexed: 01/08/2023]
|
17
|
Bronstein O, Kroh A, Haring E. Mind the gap! The mitochondrial control region and its power as a phylogenetic marker in echinoids. BMC Evol Biol 2018; 18:80. [PMID: 29848319 PMCID: PMC5977486 DOI: 10.1186/s12862-018-1198-x] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 02/07/2018] [Accepted: 05/18/2018] [Indexed: 12/30/2022]
Abstract
BACKGROUND In Metazoa, mitochondrial markers are the most commonly used targets for inferring species-level molecular phylogenies due to their extremely low rate of recombination, maternal inheritance, ease of use and fast substitution rate in comparison to nuclear DNA. The mitochondrial control region (CR) is the main non-coding area of the mitochondrial genome and contains the mitochondrial origin of replication and transcription. While sequences of the cytochrome oxidase subunit 1 (COI) and 16S rRNA genes are the prime mitochondrial markers in phylogenetic studies, the highly variable CR is typically ignored and not targeted in such analyses. However, the higher substitution rate of the CR can be harnessed to infer the phylogeny of closely related species, and the use of a non-coding region alleviates biases resulting from both directional and purifying selection. Additionally, complete mitochondrial genome assemblies utilizing next generation sequencing (NGS) data often show exceptionally low coverage at specific regions, including the CR. This can only be resolved by targeted sequencing of this region. RESULTS Here we provide novel sequence data for the echinoid mitochondrial control region in over 40 species across the echinoid phylogenetic tree. We demonstrate the advantages of directly targeting the CR and adjacent tRNAs to facilitate complementing low coverage NGS data from complete mitochondrial genome assemblies. Finally, we test the performance of this region as a phylogenetic marker both in the lab and in phylogenetic analyses, and demonstrate its superior performance over the other available mitochondrial markers in echinoids. 
CONCLUSIONS Our target region of the mitochondrial CR (1) facilitates the first thorough investigation of this region across a wide range of echinoid taxa, (2) provides a tool for complementing missing data in NGS experiments, and (3) identifies the CR as a powerful, novel marker for phylogenetic inference in echinoids due to its high variability, lack of selection, and high compatibility across the entire class, outperforming conventional mitochondrial markers.
Affiliation(s)
- Omri Bronstein: Natural History Museum Vienna, Geological-Palaeontological Department, 1010 Vienna, Austria; Natural History Museum Vienna, Central Research Laboratories, 1010 Vienna, Austria
- Andreas Kroh: Natural History Museum Vienna, Geological-Palaeontological Department, 1010 Vienna, Austria
- Elisabeth Haring: Natural History Museum Vienna, Central Research Laboratories, 1010 Vienna, Austria; Department of Integrative Zoology, University of Vienna, Vienna, Austria
|
18
|
Comparison of single cell sequencing data between two whole genome amplification methods on two sequencing platforms. Sci Rep 2018; 8:4963. [PMID: 29563514 PMCID: PMC5862989 DOI: 10.1038/s41598-018-23325-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 05/01/2017] [Accepted: 03/08/2018] [Indexed: 01/28/2023]
Abstract
Research based on a strategy of single-cell low-coverage whole genome sequencing (SLWGS) has enabled better reproducibility and accuracy for detection of copy number variations (CNVs). The whole genome amplification (WGA) method and sequencing platform are critical factors for successful SLWGS (<0.1 × coverage). In this study, we compared single cell and multiple cells sequencing data produced by the HiSeq2000 and Ion Proton platforms using two WGA kits and then comprehensively evaluated the GC-bias, reproducibility, uniformity and CNV detection among different experimental combinations. Our analysis demonstrated that the PicoPLEX WGA Kit resulted in higher reproducibility, lower sequencing error frequency but more GC-bias than the GenomePlex Single Cell WGA Kit (WGA4 kit) independent of the cell number on the HiSeq2000 platform. While on the Ion Proton platform, the WGA4 kit (both single cell and multiple cells) had higher uniformity and less GC-bias but lower reproducibility than those of the PicoPLEX WGA Kit. Moreover, on these two sequencing platforms, depending on cell number, the performance of the two WGA kits was different for both sensitivity and specificity on CNV detection. The results can help researchers who plan to use SLWGS on single or multiple cells to select appropriate experimental conditions for their applications.
|
19
|
Shea DJ, Shimizu M, Nishida N, Fukai E, Abe T, Fujimoto R, Okazaki K. IntroMap: a signal analysis based method for the detection of genomic introgressions. BMC Genet 2017; 18:101. [PMID: 29202713 PMCID: PMC5716257 DOI: 10.1186/s12863-017-0568-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 08/03/2017] [Accepted: 11/14/2017] [Indexed: 12/24/2022]
Abstract
BACKGROUND Breeding programs often rely on marker-assisted tests or variant calling of next generation sequence (NGS) data to identify regions of genomic introgression arising from the hybridization of two plant species. In this paper we present IntroMap, a bioinformatics pipeline for the screening of candidate plants through the application of signal processing techniques to NGS data, using alignment to a reference genome sequence (annotation is not required) that shares homology with the recurrent parental cultivar, and without the need for de novo assembly of the read data or variant calling. RESULTS We show the accurate identification of introgressed genomic regions using both in silico simulated genomes, and a hybridized cultivar data set using our pipeline. Additionally we show, through targeted marker-based assays, validation of the IntroMap predicted regions for the hybrid cultivar. CONCLUSIONS This approach can be used to automate the screening of large populations, reducing the time and labor required, and can improve the accuracy of the detection of introgressed regions in comparison to a marker-based approach. In contrast to other approaches that generally rely upon a variant calling step, our method achieves accurate identification of introgressed regions without variant calling, relying solely upon alignment.
Affiliation(s)
- Daniel J Shea: Laboratory of Plant Breeding, Graduate School of Science and Technology, Niigata University, Ikarashi-ninocho, Niigata, 950-2181, Japan
- Motoki Shimizu: Iwate Biotechnology Research Center, Narita, Kitakami, 024-0003, Japan
- Namiko Nishida: Graduate School of Agricultural Science, Kobe University, Rokkodai, Nada-ku, Kobe, 657-8501, Japan
- Eigo Fukai: Laboratory of Plant Breeding, Graduate School of Science and Technology, Niigata University, Ikarashi-ninocho, Niigata, 950-2181, Japan
- Takashi Abe: Department of Computer Science, Graduate School of Science and Technology, Niigata University, Ikarashi-ninocho, Niigata, 950-2181, Japan
- Ryo Fujimoto: Graduate School of Agricultural Science, Kobe University, Rokkodai, Nada-ku, Kobe, 657-8501, Japan
- Keiichi Okazaki: Laboratory of Plant Breeding, Graduate School of Science and Technology, Niigata University, Ikarashi-ninocho, Niigata, 950-2181, Japan
|
20
|
Huang Y, Yu F, Li X, Luo L, Wu J, Yang Y, Deng Z, Chen R, Zhang M. Comparative genetic analysis of the 45S rDNA intergenic spacers from three Saccharum species. PLoS One 2017; 12:e0183447. [PMID: 28817651 PMCID: PMC5560572 DOI: 10.1371/journal.pone.0183447] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 12/31/2016] [Accepted: 08/06/2017] [Indexed: 12/12/2022]
Abstract
The 45S ribosomal DNA (rDNA) units are separated by an intergenic spacer (IGS) containing the signals for transcription and processing of rRNAs. For the first time, we sequenced and analyzed the entire IGS region from three original species within the genus Saccharum: S. spontaneum, S. robustum, and S. officinarum. The IGS of these three species showed similar overall organization, comprising putative functional elements needed for rRNA gene activity as well as a non-transcribed spacer (NTS), a promoter region, and an external transcribed spacer (ETS). The variability in length of the IGS sequences was assessed at the individual, intraspecies, and interspecies levels. The ETS had greater similarity than the NTS across species, but nevertheless exhibited variation in length. Within the IGS of the Saccharum species, base substitutions and copy number variation of sub-repeats were causes of the divergence in IGS sequences. We also identified a significant number of methylation sites. Furthermore, fluorescent in situ hybridization (FISH) co-localization of IGS and pTa71 probes was detected in all representative species of the genus Saccharum tested. Taken together, the results of this study provide better insight into the structure and organization of the IGS in the genus Saccharum.
Affiliation(s)
- Yongji Huang: Key Lab of Sugarcane Biology and Genetic Breeding, Ministry of Agriculture, Fujian Agriculture and Forestry University, Fuzhou, China
- Fan Yu: Key Lab of Sugarcane Biology and Genetic Breeding, Ministry of Agriculture, Fujian Agriculture and Forestry University, Fuzhou, China
- Xueting Li: Key Lab of Sugarcane Biology and Genetic Breeding, Ministry of Agriculture, Fujian Agriculture and Forestry University, Fuzhou, China
- Ling Luo: Key Lab of Sugarcane Biology and Genetic Breeding, Ministry of Agriculture, Fujian Agriculture and Forestry University, Fuzhou, China
- Jiayun Wu: Guangdong Key Laboratory of Sugarcane Improvement and Biorefinery, Guangzhou, China; Guangdong Provincial Bioengineering Institute, Guangzhou Sugarcane Industry Research Institute, Guangzhou, China
- Yongqing Yang: Key Lab of Sugarcane Biology and Genetic Breeding, Ministry of Agriculture, Fujian Agriculture and Forestry University, Fuzhou, China
- Zuhu Deng: Key Lab of Sugarcane Biology and Genetic Breeding, Ministry of Agriculture, Fujian Agriculture and Forestry University, Fuzhou, China; Guangxi Collaborative Innovation Center of Sugar Industries, Guangxi University, Nanning, China
- Rukai Chen: Key Lab of Sugarcane Biology and Genetic Breeding, Ministry of Agriculture, Fujian Agriculture and Forestry University, Fuzhou, China
- Muqing Zhang: Guangxi Collaborative Innovation Center of Sugar Industries, Guangxi University, Nanning, China
|
21
|
Characterization of porcine simple sequence repeat variation on a population scale with genome resequencing data. Sci Rep 2017; 7:2376. [PMID: 28539617 PMCID: PMC5443785 DOI: 10.1038/s41598-017-02600-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 01/03/2017] [Accepted: 04/13/2017] [Indexed: 12/23/2022]
Abstract
Simple sequence repeats (SSRs) are used as polymorphic molecular markers in many species. They contribute very important functional variations in a range of complex traits; however, little is known about the variation of most SSRs in pig populations. Here, using genome resequencing data, we identified ~0.63 million polymorphic SSR loci from more than 100 individuals. Through intensive analysis of this dataset, we found that the SSR motif composition, motif length, total length of alleles and distribution of alleles all contribute to SSR variability. Furthermore, we found that CG-containing SSRs displayed significantly lower polymorphism and higher cross-species conservation. With a rigorous filter procedure, we provided a catalogue of 16,527 high-quality polymorphic SSRs, which displayed reliable results for the analysis of phylogenetic relationships and provided valuable summary statistics for 30 individuals equally selected from eight local Chinese pig breeds, six commercial lean pig breeds and Chinese wild boars. In addition, from the high-quality polymorphic SSR catalogue, we identified four loci with potential loss-of-function alleles. Overall, these analyses provide a valuable catalogue of polymorphic SSRs to the existing pig genetic variation database, and we believe this catalogue could be used for future genome-wide genetic analysis.
|
22
|
Grüll MP, Peña-Castillo L, Mulligan ME, Lang AS. Genome-wide identification and characterization of small RNAs in Rhodobacter capsulatus and identification of small RNAs affected by loss of the response regulator CtrA. RNA Biol 2017; 14:914-925. [PMID: 28296577 PMCID: PMC5546546 DOI: 10.1080/15476286.2017.1306175] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 10/30/2022]
Abstract
Small non-coding RNAs (sRNAs) are involved in the control of numerous cellular processes through various regulatory mechanisms, and in the past decade many studies have identified sRNAs in a multitude of bacterial species using RNA sequencing (RNA-seq). Here, we present the first genome-wide analysis of sRNA sequencing data in Rhodobacter capsulatus, a purple nonsulfur photosynthetic alphaproteobacterium. Using a recently developed bioinformatics approach, sRNA-Detect, we detected 422 putative sRNAs from R. capsulatus RNA-seq data. Based on their sequence similarity to sRNAs in a sRNA collection, consisting of published putative sRNAs from 23 additional bacterial species, and RNA databases, the sequences of 124 putative sRNAs were conserved in at least one other bacterial species; and, 19 putative sRNAs were assigned a predicted function. We bioinformatically characterized all putative sRNAs and applied machine learning approaches to calculate the probability of a nucleotide sequence to be a bona fide sRNA. The resulting quantitative model was able to correctly classify 95.2% of sequences in a validation set. We found that putative cis-targets for antisense and partially overlapping sRNAs were enriched with protein-coding genes involved in primary metabolic processes, photosynthesis, compound binding, and with genes forming part of macromolecular complexes. We performed differential expression analysis to compare the wild type strain to a mutant lacking the response regulator CtrA, an important regulator of gene expression in R. capsulatus, and identified 18 putative sRNAs with differing levels in the two strains. Finally, we validated the existence and expression patterns of four novel sRNAs by Northern blot analysis.
Affiliation(s)
- Marc P Grüll
- Department of Biology, Memorial University of Newfoundland, St. John's, NL, Canada
| | - Lourdes Peña-Castillo
- Department of Biology, Memorial University of Newfoundland, St. John's, NL, Canada; Department of Computer Science, Memorial University of Newfoundland, St. John's, NL, Canada
| | - Martin E Mulligan
- Department of Biochemistry, Memorial University of Newfoundland, St. John's, NL, Canada
| | - Andrew S Lang
- Department of Biology, Memorial University of Newfoundland, St. John's, NL, Canada
| |
|
23
|
Abstract
The vast majority of somatic variants in cancer genomes occur in non-coding regions. However, progress in cancer genomics in the past decade has been mostly focused on coding regions, largely due to the prohibitive cost of whole genome sequencing (WGS). Recent technological advances have decreased sequencing costs, leading to the acquisition of thousands of tumor whole genome sequences and, in turn, to a hunt for non-coding drivers. The best-characterized regulatory drivers are in the TERT promoter and have been identified in many cancer types. Despite the larger fraction of somatic variants occurring in non-coding regions, the number of non-coding drivers identified so far is much smaller than the number of coding region drivers. Here we discuss reasons that may hinder the detection of non-coding drivers. We also examine the relationship between non-coding genetic variation and epigenetic state in tumor cells and assert the need for additional epigenetic data sets as a prerequisite for understanding the rewiring of regulatory networks in cancer.
|
24
|
Castelli EC, Gerasimou P, Paz MA, Ramalho J, Porto IO, Lima TH, Souza AS, Veiga-Castelli LC, Collares CV, Donadi EA, Mendes-Junior CT, Costeas P. HLA-G variability and haplotypes detected by massively parallel sequencing procedures in the geographicaly distinct population samples of Brazil and Cyprus. Mol Immunol 2017; 83:115-126. [DOI: 10.1016/j.molimm.2017.01.020] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2016] [Revised: 01/18/2017] [Accepted: 01/20/2017] [Indexed: 12/11/2022]
|
25
|
Tran Q, Gao S, Phan V. Analysis of optimal alignments unfolds aligners' bias in existing variant profiles. BMC Bioinformatics 2016; 17:349. [PMID: 27766935 PMCID: PMC5073887 DOI: 10.1186/s12859-016-1216-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022] Open
Abstract
Efforts such as the International HapMap Project and the 1000 Genomes Project resulted in a catalog of millions of single-nucleotide and insertion/deletion (INDEL) variants in the human population. Viewed as a reference of existing variants, this resource commonly serves as a gold standard for studying and developing methods to detect genetic variants. Our analysis revealed that this reference contained thousands of INDELs that were constructed in a biased manner. This bias occurred at the level of aligning short reads to reference genomes to detect variants. The bias is caused by the existence of many theoretically optimal alignments between the reference genome and reads containing alternative alleles at those INDEL locations. We examined several popular aligners and showed that these aligners could be divided into groups whose alignments yielded INDELs that agreed strongly or disagreed strongly with reported INDELs. This finding suggests that the agreement or disagreement between the aligners’ called INDEL and the reported INDEL is merely a result of the arbitrary selection of one of the optimal alignments. The existence of bias in INDEL calling might have a serious influence on downstream analyses. As such, our finding suggests that this phenomenon should be further addressed.
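The point about many equally optimal alignments can be made concrete with a small sketch. The Python below is an illustration only, not the analysis used in the paper: the reference string and the helper name `equivalent_deletion_positions` are hypothetical. It lists every start position at which a deletion of the same length produces exactly the same alternate sequence, which is why two aligners can report the "same" INDEL at different coordinates inside a repeat.

```python
def equivalent_deletion_positions(ref, pos, length):
    """Return every 0-based start at which deleting `length` bases from `ref`
    gives the same sequence as deleting ref[pos:pos + length]."""
    target = ref[:pos] + ref[pos + length:]
    return [
        start
        for start in range(len(ref) - length + 1)
        if ref[:start] + ref[start + length:] == target
    ]


# Hypothetical reference with a CA dinucleotide repeat.
reference = "GATTCACACACAGG"
positions = equivalent_deletion_positions(reference, pos=4, length=2)
print(positions)  # all placements that describe the same 2-base deletion
```

Within the repeat, every listed start is an equally valid representation of the deletion, so the coordinate that ends up in a variant catalog depends on which placement the aligner happened to choose.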
Affiliation(s)
- Quang Tran
- Department of Computer Science, University of Memphis, Memphis, 38152, TN, USA
| | - Shanshan Gao
- Department of Computer Science, University of Memphis, Memphis, 38152, TN, USA
| | - Vinhthuy Phan
- Department of Computer Science, University of Memphis, Memphis, 38152, TN, USA.
| |
|
26
|
Pareek CS, Smoczyński R, Kadarmideen HN, Dziuba P, Błaszczyk P, Sikora M, Walendzik P, Grzybowski T, Pierzchała M, Horbańczuk J, Szostak A, Ogluszka M, Zwierzchowski L, Czarnik U, Fraser L, Sobiech P, Wąsowicz K, Gelfand B, Feng Y, Kumar D. Single Nucleotide Polymorphism Discovery in Bovine Pituitary Gland Using RNA-Seq Technology. PLoS One 2016; 11:e0161370. [PMID: 27606429 PMCID: PMC5015895 DOI: 10.1371/journal.pone.0161370] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2016] [Accepted: 08/04/2016] [Indexed: 01/14/2023] Open
Abstract
Examination of the bovine pituitary gland transcriptome by strand-specific RNA-seq allows detection of putative single nucleotide polymorphisms (SNPs) within potential candidate genes (CGs) or QTL regions and helps to understand the genomic variation that contributes to economic traits. Here we report a breed-specific model to successfully perform the detection of SNPs in the pituitary gland of young growing bulls representing Polish Holstein-Friesian (HF), Polish Red, and Hereford breeds at three developmental ages, viz., six months, nine months, and twelve months. A total of 18 bovine pituitary gland polyA transcriptome libraries were prepared and sequenced using the Illumina NextSeq 500 platform. The FastQ files of all 18 young bulls were submitted to the NCBI-SRA database under NCBI-SRA accession SRS1296732. For the investigated young bulls, a total of 1,138,823,098 raw paired-end reads with a length of 156 bases were obtained, or approximately 63 million paired-end reads per library. Breed-wise, a total of 515.38, 215.39, and 408.04 million paired-end reads were obtained for Polish HF, Polish Red, and Hereford breeds, respectively. Burrows-Wheeler Aligner (BWA) read alignments showed that 93.04%, 94.39%, and 83.46% of the mapped sequencing reads were properly paired for the Polish HF, Polish Red, and Hereford breeds, respectively. The constructed breed-specific SNP-db of the three cattle breeds yielded 13,775,885 SNPs. On average, 765,326 breed-specific SNPs per young bull were identified. Using two stringent filtering parameters, i.e., a minimum of 10 SNP reads per base with an accuracy ≥ 90% and a minimum of 10 SNP reads per base with an accuracy = 100%, SNP-db records were trimmed to construct a highly reliable SNP-db. This reduced the constructed raw SNP-db by 95.7% and 96.4%, respectively. Finally, SNP discoveries using RNA-Seq data were validated by KASP™ SNP genotyping assay.
The comprehensive analysis of 76 QTLs/CGs with RNA-seq data identified the CGs KCNIP4, CCSER1, DPP6, MAP3K5 and GHR as having the highest numbers of SNP hits in all three breeds and developmental ages. However, the CAST CG with more than 100 SNP hits was observed only in the Polish HF and Hereford breeds. These findings are important for the identification and construction of novel tissue-specific and breed-specific SNP-db datasets. By screening putative SNPs against the QTL db and candidate genes for bovine growth and reproduction traits, one can develop genomic selection strategies for these traits.
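The two filtering thresholds described above can be sketched as a simple predicate. This is a minimal Python illustration under stated assumptions, not the authors' pipeline: it assumes that "accuracy" means the fraction of reads covering the base that support the SNP allele, and the `SnpCall` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    snp_id: str
    supporting_reads: int  # reads carrying the SNP allele (assumed meaning of "SNP reads")
    total_reads: int       # all reads covering the base


def passes(call, min_reads=10, min_accuracy=0.90):
    """Keep a call only if it has enough supporting reads and a high enough
    supporting fraction (the assumed definition of 'accuracy')."""
    if call.total_reads == 0 or call.supporting_reads < min_reads:
        return False
    return call.supporting_reads / call.total_reads >= min_accuracy


# Hypothetical calls, for illustration only.
calls = [
    SnpCall("snp1", supporting_reads=19, total_reads=20),
    SnpCall("snp2", supporting_reads=12, total_reads=20),
    SnpCall("snp3", supporting_reads=10, total_reads=10),
    SnpCall("snp4", supporting_reads=9, total_reads=9),
]

relaxed = [c.snp_id for c in calls if passes(c, min_accuracy=0.90)]
strict = [c.snp_id for c in calls if passes(c, min_accuracy=1.00)]
print(relaxed, strict)
```

The stricter setting keeps a subset of what the relaxed setting keeps, which matches the slightly larger reduction the abstract reports for the 100% threshold.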
Affiliation(s)
- Chandra Shekhar Pareek
- Division of Functional Genomics in Biological and Biomedical Research, Centre for Modern Interdisciplinary Technologies, Nicolaus Copernicus University, Torun, Poland
| | - Rafał Smoczyński
- Division of Functional Genomics in Biological and Biomedical Research, Centre for Modern Interdisciplinary Technologies, Nicolaus Copernicus University, Torun, Poland
| | - Haja N. Kadarmideen
- Department of Large Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Piotr Dziuba
- Division of Functional Genomics in Biological and Biomedical Research, Centre for Modern Interdisciplinary Technologies, Nicolaus Copernicus University, Torun, Poland
| | - Paweł Błaszczyk
- Division of Functional Genomics in Biological and Biomedical Research, Centre for Modern Interdisciplinary Technologies, Nicolaus Copernicus University, Torun, Poland
| | - Marcin Sikora
- Division of Functional Genomics in Biological and Biomedical Research, Centre for Modern Interdisciplinary Technologies, Nicolaus Copernicus University, Torun, Poland
| | - Paulina Walendzik
- Division of Functional Genomics in Biological and Biomedical Research, Centre for Modern Interdisciplinary Technologies, Nicolaus Copernicus University, Torun, Poland
| | - Tomasz Grzybowski
- Ludwik Rydygier Collegium Medicum, Institute of Forensic Medicine, Department of Molecular and Forensic Genetics, The Nicolaus Copernicus University, Bydgoszcz, Poland
| | - Mariusz Pierzchała
- Institute of Genetics and Animal Breeding of the Polish Academy of Sciences, Jastrzebiec, Poland
| | - Jarosław Horbańczuk
- Institute of Genetics and Animal Breeding of the Polish Academy of Sciences, Jastrzebiec, Poland
| | - Agnieszka Szostak
- Institute of Genetics and Animal Breeding of the Polish Academy of Sciences, Jastrzebiec, Poland
| | - Magdalena Ogluszka
- Institute of Genetics and Animal Breeding of the Polish Academy of Sciences, Jastrzebiec, Poland
| | - Lech Zwierzchowski
- Institute of Genetics and Animal Breeding of the Polish Academy of Sciences, Jastrzebiec, Poland
| | - Urszula Czarnik
- Faculty of Animal Bio-engineering, University of Warmia and Mazury in Olsztyn, Olsztyn, Poland
| | - Leyland Fraser
- Faculty of Animal Bio-engineering, University of Warmia and Mazury in Olsztyn, Olsztyn, Poland
| | - Przemysław Sobiech
- Faculty of Veterinary Medicine, University of Warmia and Mazury in Olsztyn, Olsztyn, Poland
| | - Krzysztof Wąsowicz
- Faculty of Veterinary Medicine, University of Warmia and Mazury in Olsztyn, Olsztyn, Poland
| | - Brian Gelfand
- Waksman Institute of Microbiology, Rutgers, The State University of New Jersey, Piscataway, New Jersey, United States of America
| | - Yaping Feng
- Waksman Institute of Microbiology, Rutgers, The State University of New Jersey, Piscataway, New Jersey, United States of America
| | - Dibyendu Kumar
- Waksman Institute of Microbiology, Rutgers, The State University of New Jersey, Piscataway, New Jersey, United States of America
| |
|
27
|
van der Weide RH, Simonis M, Hermsen R, Toonen P, Cuppen E, de Ligt J. The Genomic Scrapheap Challenge; Extracting Relevant Data from Unmapped Whole Genome Sequencing Reads, Including Strain Specific Genomic Segments, in Rats. PLoS One 2016; 11:e0160036. [PMID: 27501045 PMCID: PMC4976967 DOI: 10.1371/journal.pone.0160036] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2016] [Accepted: 07/12/2016] [Indexed: 01/17/2023] Open
Abstract
Unmapped next-generation sequencing reads are typically ignored, although they contain biologically relevant information. We systematically analyzed unmapped reads from whole genome sequencing of 33 inbred rat strains. High-quality reads were selected and enriched for biologically relevant sequences; similarity-based analysis revealed clustering similar to previously reported phylogenetic trees. Our results demonstrate that on average 20% of all unmapped reads harbor sequences that can be used to improve reference genomes and generate hypotheses on potential genotype-phenotype relationships. Analysis pipelines would benefit from incorporating the described methods, and reference genomes would benefit from inclusion of the genomic segments obtained through these efforts.
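As a first step toward such an analysis, unmapped reads have to be pulled out of the alignment output. The sketch below assumes the alignments are available as SAM text and relies only on the SAM convention that FLAG bit 0x4 marks an unmapped segment; the function name and the example records are hypothetical, and the quality selection and enrichment steps described in the abstract are not shown.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def unmapped_read_names(sam_lines):
    """Return the names of reads whose FLAG has the unmapped bit set."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & UNMAPPED:
            names.append(fields[0])
    return names


# Hypothetical SAM records (11 mandatory fields each).
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
    "read3\t77\t*\t0\t0\t*\t*\t0\t0\tGGGG\tIIII",
    "read4\t99\tchr1\t200\t60\t4M\t=\t300\t104\tCCCC\tIIII",
]
names = unmapped_read_names(example)
print(names)
```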
Affiliation(s)
- Robin H. van der Weide
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
- Division of Gene Regulation, The Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Marieke Simonis
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
| | - Roel Hermsen
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
| | - Pim Toonen
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
| | - Edwin Cuppen
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
| | - Joep de Ligt
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
| |
|
28
|
HLA-F coding and regulatory segments variability determined by massively parallel sequencing procedures in a Brazilian population sample. Hum Immunol 2016; 77:841-853. [PMID: 27448841 DOI: 10.1016/j.humimm.2016.07.231] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2016] [Revised: 07/18/2016] [Accepted: 07/19/2016] [Indexed: 12/30/2022]
Abstract
Human Leucocyte Antigen F (HLA-F) is a non-classical HLA class I gene distinguished from its classical counterparts by low allelic polymorphism and distinctive expression patterns. Its exact function remains unknown. It is believed that HLA-F has tolerogenic and immune modulatory properties. Currently, there is little information regarding the HLA-F allelic variation among human populations, and the available studies have evaluated only a fraction of the HLA-F gene segment and/or have searched for known alleles only. Here we present a strategy to evaluate the complete HLA-F variability including its 5' upstream, coding and 3' downstream segments by using massively parallel sequencing procedures. HLA-F variability was surveyed in 196 individuals from the Brazilian Southeast. The results indicate that the HLA-F gene is indeed conserved at the protein level, where thirty coding haplotypes or coding alleles were detected, encoding only four different HLA-F full-length protein molecules. Moreover, the same protein molecule is encoded by 82.45% of all coding alleles detected in this Brazilian population sample. However, the HLA-F nucleotide and haplotype variability is much higher than currently documented, both in Brazilians and in the 1000 Genomes Project data. This protein conservation is probably a consequence of the key role of HLA-F in the immune system physiology.
|
29
|
De novo transcriptome sequencing and gene expression profiling of spinach (Spinacia oleracea L.) leaves under heat stress. Sci Rep 2016; 6:19473. [PMID: 26857466 PMCID: PMC4746569 DOI: 10.1038/srep19473] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2015] [Accepted: 12/09/2015] [Indexed: 02/08/2023] Open
Abstract
Spinach (Spinacia oleracea) is cold tolerant but heat sensitive. The spinach variety ‘Island’ is suitable for summer periods. There is a lack of molecular information on the response of spinach to heat stress. In this study, high-throughput de novo transcriptome sequencing and gene expression analyses were carried out on leaves of the spinach variety ‘Island’ (grown at 24 °C (control), or exposed to 35 °C for 30 min (S1) or 5 h (S2)). A total of 133,200,898 clean reads were assembled into 59,413 unigenes (average size 1259.55 bp), of which 33,573 matched entries in public databases. There were 986 differentially expressed genes (DEGs) between control and S1, 1741 between control and S2, and 1587 between S1 and S2. Gene Ontology (GO) and pathway enrichment analysis indicated that many heat-responsive genes and other stress-responsive genes were among these DEGs, suggesting that the heat stress may have induced an extensive abiotic stress effect. Comparative transcriptome analysis found 896 unique genes in the spinach heat-response transcripts. The expression patterns of 13 selected genes were verified by RT-qPCR (quantitative real-time PCR). Our study found a series of candidate genes and pathways that may be related to heat resistance in spinach.
|
30
|
Castelli EC, Mendes-Junior CT, Sabbagh A, Porto IOP, Garcia A, Ramalho J, Lima THA, Massaro JD, Dias FC, Collares CVA, Jamonneau V, Bucheton B, Camara M, Donadi EA. HLA-E coding and 3' untranslated region variability determined by next-generation sequencing in two West-African population samples. Hum Immunol 2015; 76:945-53. [PMID: 26187162 DOI: 10.1016/j.humimm.2015.06.016] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2014] [Revised: 05/26/2015] [Accepted: 06/20/2015] [Indexed: 12/30/2022]
Abstract
HLA-E is a non-classical Human Leucocyte Antigen class I gene with immunomodulatory properties. Whereas HLA-E expression usually occurs at low levels, it is widely distributed amongst human tissues, has the ability to bind self and non-self antigens and to interact with NK cells and T lymphocytes, being important for immunosurveillance and also for fighting against infections. HLA-E is usually the most conserved locus among all class I genes. However, most of the previous studies evaluating HLA-E variability sequenced only a few exons or genotyped known polymorphisms. Here we report a strategy to evaluate HLA-E variability by next-generation sequencing (NGS) that might be applied to other HLA loci, and present the HLA-E haplotype diversity considering the segment encoding the entire HLA-E mRNA (including 5'UTR, introns and the 3'UTR) in two African population samples, Susu from Guinea-Conakry and Lobi from Burkina Faso. Our results indicate that (a) the HLA-E gene is indeed conserved, encoding mainly two different protein molecules; (b) Africans do present several unknown HLA-E alleles presenting synonymous mutations; (c) the HLA-E 3'UTR is quite polymorphic and (d) haplotypes in the HLA-E 3'UTR are in close association with HLA-E coding alleles. NGS has proved to be an important tool for data generation in future studies evaluating variability in non-classical MHC genes.
Affiliation(s)
- Erick C Castelli
- School of Medicine of Botucatu, UNESP - Univ Estadual Paulista, Department of Pathology, Botucatu, State of São Paulo, Brazil.
| | - Celso T Mendes-Junior
- Departamento de Química, Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto, Universidade de São Paulo, 14040-901 Ribeirão Preto, SP, Brazil
| | - Audrey Sabbagh
- Institute of Research for Development, Mixed Research Unit 216 MERIT, Paris, France; Faculté de Pharmacie, Université Paris Descartes, Sorbonne Paris Cité, Paris, France
| | - Iane O P Porto
- School of Medicine of Botucatu, UNESP - Univ Estadual Paulista, Department of Pathology, Botucatu, State of São Paulo, Brazil
| | - André Garcia
- Institute of Research for Development, Mixed Research Unit 216 MERIT, Paris, France; Faculté de Pharmacie, Université Paris Descartes, Sorbonne Paris Cité, Paris, France
| | - Jaqueline Ramalho
- School of Medicine of Botucatu, UNESP - Univ Estadual Paulista, Department of Pathology, Botucatu, State of São Paulo, Brazil
| | - Thálitta H A Lima
- School of Medicine of Botucatu, UNESP - Univ Estadual Paulista, Department of Pathology, Botucatu, State of São Paulo, Brazil
| | - Juliana D Massaro
- Division of Clinical Immunology, Department of Medicine, School of Medicine of Ribeirão Preto, University of São Paulo - USP, Ribeirão Preto, SP, Brazil
| | - Fabrício C Dias
- Division of Clinical Immunology, Department of Medicine, School of Medicine of Ribeirão Preto, University of São Paulo - USP, Ribeirão Preto, SP, Brazil
| | - Cristhianna V A Collares
- Division of Clinical Immunology, Department of Medicine, School of Medicine of Ribeirão Preto, University of São Paulo - USP, Ribeirão Preto, SP, Brazil
| | - Vincent Jamonneau
- International Center for Development Research on Aging in Sub-Humid Areas (CIRDES), Bobo-Dioulasso, Burkina Faso; Institute of Research for Development, Mixed Research Unit IRD-CIRAD 177, Montpellier, France
| | - Bruno Bucheton
- Institute of Research for Development, Mixed Research Unit IRD-CIRAD 177, Montpellier, France; National Sleeping Sickness Control Program, Ministry of Health and Public Hygiene, Conakry, Guinea
| | - Mamadou Camara
- National Sleeping Sickness Control Program, Ministry of Health and Public Hygiene, Conakry, Guinea
| | - Eduardo A Donadi
- Division of Clinical Immunology, Department of Medicine, School of Medicine of Ribeirão Preto, University of São Paulo - USP, Ribeirão Preto, SP, Brazil
| |
|
31
|
SeqMule: automated pipeline for analysis of human exome/genome sequencing data. Sci Rep 2015; 5:14283. [PMID: 26381817 PMCID: PMC4585643 DOI: 10.1038/srep14283] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2015] [Accepted: 08/21/2015] [Indexed: 11/16/2022] Open
Abstract
Next-generation sequencing (NGS) technology has greatly helped us identify disease-contributory variants for Mendelian diseases. However, users are often faced with issues such as software compatibility, complicated configuration, and lack of access to high-performance computing facilities. Discrepancies exist among aligners and variant callers. We developed a computational pipeline, SeqMule, to perform automated variant calling from NGS data on human genomes and exomes. SeqMule integrates computational-cluster-free parallelization capability built on top of the variant callers, and facilitates normalization/intersection of variant calls to generate a consensus set with high confidence. SeqMule integrates 5 alignment tools and 5 variant calling algorithms and accepts various combinations, all through a one-line command, therefore allowing highly flexible yet fully automated variant calling. On a modern machine (2 Intel Xeon X5650 CPUs, 48 GB memory), when fast turn-around is needed, SeqMule generates annotated VCF files in a day from a 30X whole-genome sequencing data set; when more accurate calling is needed, SeqMule generates a consensus call set that improves over single callers, as measured by both Mendelian error rate and consistency. SeqMule supports Sun Grid Engine for parallel processing, offers a turn-key solution for deployment on Amazon Web Services, and allows quality checks, Mendelian error checks, consistency evaluation, and HTML-based reports. SeqMule is available at http://seqmule.openbioinformatics.org.
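The idea of intersecting calls from several callers to obtain a consensus set can be sketched in a few lines of Python. This is an illustration of the general technique, not SeqMule's implementation or command-line interface: the caller names, the variant tuples, and the `consensus` helper are hypothetical, and the variants are assumed to be already normalized so that identical variants compare equal.

```python
from collections import Counter


def consensus(call_sets, min_support):
    """Return variants reported by at least `min_support` callers.

    `call_sets` maps a caller name to an iterable of
    (chrom, pos, ref, alt) tuples, assumed to be normalized.
    """
    counts = Counter(
        variant for calls in call_sets.values() for variant in set(calls)
    )
    return sorted(v for v, n in counts.items() if n >= min_support)


# Hypothetical call sets from three callers.
call_sets = {
    "caller_a": [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")],
    "caller_b": [("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA")],
    "caller_c": [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")],
}

majority = consensus(call_sets, min_support=2)
unanimous = consensus(call_sets, min_support=3)
print(majority, unanimous)
```

Raising `min_support` trades sensitivity for confidence: the unanimous set is smaller but contains only calls on which every caller agrees.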
|
32
|
Gamal El-Dien O, Ratcliffe B, Klápště J, Chen C, Porth I, El-Kassaby YA. Prediction accuracies for growth and wood attributes of interior spruce in space using genotyping-by-sequencing. BMC Genomics 2015; 16:370. [PMID: 25956247 PMCID: PMC4424896 DOI: 10.1186/s12864-015-1597-y] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2015] [Accepted: 04/28/2015] [Indexed: 02/02/2024] Open
Abstract
Background Genomic selection (GS) in forestry can substantially reduce the length of the breeding cycle and increase gain per unit time through early selection and greater selection intensity, particularly for traits of low heritability and late expression. Affordable next-generation sequencing technologies have made it possible to genotype large numbers of trees at a reasonable cost. Results Genotyping-by-sequencing was used to genotype 1,126 Interior spruce trees representing 25 open-pollinated families planted over three sites in British Columbia, Canada. Four imputation algorithms were compared (mean value (MI), singular value decomposition (SVD), expectation maximization (EM), and a newly derived, family-based k-nearest neighbor (kNN-Fam)). Trees were phenotyped for several yield and wood attributes. Single- and multi-site GS prediction models were developed using the Ridge Regression Best Linear Unbiased Predictor (RR-BLUP) and the Generalized Ridge Regression (GRR) to test different assumptions about trait architecture. Finally, using PCA, multi-trait GS prediction models were developed. The EM and kNN-Fam imputation methods were superior for 30 and 60% missing data, respectively. The RR-BLUP GS prediction model produced better accuracies than the GRR, indicating that the genetic architecture for these traits is complex. GS prediction accuracies for multi-site were high and better than those of single-sites, while multi-site predictability produced the lowest accuracies, reflecting type-b genetic correlations, and was deemed unreliable. The incorporation of genomic information in quantitative genetics analyses produced more realistic heritability estimates, as half-sib pedigree tended to inflate the additive genetic variance and subsequently both heritability and gain estimates. Principal component scores as representatives of multi-trait GS prediction models produced surprising results where negatively correlated traits could be concurrently selected for using PCA2 and PCA3.
Conclusions The application of GS to open-pollinated family testing, the simplest form of tree improvement evaluation methods, was proven to be effective. Prediction accuracies obtained for all traits greatly support the integration of GS in tree breeding. While the within-site GS prediction accuracies were high, the results clearly indicate that the ability of single-site GS models to predict other sites is unreliable, supporting the use of a multi-site approach. Principal component scores provided an opportunity for the concurrent selection of traits with different phenotypic optima. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1597-y) contains supplementary material, which is available to authorized users.
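Of the four imputation approaches compared, mean value imputation is the simplest to state. The Python sketch below illustrates the general idea only and is not the authors' code: it assumes genotypes are coded as 0/1/2 allele counts with `None` for missing calls, and the function name and example matrix are hypothetical.

```python
def mean_impute(genotypes):
    """Replace missing entries (None) with the mean of the observed
    genotypes at the same marker (column)."""
    n_markers = len(genotypes[0])
    means = []
    for j in range(n_markers):
        observed = [row[j] for row in genotypes if row[j] is not None]
        # Assumption: a marker with no observed calls is imputed as 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if row[j] is None else row[j] for j in range(n_markers)]
        for row in genotypes
    ]


# Hypothetical genotype matrix: rows are trees, columns are markers.
genotypes = [
    [0, None, 2],
    [2, 1, None],
    [None, 1, 2],
]
imputed = mean_impute(genotypes)
print(imputed)
```

Mean imputation ignores family structure, which is the information a family-based nearest-neighbor method such as kNN-Fam is designed to exploit.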
Affiliation(s)
- Omnia Gamal El-Dien
- Department of Forest and Conservation Sciences, Faculty of Forestry, The University of British Columbia, 2424 Main Mall, Vancouver, British Columbia, V6T 1Z4, Canada.
| | - Blaise Ratcliffe
- Department of Forest and Conservation Sciences, Faculty of Forestry, The University of British Columbia, 2424 Main Mall, Vancouver, British Columbia, V6T 1Z4, Canada.
| | - Jaroslav Klápště
- Department of Forest and Conservation Sciences, Faculty of Forestry, The University of British Columbia, 2424 Main Mall, Vancouver, British Columbia, V6T 1Z4, Canada; Department of Genetics and Physiology of Forest Trees, Faculty of Forestry and Wood Sciences, Czech University of Life Sciences Prague, Kamycka 129, 165 21, Prague 6, Czech Republic.
| | - Charles Chen
- Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK, 74078-3035, USA.
| | - Ilga Porth
- Department of Forest and Conservation Sciences, Faculty of Forestry, The University of British Columbia, 2424 Main Mall, Vancouver, British Columbia, V6T 1Z4, Canada.
| | - Yousry A El-Kassaby
- Department of Forest and Conservation Sciences, Faculty of Forestry, The University of British Columbia, 2424 Main Mall, Vancouver, British Columbia, V6T 1Z4, Canada.
| |
|
33
|
Abstract
Background High-throughput sequencing is a cost-effective method for identifying genetic variation, and it is currently in use on a large scale across the field of biology, including ecology and population genetics. Correctly identifying variable sites and allele frequencies from sequencing data remains challenging, in large part due to artifacts and biases inherent in the sequencing process. Selecting variants that are diagnostic is commonly done using diversity statistics like FST, but these measures are not ideal for the task. Results Here, we develop a method that directly calculates the expected amount of information gained from observing each variant site. We then develop and implement a conservative estimator that takes into account uncertainty introduced by sampling bias and sequencing error. This estimator is applied to simulated and real sequencing data, and we discuss how it performs compared to the commonly used existing methods for identifying diagnostic polymorphisms. Conclusion The expected information content gives an easy-to-interpret measure for the usefulness of variant sites. The results show that we achieve a clear separation between true variants and noise, allowing us to select candidate sites with a high degree of confidence.
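One way to make "expected information gained from observing a variant site" concrete is the mutual information between population membership and the observed allele. The Python sketch below is an assumption-laden illustration, not the estimator developed in the paper: it assumes a biallelic site, two populations with equal prior probability, and known allele frequencies, and it ignores the sampling and sequencing-error corrections the abstract describes.

```python
import math


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information(freq_a, freq_b):
    """Expected bits gained about population membership (equal priors)
    from observing one allele at a biallelic site, given the frequency
    of the reference allele in population A and population B."""
    mean = (freq_a + freq_b) / 2
    h_mixture = entropy([mean, 1 - mean])
    h_conditional = (
        entropy([freq_a, 1 - freq_a]) + entropy([freq_b, 1 - freq_b])
    ) / 2
    return h_mixture - h_conditional


fixed = expected_information(1.0, 0.0)    # fixed difference between populations
shared = expected_information(0.5, 0.5)   # identical frequencies
partial = expected_information(0.9, 0.1)  # strongly but not fully diagnostic
print(fixed, shared, partial)
```

Under these assumptions a fixed difference yields one full bit per observed allele, identical frequencies yield none, and intermediate cases fall in between, which gives the measure a direct interpretation.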
|
34
|
Santana-Quintero L, Dingerdissen H, Thierry-Mieg J, Mazumder R, Simonyan V. HIVE-hexagon: high-performance, parallelized sequence alignment for next-generation sequencing data analysis. PLoS One 2014; 9:e99033. [PMID: 24918764 PMCID: PMC4053384 DOI: 10.1371/journal.pone.0099033] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2014] [Accepted: 05/09/2014] [Indexed: 12/31/2022] Open
Abstract
Due to the size of Next-Generation Sequencing data, the computational challenge of sequence alignment has been vast. Inexact alignments can take up to 90% of total CPU time in bioinformatics pipelines. High-performance Integrated Virtual Environment (HIVE), a cloud-based environment optimized for storage and analysis of extra-large data, presents an algorithmic solution: the HIVE-hexagon DNA sequence aligner. HIVE-hexagon implements novel approaches to exploit both characteristics of sequence space and CPU, RAM and Input/Output (I/O) architecture to quickly compute accurate alignments. Key components of HIVE-hexagon include non-redundification and sorting of sequences; floating diagonals of linearized dynamic programming matrices; and consideration of cross-similarity to minimize computations. AVAILABILITY https://hive.biochemistry.gwu.edu/hive/
Affiliation(s)
- Luis Santana-Quintero
- Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America
| | - Hayley Dingerdissen
- Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America
- Department of Biochemistry and Molecular Biology, George Washington University Medical Center, Washington, DC, United States of America
| | - Jean Thierry-Mieg
- National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Raja Mazumder
- Department of Biochemistry and Molecular Biology, George Washington University Medical Center, Washington, DC, United States of America
| | - Vahan Simonyan
- Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, Maryland, United States of America
| |
|
35
|
Li MJ, Yan B, Sham PC, Wang J. Exploring the function of genetic variants in the non-coding genomic regions: approaches for identifying human regulatory variants affecting gene expression. Brief Bioinform 2014; 16:393-412. [PMID: 24916300 DOI: 10.1093/bib/bbu018] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2014] [Accepted: 04/23/2014] [Indexed: 12/13/2022] Open
Abstract
Understanding the genetic basis of human traits/diseases and the underlying mechanisms of how these traits/diseases are affected by genetic variations is critical for public health. Current genome-wide functional genomics data have uncovered a large number of functional elements in the noncoding regions of the human genome, providing new opportunities to study regulatory variants (RVs). RVs play important roles in transcription factor binding, chromatin states and epigenetic modifications. Here, we systematically review an array of methods currently used to map RVs as well as the computational approaches for annotating and interpreting their regulatory effects, with emphasis on regulatory single-nucleotide polymorphisms. We also briefly introduce experimental methods to validate these functional RVs.
|
36
|
Massoumi Alamouti S, Haridas S, Feau N, Robertson G, Bohlmann J, Breuil C. Comparative Genomics of the Pine Pathogens and Beetle Symbionts in the Genus Grosmannia. Mol Biol Evol 2014; 31:1454-74. [DOI: 10.1093/molbev/msu102] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 12/29/2022] Open
|
37
|
White BS, DiPersio JF. Genomic tools in acute myeloid leukemia: From the bench to the bedside. Cancer 2014; 120:1134-44. [PMID: 24474533 DOI: 10.1002/cncr.28552] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 09/25/2013] [Accepted: 11/14/2013] [Indexed: 12/28/2022]
Abstract
Since its use in the initial characterization of an acute myeloid leukemia (AML) genome, next-generation sequencing (NGS) has continued to molecularly refine the disease. Here, the authors review the spectrum of NGS applications that have subsequently delineated the prognostic significance and biologic consequences of these mutations. Furthermore, the role of this technology in providing a high-resolution glimpse of AML clonal heterogeneity, which may inform future choice of targeted therapy, is discussed. Although obstacles remain in applying these techniques clinically, they have already had an impact on patient care.
Affiliation(s)
- Brian S White
- Department of Internal Medicine, Division of Oncology, Washington University School of Medicine, St. Louis, Missouri; The Genome Institute, Washington University, St. Louis, Missouri
|
38
|
|
39
|
Bosdet IE, Docking TR, Butterfield YS, Mungall AJ, Zeng T, Coope RJ, Yorida E, Chow K, Bala M, Young SS, Hirst M, Birol I, Moore RA, Jones SJ, Marra MA, Holt R, Karsan A. A Clinically Validated Diagnostic Second-Generation Sequencing Assay for Detection of Hereditary BRCA1 and BRCA2 Mutations. J Mol Diagn 2013; 15:796-809. [DOI: 10.1016/j.jmoldx.2013.07.004] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 12/30/2012] [Revised: 07/09/2013] [Accepted: 07/17/2013] [Indexed: 12/16/2022] Open
|
40
|
Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics 2013; 14 Suppl 11:S1. [PMID: 24564169 PMCID: PMC3846878 DOI: 10.1186/1471-2105-14-s11-s1] [Citation(s) in RCA: 350] [Impact Index Per Article: 29.2] [Indexed: 01/11/2023] Open
Abstract
Copy number variation (CNV) is a prevalent form of critical genetic variation that leads to an abnormal number of copies of large genomic regions in a cell. Microarray-based comparative genome hybridization (arrayCGH) and genotyping arrays were the standard technologies for detecting large regions subject to copy number changes in genomes until recently, when it became possible to analyze high-resolution sequence data from next-generation sequencing (NGS). During the last several years, NGS-based analysis has been widely applied to identify CNVs in both healthy and diseased individuals. Correspondingly, the strong demand for NGS-based CNV analyses has fuelled development of numerous computational methods and tools for CNV detection. In this article, we review the recent advances in computational methods pertaining to CNV detection using whole genome and whole exome sequencing data. Additionally, we discuss their strengths and weaknesses and suggest directions for future development.
|
41
|
Simmons HE, Dunham JP, Zinn KE, Munkvold GP, Holmes EC, Stephenson AG. Zucchini yellow mosaic virus (ZYMV, Potyvirus): vertical transmission, seed infection and cryptic infections. Virus Res 2013; 176:259-64. [PMID: 23845301 PMCID: PMC3774540 DOI: 10.1016/j.virusres.2013.06.016] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 05/10/2013] [Revised: 06/25/2013] [Accepted: 06/28/2013] [Indexed: 12/31/2022]
Abstract
The role played by seed transmission in the evolution and epidemiology of viral crop pathogens remains unclear. We determined the seed infection and vertical transmission rates of zucchini yellow mosaic virus (ZYMV), in addition to undertaking Illumina sequencing of nine vertically transmitted ZYMV populations. We previously determined the seed-to-seedling transmission rate of ZYMV in Cucurbita pepo ssp. texana (a wild gourd) to be 1.6%, and herein observed a similar rate (1.8%) in the subsequent generation. We also observed that the seed infection rate is substantially higher (21.9%) than the seed-to-seedling transmission rate, suggesting that a major population bottleneck occurs during seed germination and seedling growth. In contrast, that two thirds of the variants present in the horizontally transmitted inoculant population were also present in the vertically transmitted populations implies that the bottleneck at vertical transmission may not be particularly severe. Strikingly, all of the vertically infected plants were symptomless in contrast to those infected horizontally, suggesting that vertical infection may be cryptic. Although no known virulence determining mutations were observed in the vertically infected samples, the 5' untranslated region was highly variable, with at least 26 different major haplotypes in this region compared to the two major haplotypes observed in the horizontally transmitted population. That the regions necessary for vector transmission are retained in the vertically infected populations, combined with the cryptic nature of vertical infection, suggests that seed transmission may be a significant contributor to the spread of ZYMV.
Affiliation(s)
- H E Simmons
- Seed Science Center, Iowa State University, Ames, IA 50011, USA.
|
42
|
Bian J, Liu C, Wang H, Xing J, Kachroo P, Zhou X. SNVHMM: predicting single nucleotide variants from next generation sequencing. BMC Bioinformatics 2013; 14:225. [PMID: 23855743 PMCID: PMC3718670 DOI: 10.1186/1471-2105-14-225] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 02/27/2013] [Accepted: 07/03/2013] [Indexed: 02/04/2023] Open
Abstract
Background: The rapid development of next generation sequencing (NGS) technology provides a novel avenue for genomic exploration and research. Single nucleotide variants (SNVs) inferred from next generation sequencing are expected to reveal gene mutations in cancer. However, NGS has lower sequence coverage and poor SNV detection capability in the regulatory regions of the genome. Post probabilistic based methods are efficient for detection of SNVs in high coverage regions or sequencing data with high depth. However, for data with low sequencing depth, the efficiency of such algorithms remains poor and needs to be improved. Results: A new tool, SNVHMM, based on a discrete hidden Markov model (HMM) was developed to infer the genotype for each position on the genome. We incorporated the mapping quality of each read and the corresponding base quality on the reads into the emission probability of the HMM. The context information of the whole observation as well as its confidence were fully utilized to infer the genotype for each position on the genome under study. Therefore, more probability power can be gained over the Bayes based methods, which is very useful for SNV detection in data with low sequencing depth. Moreover, our model was verified by testing against two sets of lobular breast tumor and Myelodysplastic Syndromes (MDS) data each. Compared with a recently published SNV calling algorithm, SNVMix2, our model largely improved on the performance of SNVMix2 when the sequencing depth is low and also outperformed SNVMix2 when SNVMix2 is well trained by large datasets. Conclusions: SNVHMM can detect SNVs from NGS cancer data efficiently even if the sequence depth is very low. The training data size can be very small for SNVHMM to work. SNVHMM incorporates the base quality and mapping quality of all observed bases and reads, and also provides the option for users to choose the confidence of the observation for SNV prediction.
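The idea in the abstract above, a discrete HMM over per-position genotypes whose emission probability folds in base quality and mapping quality, can be sketched roughly as follows. This is a toy illustration and not the SNVHMM implementation: the three genotype states, the way the two Phred qualities are combined into one error probability, the prior, and the transition probabilities are all assumptions chosen only to make the example run.

```python
import math

# Hypothetical biallelic genotype states and their expected alt-allele fraction.
GENOTYPES = ("RR", "RA", "AA")
ALT_FRACTION = {"RR": 0.0, "RA": 0.5, "AA": 1.0}

def read_error(base_q, map_q):
    """Combine Phred base and mapping quality into one error probability (assumed form)."""
    ok = (1 - 10 ** (-base_q / 10)) * (1 - 10 ** (-map_q / 10))
    return min(max(1 - ok, 1e-6), 0.5)  # clamp so logs stay finite

def log_emission(genotype, reads):
    """Log-probability of the reads at one site; reads are (is_alt, base_q, map_q)."""
    f = ALT_FRACTION[genotype]
    total = 0.0
    for is_alt, base_q, map_q in reads:
        e = read_error(base_q, map_q)
        p_alt = f * (1 - e) + (1 - f) * e
        total += math.log(p_alt if is_alt else 1 - p_alt)
    return total

def make_params(stay=0.98, prior=(0.998, 0.001, 0.001)):
    """Placeholder prior and transition matrix, returned in log space."""
    log_init = {g: math.log(p) for g, p in zip(GENOTYPES, prior)}
    move = (1 - stay) / (len(GENOTYPES) - 1)
    log_trans = {g: {h: math.log(stay if g == h else move) for h in GENOTYPES}
                 for g in GENOTYPES}
    return log_init, log_trans

def viterbi(sites, log_init, log_trans):
    """Most likely genotype path over consecutive sites."""
    if not sites:
        return []
    scores = {g: log_init[g] + log_emission(g, sites[0]) for g in GENOTYPES}
    back = []
    for reads in sites[1:]:
        cur, ptr = {}, {}
        for g in GENOTYPES:
            best = max(GENOTYPES, key=lambda h: scores[h] + log_trans[h][g])
            cur[g] = scores[best] + log_trans[best][g] + log_emission(g, reads)
            ptr[g] = best
        scores = cur
        back.append(ptr)
    state = max(GENOTYPES, key=lambda g: scores[g])
    path = [state]
    for ptr in reversed(back):  # trace back from the last site to the first
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Three sites at depth 10: all reference, half alternate, all reference.
ref, alt = (False, 30, 60), (True, 30, 60)
sites = [[ref] * 10, [alt] * 5 + [ref] * 5, [ref] * 10]
print(viterbi(sites, *make_params()))
```

Working in log space avoids underflow when many reads are multiplied together, and low-quality reads contribute less because their error probability pulls the alt and ref likelihoods toward each other.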
|
43
|
Next-generation sequencing and phylogenetic signal of complete mitochondrial genomes for resolving the evolutionary history of leaf-nosed bats (Phyllostomidae). Mol Phylogenet Evol 2013; 69:728-39. [PMID: 23850499 DOI: 10.1016/j.ympev.2013.07.003] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Received: 02/22/2013] [Revised: 06/19/2013] [Accepted: 07/03/2013] [Indexed: 12/11/2022]
Abstract
Leaf-nosed bats (Phyllostomidae) are one of the most studied groups within the order Chiroptera mainly because of their outstanding species richness and diversity in morphological and ecological traits. Rapid diversification and multiple homoplasies have made the phylogeny of the family difficult to solve using morphological characters. Molecular data have helped shed light on the evolutionary history of phyllostomid bats, yet several relationships remain unresolved at the intra-familial level. Complete mitochondrial genomes have proven useful to deal with this kind of situation in other groups of mammals by providing access to a large number of molecular characters. At present, there are only two mitogenomes available for phyllostomid bats hinting at the need for further exploration of the mitogenomic approach in this group. We used both standard Sanger sequencing of PCR products and next-generation sequencing (NGS) of shotgun genomic DNA to obtain new complete mitochondrial genomes from 10 species of phyllostomid bats, including representatives of major subfamilies, plus one outgroup belonging to the closely-related mormoopids. We then evaluated the contribution of mitogenomics to the resolution of the phylogeny of leaf-nosed bats and compared the results to those based on mitochondrial genes and the RAG2 and VWF nuclear markers. Our results demonstrate the advantages of the Illumina NGS approach to efficiently obtain mitogenomes of phyllostomid bats. The phylogenetic signal provided by entire mitogenomes is highly comparable to that of a concatenation of individual mitochondrial and nuclear markers, and allows increasing both resolution and statistical support for several clades. This enhanced phylogenetic signal is the result of combining markers with heterogeneous evolutionary rates representing a large number of nucleotide sites. Our results illustrate the potential of the NGS mitogenomic approach for resolving the evolutionary history of phyllostomid bats based on a denser species sampling.
|
44
|
Wang Y, Xu L, Chen Y, Shen H, Gong Y, Limera C, Liu L. Transcriptome profiling of radish (Raphanus sativus L.) root and identification of genes involved in response to Lead (Pb) stress with next generation sequencing. PLoS One 2013; 8:e66539. [PMID: 23840502 PMCID: PMC3688795 DOI: 10.1371/journal.pone.0066539] [Citation(s) in RCA: 67] [Impact Index Per Article: 5.6] [Received: 02/18/2013] [Accepted: 05/07/2013] [Indexed: 11/19/2022] Open
Abstract
Lead (Pb), one of the most toxic heavy metals, can be absorbed and accumulated by plant roots and then enter the food chain resulting in potential health risks for human beings. The radish (Raphanus sativus L.) is an important root vegetable crop with fleshy taproots as the edible parts. Little is known about the mechanism by which radishes respond to Pb stress at the molecular level. In this study, Next Generation Sequencing (NGS)-based RNA-seq technology was employed to characterize the de novo transcriptome of radish roots and identify differentially expressed genes (DEGs) during Pb stress. A total of 68,940 assembled unique transcripts including 33,337 unigenes were obtained from radish root cDNA samples. Based on the assembled de novo transcriptome, 4,614 DEGs were detected between the two libraries of untreated (CK) and Pb-treated (Pb1000) roots. Gene Ontology (GO) and pathway enrichment analysis revealed that upregulated DEGs under Pb stress are predominately involved in defense responses in cell walls and glutathione metabolism-related processes, while downregulated DEGs were mainly involved in carbohydrate metabolism-related pathways. The expression patterns of 22 selected genes were validated by quantitative real-time PCR, and the results were highly accordant with the Solexa analysis. Furthermore, many candidate genes, which were involved in defense and detoxification mechanisms including signaling protein kinases, transcription factors, metal transporters and chelate compound biosynthesis related enzymes, were successfully identified in response to heavy metal Pb. Identification of potential DEGs involved in responses to Pb stress significantly reflected alterations in major biological processes and metabolic pathways. The molecular basis of the response to Pb stress in radishes was comprehensively characterized. 
Useful information and new insights were provided for investigating the molecular regulation mechanism of heavy metal Pb accumulation and tolerance in root vegetable crops.
Affiliation(s)
- Yan Wang
- National Key Laboratory of Crop Genetics and Germplasm Enhancement, Engineering Research Center of Horticultural Crop Germplasm Enhancement and Utilization, Ministry of Education of P. R. China
- College of Horticulture, Nanjing Agricultural University, Nanjing, P. R. China
| | - Liang Xu
- National Key Laboratory of Crop Genetics and Germplasm Enhancement, Engineering Research Center of Horticultural Crop Germplasm Enhancement and Utilization, Ministry of Education of P. R. China
- College of Horticulture, Nanjing Agricultural University, Nanjing, P. R. China
| | - Yinglong Chen
- School of Earth and Environment, and The UWA’s Institute of Agriculture, The University of Western Australia, Perth, WA, Australia
| | - Hong Shen
- National Key Laboratory of Crop Genetics and Germplasm Enhancement, Engineering Research Center of Horticultural Crop Germplasm Enhancement and Utilization, Ministry of Education of P. R. China
- College of Horticulture, Nanjing Agricultural University, Nanjing, P. R. China
| | - Yiqin Gong
- National Key Laboratory of Crop Genetics and Germplasm Enhancement, Engineering Research Center of Horticultural Crop Germplasm Enhancement and Utilization, Ministry of Education of P. R. China
- College of Horticulture, Nanjing Agricultural University, Nanjing, P. R. China
| | - Cecilia Limera
- National Key Laboratory of Crop Genetics and Germplasm Enhancement, Engineering Research Center of Horticultural Crop Germplasm Enhancement and Utilization, Ministry of Education of P. R. China
- College of Horticulture, Nanjing Agricultural University, Nanjing, P. R. China
| | - Liwang Liu
- National Key Laboratory of Crop Genetics and Germplasm Enhancement, Engineering Research Center of Horticultural Crop Germplasm Enhancement and Utilization, Ministry of Education of P. R. China
- College of Horticulture, Nanjing Agricultural University, Nanjing, P. R. China
- * E-mail:
|
45
|
Xu F, Wang W, Wang P, Jun Li M, Chung Sham P, Wang J. A fast and accurate SNP detection algorithm for next-generation sequencing data. Nat Commun 2013; 3:1258. [PMID: 23212387 DOI: 10.1038/ncomms2256] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 07/12/2012] [Accepted: 11/05/2012] [Indexed: 12/17/2022] Open
Abstract
Various methods have been developed for calling single-nucleotide polymorphisms from next-generation sequencing data. However, for satisfactory performance, most of these methods require expensive high-depth sequencing. Here, we propose a fast and accurate single-nucleotide polymorphism detection program that uses a binomial distribution-based algorithm and a mutation probability. We extensively assess this program on normal and cancer next-generation sequencing data from The Cancer Genome Atlas project and pooled data from the 1,000 Genomes Project. We also compare the performance of several state-of-the-art programs for single-nucleotide polymorphism calling and evaluate their pros and cons. We demonstrate that our program is a fast and highly accurate single-nucleotide polymorphism detection method, particularly when the sequence depth is low. The program can finish single-nucleotide polymorphism calling within four hours for 10-fold human genome next-generation sequencing data (30 gigabases) on a standard desktop computer.
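To make the binomial idea in the abstract above concrete, here is a minimal sketch under stated assumptions. It is not the authors' program and does not model their mutation probability: the per-base error rate, the significance threshold, the 3-gigabase genome size, and all function names are illustrative choices. A site is flagged when the number of non-reference reads would be very unlikely under sequencing error alone.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def is_candidate_snp(alt_reads, depth, error_rate=0.01, alpha=1e-6):
    """Flag a site if this many alt reads is unlikely under error alone.

    error_rate and alpha are assumed values, not taken from the paper.
    """
    if depth == 0:
        return False
    return binomial_tail(alt_reads, depth, error_rate) < alpha

def total_bases(coverage, genome_size_bp=3_000_000_000):
    """Sequenced bases implied by a coverage on an (assumed) ~3 Gb genome."""
    return coverage * genome_size_bp

print(is_candidate_snp(5, 10))   # 5 of 10 reads support the alt allele
print(is_candidate_snp(1, 10))   # a single alt read is compatible with error
print(total_bases(10))           # 10-fold coverage of ~3 Gb is 30 gigabases
```

The exact tail sum is affordable because the low depths that the abstract targets keep `n` small; a production caller would more likely use a precomputed table or a statistics library.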
Affiliation(s)
- Feng Xu
- Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, China
|
46
|
Rieber N, Zapatka M, Lasitschka B, Jones D, Northcott P, Hutter B, Jäger N, Kool M, Taylor M, Lichter P, Pfister S, Wolf S, Brors B, Eils R. Coverage bias and sensitivity of variant calling for four whole-genome sequencing technologies. PLoS One 2013; 8:e66621. [PMID: 23776689 PMCID: PMC3679043 DOI: 10.1371/journal.pone.0066621] [Citation(s) in RCA: 63] [Impact Index Per Article: 5.3] [Received: 11/06/2012] [Accepted: 05/08/2013] [Indexed: 01/21/2023] Open
Abstract
The emergence of high-throughput, next-generation sequencing technologies has dramatically altered the way we assess genomes in population genetics and in cancer genomics. Currently, there are four commonly used whole-genome sequencing platforms on the market: Illumina's HiSeq2000, Life Technologies' SOLiD 4 and its completely redesigned 5500xl SOLiD, and Complete Genomics' technology. A number of earlier studies have compared a subset of those sequencing platforms or compared those platforms with Sanger sequencing, which is prohibitively expensive for whole genome studies. Here we present a detailed comparison of the performance of all currently available whole genome sequencing platforms, especially regarding their ability to call SNVs and to evenly cover the genome and specific genomic regions. Unlike earlier studies, we base our comparison on four different samples, allowing us to assess the between-sample variation of the platforms. We find a pronounced GC bias in GC-rich regions for Life Technologies' platforms, with Complete Genomics performing best here, while we see the least bias in GC-poor regions for HiSeq2000 and 5500xl. HiSeq2000 gives the most uniform coverage and displays the least sample-to-sample variation. In contrast, Complete Genomics exhibits by far the smallest fraction of bases not covered, while the SOLiD platforms reveal remarkable shortcomings, especially in covering CpG islands. When comparing the performance of the four platforms for calling SNPs, HiSeq2000 and Complete Genomics achieve the highest sensitivity, while the SOLiD platforms show the lowest false positive rate. Finally, we find that integrating sequencing data from different platforms offers the potential to combine the strengths of different technologies. In summary, our results detail the strengths and weaknesses of all four whole-genome sequencing platforms. They indicate application areas that call for a specific sequencing platform and rule out others. This helps to identify the proper sequencing platform for whole genome studies with different application scopes.
Affiliation(s)
- Nora Rieber
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Marc Zapatka
- Division of Molecular Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Bärbel Lasitschka
- Genomics and Proteomics Core Facility, High Throughput Sequencing Unit, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - David Jones
- Division of Pediatric Neurooncology, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Paul Northcott
- The Arthur and Sonia Labatt Brain Tumor Research Centre, The Hospital for Sick Children Research Institute, University of Toronto, Ontario, Canada
| | - Barbara Hutter
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Natalie Jäger
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Marcel Kool
- Division of Pediatric Neurooncology, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Michael Taylor
- The Arthur and Sonia Labatt Brain Tumor Research Centre, The Hospital for Sick Children Research Institute, University of Toronto, Ontario, Canada
- Division of Neurosurgery, The Hospital for Sick Children, University of Toronto, Ontario, Canada
| | - Peter Lichter
- Division of Molecular Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Stefan Pfister
- Division of Pediatric Neurooncology, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Department of Pediatric Hematology and Oncology, Heidelberg University Hospital, Heidelberg, Germany
| | - Stephan Wolf
- Genomics and Proteomics Core Facility, High Throughput Sequencing Unit, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Benedikt Brors
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Roland Eils
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Department of Bioinformatics and Functional Genomics, Institute of Pharmacy and Molecular Biotechnology, and Bioquant, University of Heidelberg, Heidelberg, Germany
|
47
|
Shyr D, Liu Q. Next generation sequencing in cancer research and clinical application. Biol Proced Online 2013; 15:4. [PMID: 23406336 PMCID: PMC3599179 DOI: 10.1186/1480-9222-15-4] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.1] [Received: 02/06/2013] [Accepted: 02/09/2013] [Indexed: 01/29/2023] Open
Abstract
The wide application of next-generation sequencing (NGS), mainly through whole genome, exome and transcriptome sequencing, provides a high-resolution and global view of the cancer genome. Coupled with powerful bioinformatics tools, NGS promises to revolutionize cancer research, diagnosis and therapy. In this paper, we review the recent advances in NGS-based cancer genomic research as well as clinical application, summarize the current integrative oncogenomic projects, resources and computational algorithms, and discuss the challenge and future directions in the research and clinical application of cancer genomic sequencing.
Affiliation(s)
- Derek Shyr
- Washington University, 63130, St. Louis, MO, USA
| | - Qi Liu
- Center for Quantitative Sciences, Vanderbilt University School of Medicine, 37232, Nashville, TN, USA
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, 37232, Nashville, TN, USA
|
48
|
Clark SC, Egan R, Frazier PI, Wang Z. ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics 2013; 29:435-43. [PMID: 23303509 DOI: 10.1093/bioinformatics/bts723] [Citation(s) in RCA: 116] [Impact Index Per Article: 9.7] [Indexed: 02/07/2023]
Abstract
MOTIVATION Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies. RESULTS In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process. AVAILABILITY ALE is released as open source software under the UoI/NCSA license at http://www.alescore.org. It is implemented in C and Python.
Affiliation(s)
- Scott C Clark
- Center for Applied Mathematics, Cornell University, Ithaca, NY 14853, USA
|
49
|
Knauer SK, Unruhe B, Karczewski S, Hecht R, Fetz V, Bier C, Friedl S, Wollenberg B, Pries R, Habtemichael N, Heinrich UR, Stauber RH. Functional characterization of novel mutations affecting survivin (BIRC5)-mediated therapy resistance in head and neck cancer patients. Hum Mutat 2012; 34:395-404. [PMID: 23161837 DOI: 10.1002/humu.22249] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 01/01/2012] [Accepted: 10/31/2012] [Indexed: 01/25/2023]
Abstract
Survivin (BIRC5) is an acknowledged cancer therapy-resistance factor and overexpressed in head and neck squamous cell carcinomas (HNSCC). Driven by its nuclear export signal (NES), Survivin shuttles between the nucleus and the cytoplasm, and is detectable in both cellular compartments in tumor biopsies. Although predominantly nuclear Survivin is considered a favorable prognostic disease marker for HNSCC patients, the underlying molecular mechanisms are not resolved. Hence, we performed immunohistochemical and mutational analyses using laser capture microdissection on HNSCC biopsies from patients displaying high levels of nuclear Survivin. We found somatic BIRC5 mutations, c.278T>C (p.Phe93Ser), c.292C>T (p.Leu98Phe), and c.288A>G (silent), in tumor cells, but not in corresponding normal tissues. Comprehensive functional characterization of the Survivin mutants by ectopic expression and microinjection experiments revealed that p.Phe93Ser, but not p.Leu98Phe inactivated Survivin's NES, resulted in a predominantly nuclear protein, and attenuated Survivin's dual cytoprotective activity against chemoradiation-induced apoptosis. Notably, in xenotransplantation studies, HNSCC cells containing the p.Phe93Ser mutation responded significantly better to cisplatin-based chemotherapy. Collectively, our results underline the disease relevance of Survivin's nucleocytoplasmic transport, and provide first evidence that genetic inactivation of Survivin's NES may account for predominantly nuclear Survivin and increased therapy response in cancer patients.
Affiliation(s)
- Shirley K Knauer
- Institute for Molecular Biology, Centre for Medical Biotechnology, ZMB, University of Duisburg-Essen, Germany.
|
50
|
Kumar S, You FM, Cloutier S. Genome wide SNP discovery in flax through next generation sequencing of reduced representation libraries. BMC Genomics 2012; 13:684. [PMID: 23216845 PMCID: PMC3557168 DOI: 10.1186/1471-2164-13-684] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Received: 06/12/2012] [Accepted: 11/29/2012] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Flax (Linum usitatissimum L.) is a significant fibre and oilseed crop. Current flax molecular markers, including isozymes, RAPDs, AFLPs and SSRs are of limited use in the construction of high density linkage maps and for association mapping applications due to factors such as low reproducibility, intense labour requirements and/or limited numbers. We report here on the use of a reduced representation library strategy combined with next generation Illumina sequencing for rapid and large scale discovery of SNPs in eight flax genotypes. SNP discovery was performed through in silico analysis of the sequencing data against the whole genome shotgun sequence assembly of flax genotype CDC Bethune. Genotyping-by-sequencing of an F6-derived recombinant inbred line population provided validation of the SNPs. RESULTS Reduced representation libraries of eight flax genotypes were sequenced on the Illumina sequencing platform resulting in sequence coverage ranging from 4.33 to 15.64X (genome equivalents). Depending on the relatedness of the genotypes and the number and length of the reads, between 78% and 93% of the reads mapped onto the CDC Bethune whole genome shotgun sequence assembly. A total of 55,465 SNPs were discovered with the largest number of SNPs belonging to the genotypes with the highest mapping coverage percentage. Approximately 84% of the SNPs discovered were identified in a single genotype, 13% were shared between any two genotypes and the remaining 3% in three or more. Nearly a quarter of the SNPs were found in genic regions. A total of 4,706 out of 4,863 SNPs discovered in Macbeth were validated using genotyping-by-sequencing of 96 F6 individuals from a recombinant inbred line population derived from a cross between CDC Bethune and Macbeth, corresponding to a validation rate of 96.8%. CONCLUSIONS Next generation sequencing of reduced representation libraries was successfully implemented for genome-wide SNP discovery from flax. 
The genotyping-by-sequencing approach proved to be efficient for validation. The SNP resources generated in this work will assist in generating high density maps of flax and facilitate QTL discovery, marker-assisted selection, phylogenetic analyses, association mapping and anchoring of the whole genome shotgun sequence.
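The validation rate quoted in the abstract above can be checked with a few lines of arithmetic. The counts come from the abstract; the function name is only illustrative.

```python
def validation_rate(validated, discovered):
    """Percentage of discovered SNPs confirmed by genotyping-by-sequencing."""
    if discovered <= 0:
        raise ValueError("discovered must be positive")
    return 100.0 * validated / discovered

# Counts reported for the Macbeth genotype in the abstract.
rate = validation_rate(4706, 4863)
print(f"{rate:.1f}%")  # 96.8%

# The reported sharing categories (one genotype, two, three or more) sum to 100%.
print(84 + 13 + 3)
```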
Affiliation(s)
- Santosh Kumar
- Cereal Research Centre, Agriculture and Agri-Food Canada, 195 Dafoe Road, Winnipeg, Manitoba, R3T 2M9, Canada
- Department of Plant Science, University of Manitoba, 66 Dafoe Road, Winnipeg, Manitoba, R3T 2N2, Canada
| | - Frank M You
- Cereal Research Centre, Agriculture and Agri-Food Canada, 195 Dafoe Road, Winnipeg, Manitoba, R3T 2M9, Canada
| | - Sylvie Cloutier
- Cereal Research Centre, Agriculture and Agri-Food Canada, 195 Dafoe Road, Winnipeg, Manitoba, R3T 2M9, Canada
- Department of Plant Science, University of Manitoba, 66 Dafoe Road, Winnipeg, Manitoba, R3T 2N2, Canada
|