1
|
Jiang Z, Peng Z, Wei Z, Sun J, Luo Y, Bie L, Zhang G, Wang Y. A deep learning-based method enables the automatic and accurate assembly of chromosome-level genomes. Nucleic Acids Res 2024:gkae789. [PMID: 39287126 DOI: 10.1093/nar/gkae789] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2024] [Revised: 08/25/2024] [Accepted: 08/30/2024] [Indexed: 09/19/2024] Open
Abstract
The application of high-throughput chromosome conformation capture (Hi-C) technology enables the construction of chromosome-level assemblies. However, the correction of errors and the anchoring of sequences to chromosomes in the assembly remain significant challenges. In this study, we developed a deep learning-based method, AutoHiC, to address the challenges in chromosome-level genome assembly by enhancing contiguity and accuracy. Conventional Hi-C-aided scaffolding often requires manual refinement, but AutoHiC instead utilizes Hi-C data for automated workflows and iterative error correction. When trained on data from 300+ species, AutoHiC demonstrated a robust average error detection accuracy exceeding 90%. The benchmarking results confirmed its significant impact on genome contiguity and error correction. The innovative approach and comprehensive results of AutoHiC constitute a breakthrough in automated error detection, promising more accurate genome assemblies for advancing genomics research.
Affiliation(s)
- Zijie Jiang
- Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
| | - Zhixiang Peng
- Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
| | - Zhaoyuan Wei
- Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
| | - Jiahe Sun
- Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
| | - Yongjiang Luo
- Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
| | - Lingzi Bie
- Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
| | - Guoqing Zhang
- Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
| | - Yi Wang
- Integrative Science Center of Germplasm Creation in Western China (CHONGQING) Science City, Biological Science Research Center, Southwest University, Chongqing, China
| |
|
2
|
Mante J, Groover KE, Pullen RM. Environmental community transcriptomics: strategies and struggles. Brief Funct Genomics 2024:elae033. [PMID: 39183066 DOI: 10.1093/bfgp/elae033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2024] [Revised: 08/02/2024] [Accepted: 08/08/2024] [Indexed: 08/27/2024] Open
Abstract
Transcriptomics is the study of RNA transcripts, the portion of the genome that is transcribed, in a specific cell, tissue, or organism. Transcriptomics provides insight into gene expression patterns, regulation, and the underlying mechanisms of cellular processes. Community transcriptomics takes this a step further by studying the RNA transcripts from environmental assemblies of organisms, with the intention of better understanding the interactions between members of the community. Community transcriptomics requires successful extraction of RNA from a diverse set of organisms and subsequent analysis, either by mapping the reads to a reference genome or by de novo assembly of the reads. Both the extraction protocols and the analysis steps can pose hurdles for community transcriptomics. This review covers advances in transcriptomic techniques and assesses the viability of applying them to community transcriptomics.
Affiliation(s)
- Jeanet Mante
- Oak Ridge Associated Universities, Oak Ridge, 37831, TN, USA
| | - Kyra E Groover
- Department of Molecular Biosciences, University of Texas at Austin, Austin, 78705, TX, USA
| | - Randi M Pullen
- DEVCOM Army Research Laboratory, Adelphi, 20783, MD, USA
| |
|
3
|
Ilík V, Schwarz EM, Nosková E, Pafčo B. Hookworm genomics: dusk or dawn? Trends Parasitol 2024; 40:452-465. [PMID: 38677925 DOI: 10.1016/j.pt.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Revised: 03/28/2024] [Accepted: 04/04/2024] [Indexed: 04/29/2024]
Abstract
Hookworms are parasites, closely related to the model nematode Caenorhabditis elegans, that are a major economic and health burden worldwide. Primarily three hookworm species (Necator americanus, Ancylostoma duodenale, and Ancylostoma ceylanicum) infect humans. Another 100 hookworm species from 19 genera infect primates, ruminants, and carnivores. Genetic data exist for only seven of these species. Genome sequences are available from only four of these species in two genera, leaving 96 others (particularly those parasitizing wildlife) without any genomic data. The most recent hookworm genomes were published 5 years ago, leaving the field in a dusk. However, assembling genomes from single hookworms may bring a new dawn. Here we summarize advances, challenges, and opportunities for studying these neglected but important parasitic nematodes.
Affiliation(s)
- Vladislav Ilík
- Institute of Vertebrate Biology, Czech Academy of Sciences, Brno, Czech Republic; Department of Botany and Zoology, Faculty of Science, Masaryk University, Brno, Czech Republic.
| | - Erich M Schwarz
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
| | - Eva Nosková
- Institute of Vertebrate Biology, Czech Academy of Sciences, Brno, Czech Republic; Department of Botany and Zoology, Faculty of Science, Masaryk University, Brno, Czech Republic
| | - Barbora Pafčo
- Institute of Vertebrate Biology, Czech Academy of Sciences, Brno, Czech Republic.
| |
|
4
|
Wade KJ, Suseno R, Kizer K, Williams J, Boquett J, Caillier S, Pollock NR, Renschen A, Santaniello A, Oksenberg JR, Norman PJ, Augusto DG, Hollenbach JA. MHConstructor: A high-throughput, haplotype-informed solution to the MHC assembly challenge. bioRxiv 2024:2024.05.20.595060. [PMID: 38826378 PMCID: PMC11142050 DOI: 10.1101/2024.05.20.595060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
The extremely high levels of genetic polymorphism within the human major histocompatibility complex (MHC) limit the usefulness of reference-based alignment methods for sequence assembly. We incorporate a short-read de novo assembly algorithm into a workflow for novel application to the MHC. MHConstructor is a containerized pipeline designed for high-throughput, haplotype-informed, reproducible assembly of both whole-genome sequencing and target-capture short-read data in large population cohorts. To date, no other self-contained tool exists for the generation of de novo MHC assemblies from short-read data. MHConstructor facilitates widespread access to high-quality, alignment-free MHC sequence analysis.
Affiliation(s)
- Kristen J. Wade
- Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, CA, United States
| | - Rayo Suseno
- Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, CA, United States
| | - Kerry Kizer
- Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, CA, United States
| | - Jacqueline Williams
- Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, CA, United States
| | - Juliano Boquett
- Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, CA, United States
| | - Stacy Caillier
- Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, CA, United States
| | - Nicholas R. Pollock
- Department of Biomedical Informatics, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
- Department of Immunology and Microbiology, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
| | - Adam Renschen
- Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, CA, United States
| | - Adam Santaniello
- Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, CA, United States
| | - Jorge R. Oksenberg
- Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, CA, United States
| | - Paul J. Norman
- Department of Biomedical Informatics, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
- Department of Immunology and Microbiology, Anschutz Medical Campus, University of Colorado, Aurora, Colorado, USA
| | - Danillo G. Augusto
- Department of Biological Sciences, University of North Carolina Charlotte, Charlotte, NC, United States
- Programa de Pós-Graduação em Genética, Universidade Federal do Paraná, Curitiba, Brazil
| | - Jill A. Hollenbach
- Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, CA, United States
- Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, United States
| |
|
5
|
D'Addiego J, Wand N, Afrough B, Fletcher T, Kurosaki Y, Leblebicioglu H, Hewson R. Recovery of complete genome sequences of Crimean-Congo haemorrhagic fever virus (CCHFV) directly from clinical samples: A comparative study between targeted enrichment and metagenomic approaches. J Virol Methods 2024; 323:114833. [PMID: 37879367 DOI: 10.1016/j.jviromet.2023.114833] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 10/06/2023] [Accepted: 10/20/2023] [Indexed: 10/27/2023]
Abstract
Crimean-Congo haemorrhagic fever (CCHF) is the most prevalent human tick-borne viral disease, endemic to the Balkans, Africa, the Middle East and Asia. There are currently no licensed vaccines or effective antivirals against CCHF. CCHF virus (CCHFV) has a negative-sense, segmented, tripartite RNA genome consisting of the small (S), medium (M) and large (L) segments. Depending on the segment utilised for genetic affiliation, there are up to 7 circulating lineages of CCHFV. The current lack of geographical representation of CCHFV sequences in various repositories highlights a requirement for increased CCHFV sequencing capabilities in endemic regions. We have optimised and established a multiplex PCR tiling methodology for the targeted enrichment of complete genomes of the Europe 1 CCHFV lineage directly from clinical samples and compared its performance to a non-targeted enrichment approach on both short-read and long-read sequencing platforms. We found a statistically significant increase in mapped viral sequencing reads produced with our targeted enrichment approach. This allowed us to recover near-complete S segment sequences and above 90% of the M and L segment sequences for samples with Ct values as high as 31.3. This study demonstrates the superiority of a targeted enrichment approach for recovery of CCHFV genomic sequences from samples with low virus titre. CCHFV is an important vector-borne human pathogen with wide geographical distribution. The validated methodology reported here adds value to front-line public health laboratories employing genomic sequencing for CCHFV Europe 1 lineage surveillance, particularly in the Balkan and Middle Eastern territories currently monitoring the spread of the pathogen. Tracking the genomic evolution of the virus across regions improves risk assessment and directly informs the development of diagnostics, therapeutics, and vaccines.
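The abstract mentions a multiplex PCR tiling methodology without giving its design. As a rough illustration of the general idea only, the sketch below lays overlapping amplicons across one genome segment and alternates them between two primer pools so that neighbouring amplicons are not amplified in the same reaction. The function name `tile_amplicons`, the segment length, amplicon length and overlap are all hypothetical values chosen for the example; they are not taken from the study, and real primer design also has to account for primer binding sites and lineage variation.

```python
from typing import List, Tuple


def tile_amplicons(segment_length: int,
                   amplicon_length: int,
                   overlap: int) -> List[Tuple[int, int, int]]:
    """Return (start, end, pool) tuples covering a segment with overlapping tiles.

    Coordinates are 0-based, end-exclusive. Pools alternate 1, 2, 1, 2, ...
    so that amplicons sharing an overlap land in different pools.
    All parameters are illustrative assumptions, not values from the paper.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    if not 0 <= overlap < amplicon_length:
        raise ValueError("overlap must be >= 0 and smaller than amplicon_length")

    step = amplicon_length - overlap  # distance between consecutive tile starts
    amplicons = []
    start = 0
    index = 0
    while True:
        # Clip the last tile at the end of the segment.
        end = min(start + amplicon_length, segment_length)
        pool = 1 + index % 2
        amplicons.append((start, end, pool))
        if end == segment_length:
            break
        start += step
        index += 1
    return amplicons


if __name__ == "__main__":
    # Hypothetical 1000 bp segment, 400 bp amplicons, 100 bp overlap.
    for start, end, pool in tile_amplicons(1000, 400, 100):
        print(f"pool {pool}: {start}-{end}")
```

With the example parameters, each pair of consecutive tiles shares 100 bp, while tiles within the same pool do not overlap.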
Affiliation(s)
- Jake D'Addiego
- UK Health Security Agency, Science Group, Porton Down, Salisbury, United Kingdom; Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, United Kingdom.
| | - Nadina Wand
- UK Health Security Agency, Science Group, Porton Down, Salisbury, United Kingdom
| | - Babak Afrough
- UK Health Security Agency, Science Group, Porton Down, Salisbury, United Kingdom
| | - Tom Fletcher
- Department of Clinical Sciences, Liverpool School of Tropical Medicine, Liverpool, United Kingdom
| | - Yohei Kurosaki
- National Research Centre for the Control and Prevention of Infectious Diseases, Nagasaki University, Japan
| | | | - Roger Hewson
- UK Health Security Agency, Science Group, Porton Down, Salisbury, United Kingdom; Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, United Kingdom; Department of Clinical Sciences, Liverpool School of Tropical Medicine, Liverpool, United Kingdom; National Research Centre for the Control and Prevention of Infectious Diseases, Nagasaki University, Japan
| |
|
6
|
Srivastava SK, Parker C, O'Brien CN, Tucker MS, Thompson PC, Rosenthal BM, Dubey JP, Khan A, Jenkins MC. Chromosomal scale assembly reveals localized structural variants in avian caecal coccidian parasite Eimeria tenella. Sci Rep 2023; 13:22802. [PMID: 38129566 PMCID: PMC10739835 DOI: 10.1038/s41598-023-50117-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2023] [Accepted: 12/15/2023] [Indexed: 12/23/2023] Open
Abstract
Eimeria tenella is a major cause of caecal coccidiosis in commercial poultry chickens worldwide. Here, we report a chromosomal-scale assembly of Eimeria tenella strain APU2, a strain isolated from commercial broiler chickens in the U.S. We obtained 100× coverage from Oxford Nanopore Technology (ONT) sequencing and more than 800× coverage from Illumina NextSeq sequencing. We created the assembly using the hybrid approach implemented in MaSuRCA, achieving a contiguous 51.34 Mb chromosomal-scale scaffolding that enabled identification of structural variations. The AUGUSTUS pipeline predicted 8060 genes, and BUSCO deemed the genome 99% complete; 6278 (78%) genes were annotated with Pfam domains, and 1395 genes were assigned GO terms. Comparing the E. tenella strains APU2 (US isolate) and Houghton (UK isolate) revealed 62,905 high-stringency differences, of which 45,322 are single nucleotide polymorphisms (SNPs) (0.088%). The transition/transversion (ts/tv) ratio among the SNPs is 1.63. The strains possess conserved gene order but have profound sequence heterogeneity in several chromosomal segments (chr 2, 11 and 15). Genic and intergenic variation in defined gene families was evaluated between the two strains to possibly identify sequences under selection. The average genic nucleotide diversity was 2.8 with an average gene length of 2 kb (0.145%) at the genic level. We examined population structure using available E. tenella sequences in NCBI, revealing that the two E. tenella isolates from the U.S. (E. tenella APU2 and Wisconsin, "ERR296879") share a common maternal inheritance with E. tenella Houghton. Our chromosomal-level assembly promotes insight into Eimeria biology and evolution, hastening drug discovery and vaccine development.
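The SNP percentage quoted in the abstract appears consistent with dividing the 45,322 SNPs by the 51.34 Mb assembly size; that reading is an assumption, since the abstract does not state the denominator. The short sketch below reproduces that arithmetic and shows how a ts/tv ratio is computed from reference/alternate base pairs. The function names and the toy SNP list are hypothetical and are not data from the study.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def snp_density_percent(snp_count: int, genome_size_bp: int) -> float:
    """SNPs per base expressed as a percentage of the genome size."""
    return 100.0 * snp_count / genome_size_bp


def ts_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions for (ref, alt) base pairs.

    A transition swaps purine<->purine or pyrimidine<->pyrimidine;
    every other substitution is a transversion.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        if ref == alt:
            continue  # not a substitution
        pair = {ref, alt}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions


if __name__ == "__main__":
    # Figures quoted in the abstract; the denominator is assumed to be the assembly size.
    print(round(snp_density_percent(45_322, 51_340_000), 3))

    # Toy SNP list, invented for illustration only.
    toy_snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A"), ("T", "G")]
    print(ts_tv_ratio(toy_snps))
```

A ts/tv ratio above 0.5 indicates that transitions occur more often than expected if all substitution types were equally likely, since there are twice as many possible transversions as transitions.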
Affiliation(s)
- Subodh K Srivastava
- USDA-ARS Animal Parasitic Diseases Laboratory, Beltsville Agricultural Research Center, BARC-East Building 1040, 10300 Baltimore Ave., Beltsville, MD, 20705, USA.
| | - Carolyn Parker
- USDA-ARS Animal Parasitic Diseases Laboratory, Beltsville Agricultural Research Center, BARC-East Building 1040, 10300 Baltimore Ave., Beltsville, MD, 20705, USA
| | - Celia N O'Brien
- USDA-ARS Animal Parasitic Diseases Laboratory, Beltsville Agricultural Research Center, BARC-East Building 1040, 10300 Baltimore Ave., Beltsville, MD, 20705, USA
| | - Matthew S Tucker
- USDA-ARS Animal Parasitic Diseases Laboratory, Beltsville Agricultural Research Center, BARC-East Building 1040, 10300 Baltimore Ave., Beltsville, MD, 20705, USA
| | - Peter C Thompson
- USDA-ARS Animal Parasitic Diseases Laboratory, Beltsville Agricultural Research Center, BARC-East Building 1040, 10300 Baltimore Ave., Beltsville, MD, 20705, USA
| | - Benjamin M Rosenthal
- USDA-ARS Animal Parasitic Diseases Laboratory, Beltsville Agricultural Research Center, BARC-East Building 1040, 10300 Baltimore Ave., Beltsville, MD, 20705, USA
| | - Jitender P Dubey
- USDA-ARS Animal Parasitic Diseases Laboratory, Beltsville Agricultural Research Center, BARC-East Building 1040, 10300 Baltimore Ave., Beltsville, MD, 20705, USA
| | - Asis Khan
- USDA-ARS Animal Parasitic Diseases Laboratory, Beltsville Agricultural Research Center, BARC-East Building 1040, 10300 Baltimore Ave., Beltsville, MD, 20705, USA
| | - Mark C Jenkins
- USDA-ARS Animal Parasitic Diseases Laboratory, Beltsville Agricultural Research Center, BARC-East Building 1040, 10300 Baltimore Ave., Beltsville, MD, 20705, USA.
| |
|
7
|
Narh Mensah DL, Wingfield BD, Coetzee MP. A practical approach to genome assembly and annotation of Basidiomycota using the example of Armillaria. Biotechniques 2023; 75:115-128. [PMID: 37681497 DOI: 10.2144/btn-2023-0023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/09/2023] Open
Abstract
Technological advancements in genome sequencing, assembly and annotation platforms and algorithms that resulted in several genomic studies have created an opportunity to further our understanding of the biology of phytopathogens, including Armillaria species. Most Armillaria species are facultative necrotrophs that cause root- and stem-rot, usually on woody plants, significantly impacting agriculture and forestry worldwide. Genome sequencing, assembly and annotation in terms of samples used and methods applied in Armillaria genome projects are evaluated in this review. Infographic guidelines and a database of resources to facilitate future Armillaria genome projects were developed. Knowledge gained from genomic studies of Armillaria species is summarized and prospects for further research are provided. This guide can be applied to other diploid and dikaryotic fungal genomics.
Affiliation(s)
- Deborah L Narh Mensah
- Department of Biochemistry, Genetics & Microbiology, Forestry & Agricultural Biotechnology Institute (FABI), Faculty of Natural & Agricultural Sciences, University of Pretoria, Pretoria, Gauteng, South Africa
- Council for Scientific and Industrial Research - Food Research Institute (CSIR-FRI), PO Box M20, Accra, Ghana
| | - Brenda D Wingfield
- Department of Biochemistry, Genetics & Microbiology, Forestry & Agricultural Biotechnology Institute (FABI), Faculty of Natural & Agricultural Sciences, University of Pretoria, Pretoria, Gauteng, South Africa
| | - Martin Pa Coetzee
- Department of Biochemistry, Genetics & Microbiology, Forestry & Agricultural Biotechnology Institute (FABI), Faculty of Natural & Agricultural Sciences, University of Pretoria, Pretoria, Gauteng, South Africa
| |
|
8
|
Ibañez-Lligoña M, Colomer-Castell S, González-Sánchez A, Gregori J, Campos C, Garcia-Cehic D, Andrés C, Piñana M, Pumarola T, Rodríguez-Frias F, Antón A, Quer J. Bioinformatic Tools for NGS-Based Metagenomics to Improve the Clinical Diagnosis of Emerging, Re-Emerging and New Viruses. Viruses 2023; 15:v15020587. [PMID: 36851800 PMCID: PMC9965957 DOI: 10.3390/v15020587] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 01/13/2023] [Revised: 02/16/2023] [Accepted: 02/17/2023] [Indexed: 02/24/2023] Open
Abstract
Epidemics and pandemics have occurred since the beginning of time, resulting in millions of deaths. Many such disease outbreaks are caused by viruses. Some viruses, particularly RNA viruses, are characterized by their high genetic variability, and this can affect certain phenotypic features: tropism, antigenicity, and susceptibility to antiviral drugs, vaccines, and the host immune response. The best strategy to face the emergence of new infectious genomes is prompt identification. However, currently available diagnostic tests are often limited for detecting new agents. High-throughput next-generation sequencing technologies based on metagenomics may be the solution to detect new infectious genomes and properly diagnose certain diseases. Metagenomic techniques enable the identification and characterization of disease-causing agents, but they require a large amount of genetic material and involve complex bioinformatic analyses. A wide variety of analytical tools can be used in the quality control and pre-processing of metagenomic data, filtering of untargeted sequences, assembly and quality control of reads, and taxonomic profiling of sequences to identify new viruses and ones that have been sequenced and uploaded to dedicated databases. Although there have been huge advances in the field of metagenomics, there is still a lack of consensus about which of the various approaches should be used for specific data analysis tasks. In this review, we provide some background on the study of viral infections, describe the contribution of metagenomics to this field, and place special emphasis on the bioinformatic tools (with their capabilities and limitations) available for use in metagenomic analyses of viral pathogens.
Affiliation(s)
- Marta Ibañez-Lligoña
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Biochemistry and Molecular Biology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
| | - Sergi Colomer-Castell
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Biochemistry and Molecular Biology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
| | - Alejandra González-Sánchez
- Microbiology Department, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
| | - Josep Gregori
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
| | - Carolina Campos
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Biochemistry and Molecular Biology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
| | - Damir Garcia-Cehic
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
| | - Cristina Andrés
- Microbiology Department, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
| | - Maria Piñana
- Microbiology Department, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
| | - Tomàs Pumarola
- Microbiology Department, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Microbiology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
| | - Francisco Rodríguez-Frias
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Department of Basic Sciences, Universitat Internacional de Catalunya, Sant Cugat del Vallès, 08195 Barcelona, Spain
| | - Andrés Antón
- Microbiology Department, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Microbiology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
| | - Josep Quer
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Biochemistry and Molecular Biology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
| |
|
9
|
Muñoz-Barrera A, Rubio-Rodríguez LA, Díaz-de Usera A, Jáspez D, Lorenzo-Salazar JM, González-Montelongo R, García-Olivares V, Flores C. From Samples to Germline and Somatic Sequence Variation: A Focus on Next-Generation Sequencing in Melanoma Research. Life (Basel) 2022; 12:1939. [PMID: 36431075 PMCID: PMC9695713 DOI: 10.3390/life12111939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 11/12/2022] [Accepted: 11/16/2022] [Indexed: 11/24/2022] Open
Abstract
Next-generation sequencing (NGS) applications have flourished in the last decade, permitting the identification of cancer driver genes and profoundly expanding the possibilities of genomic studies of cancer, including melanoma. Here we aimed to present a technical review across many of the methodological approaches brought by the use of NGS applications with a focus on assessing germline and somatic sequence variation. We provide cautionary notes and discuss key technical details involved in library preparation, the most common problems with the samples, and guidance to circumvent them. We also provide an overview of the sequence-based methods for cancer genomics, exposing the pros and cons of targeted sequencing vs. exome or whole-genome sequencing (WGS), the fundamentals of the most common commercial platforms, and a comparison of throughputs and key applications. Details of the steps and the main software involved in the bioinformatics processing of the sequencing results, from preprocessing to variant prioritization and filtering, are also provided in the context of the full spectrum of genetic variation (SNVs, indels, CNVs, structural variation, and gene fusions). Finally, we put the emphasis on selected bioinformatic pipelines behind (a) short-read WGS identification of small germline and somatic variants, (b) detection of gene fusions from transcriptomes, and (c) de novo assembly of genomes from long-read WGS data. Overall, we provide comprehensive guidance across the main methodological procedures involved in obtaining sequencing results for the most common short- and long-read NGS platforms, highlighting key applications in melanoma research.
Affiliation(s)
- Adrián Muñoz-Barrera
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - Luis A. Rubio-Rodríguez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - Ana Díaz-de Usera
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria, 38010 Santa Cruz de Tenerife, Spain
| | - David Jáspez
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - José M. Lorenzo-Salazar
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - Rafaela González-Montelongo
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - Víctor García-Olivares
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
| | - Carlos Flores
- Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
- Research Unit, Hospital Universitario Nuestra Señora de Candelaria, 38010 Santa Cruz de Tenerife, Spain
- CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, 28029 Madrid, Spain
- Facultad de Ciencias de la Salud, Universidad Fernando de Pessoa Canarias, 35450 Las Palmas de Gran Canaria, Spain
| |
|
10
|
Schaal W, Ameur A, Olsson-Strömberg U, Hermanson M, Cavelier L, Spjuth O. Migrating to Long-Read Sequencing for Clinical Routine BCR-ABL1 TKI Resistance Mutation Screening. Cancer Inform 2022; 21:11769351221110872. [PMID: 35860345 PMCID: PMC9290162 DOI: 10.1177/11769351221110872] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/23/2022] [Accepted: 05/22/2022] [Indexed: 11/15/2022] Open
Abstract
Objective: The aim of this project was to implement long-read sequencing for BCR-ABL1 TKI resistance mutation screening in a clinical setting for patients undergoing treatment for chronic myeloid leukemia. Materials and Methods: Processes were established for registering and transferring samples from the clinic to an academic sequencing facility for long-read sequencing. An automated analysis pipeline for detecting mutations was established, and an information system was implemented comprising features for data management, analysis and visualization. Clinical validation was performed by identifying BCR-ABL1 TKI resistance mutations by Sanger and long-read sequencing in parallel. The developed software is available as open source via GitHub at https://github.com/pharmbio/clamp. Results: The information system enabled traceable transfer of samples from the clinic to the sequencing facility, robust and automated analysis of the long-read sequence data, and communication of results from sequence analysis in a reporting format that could be easily interpreted and acted upon by clinical experts. In a validation study, all 17 resistance mutations found by Sanger sequencing were also detected by long-read sequencing. An additional 16 mutations were found only by long-read sequencing, all of them with frequencies below the limit of detection for Sanger sequencing. The clonal distributions of co-existing mutations were automatically resolved through the long-read data analysis. After the implementation and validation, the clinical laboratory switched their routine protocol from using Sanger to long-read sequencing for this application. Conclusions: Long-read sequencing delivers results with higher sensitivity compared to Sanger sequencing and enables earlier detection of emerging TKI resistance mutations. The developed processes, analysis workflow, and software components lower barriers for adoption and could be extended to other applications.
Affiliation(s)
- Wesley Schaal
- Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden.,Pincer Bio AB, Uppsala, Sweden
| | - Adam Ameur
- Pincer Bio AB, Uppsala, Sweden.,Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
| | | | - Monica Hermanson
- Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
| | - Lucia Cavelier
- Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
| | - Ola Spjuth
- Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden.,Pincer Bio AB, Uppsala, Sweden
| |
|
11
|
Nwachukwu BC, Babalola OO. Metagenomics: A Tool for Exploring Key Microbiome With the Potentials for Improving Sustainable Agriculture. FRONTIERS IN SUSTAINABLE FOOD SYSTEMS 2022. [DOI: 10.3389/fsufs.2022.886987] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Microorganisms are immense in nature and exist in every imaginable ecological niche, performing a wide range of metabolic processes. Unfortunately, using traditional microbiological methods, most microorganisms remain unculturable. The emergence of metagenomics has resolved the challenge of capturing the entire microbial community in an environmental sample by enabling the analysis of whole genomes without requiring culturing. Metagenomics as a non-culture approach encompasses a greater amount of genetic information than traditional approaches. The plant root-associated microbial community is essential for plant growth and development; hence, understanding the interactions between microorganisms, soil, and plants is essential to improving crop yields in rural and urban agriculture. Although some of these microorganisms are currently unculturable in the laboratory, metagenomic techniques may nevertheless be used to identify the microorganisms and their functional traits. A detailed understanding of these organisms and their interactions should facilitate an improvement of plant growth and sustainable crop production in soil and soilless agriculture. Therefore, the objective of this review is to provide insights into metagenomic techniques to study plant root-associated microbiota and microbial ecology. In addition, the different DNA-based techniques and their role in elaborating plant microbiomes are discussed. As an understanding of these microorganisms and their biotechnological potential is unlocked through metagenomics, they can be used to develop new, useful and unique bio-fertilizers and bio-pesticides that are not harmful to the environment.
|
12
|
Dasgupta MG, Parveen AM, Rajasugunasekar D, Ulaganathan K. Wood transcriptome analysis and expression variation of lignin biosynthetic pathway transcripts in Ailanthus excelsa Roxb., a multi-purpose tropical tree species. J Biosci 2021. [DOI: 10.1007/s12038-021-00218-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
13
|
|
14
|
Sarkar A, Al-Ars Z, Bertels K. QuASeR: Quantum Accelerated de novo DNA sequence reconstruction. PLoS One 2021; 16:e0249850. [PMID: 33844699 PMCID: PMC8041170 DOI: 10.1371/journal.pone.0249850] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 09/24/2020] [Accepted: 03/24/2021] [Indexed: 01/10/2023] Open
Abstract
In this article, we present QuASeR, a reference-free DNA sequence reconstruction implementation via de novo assembly on both gate-based and quantum annealing platforms. This is the first time this important application in bioinformatics is modeled using quantum computation. Each one of the four steps of the implementation (TSP, QUBO, Hamiltonians and QAOA) is explained with a proof-of-concept example to target both the genomics research community and quantum application developers in a self-contained manner. The implementation and results on executing the algorithm from a set of DNA reads to a reconstructed sequence, on a gate-based quantum simulator, the D-Wave quantum annealing simulator and hardware are detailed. We also highlight the limitations of current classical simulation and available quantum hardware systems. The implementation is open-source and can be found on https://github.com/QE-Lab/QuASeR.
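The abstract names a travelling-salesman (TSP) formulation as the first step of the reconstruction. The sketch below is only a classical, brute-force illustration of that view and is not the QuASeR implementation: reads are treated as cities, the suffix-prefix overlap between two reads is the reward for visiting them consecutively, and the ordering with the largest total overlap is merged into one sequence. The function names and the three toy reads are hypothetical, and the QUBO, Hamiltonian and QAOA steps are not shown.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def best_order(reads):
    """Exhaustively pick the read ordering with the largest total overlap.

    This is a path-style TSP solved by brute force, so it is only
    practical for a handful of reads.
    """
    def total_overlap(order):
        return sum(overlap(x, y) for x, y in zip(order, order[1:]))

    return max(permutations(reads), key=total_overlap)


def merge(order):
    """Concatenate reads in the given order, collapsing each overlap once."""
    sequence = order[0]
    for prev, nxt in zip(order, order[1:]):
        sequence += nxt[overlap(prev, nxt):]
    return sequence


# Hypothetical toy reads, chosen so that one ordering is clearly best.
reads = ["CGTGA", "ATGGC", "GGCGT"]
order = best_order(reads)
print(order, merge(order))
```

The factorial growth of the search space in this sketch is the kind of cost that motivates looking at alternative optimization platforms for the same objective.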
Affiliation(s)
- Aritra Sarkar
- Department of Quantum and Computer Engineering, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
| | - Zaid Al-Ars
- Department of Quantum and Computer Engineering, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
| | - Koen Bertels
- Department of Informatics Engineering, Faculty of Engineering, University of Porto, Porto, Portugal
| |
|
15
|
Cortese IJ, Castrillo ML, Onetto AL, Bich GÁ, Zapata PD, Laczeski ME. De novo genome assembly of Bacillus altitudinis 19RS3 and Bacillus altitudinis T5S-T4, two plant growth-promoting bacteria isolated from Ilex paraguariensis St. Hil. (yerba mate). PLoS One 2021; 16:e0248274. [PMID: 33705487 PMCID: PMC7954119 DOI: 10.1371/journal.pone.0248274] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/14/2020] [Accepted: 02/23/2021] [Indexed: 11/18/2022] Open
Abstract
Plant growth-promoting bacteria (PGPB) are a heterogeneous group of bacteria that can exert beneficial effects on plant growth directly or indirectly by different mechanisms. PGPB-based inoculant formulation has been used to replace chemical fertilizers and pesticides. In our previous studies, two endophytic endospore-forming bacteria identified as Bacillus altitudinis were isolated from roots of Ilex paraguariensis St. Hil. seedlings and selected for their plant growth-promoting (PGP) properties shown in vitro and in vivo. The purposes of this work were to assemble the genomes of B. altitudinis 19RS3 and T5S-T4, using different assemblers available for Windows and Linux, and to select the best assembly for each strain. Both genomes were also automatically annotated to detect PGP genes and compare sequences with other reported genomes. Library construction and draft genome sequencing were performed by Macrogen services. Raw reads were filtered using the Trimmomatic tool. Genomes were assembled using SPAdes, ABySS, Velvet, and SOAPdenovo2 assemblers for Linux, and Geneious and CLC Genomics Workbench assemblers for Windows. Assembly evaluation was done by the QUAST tool. The parameters evaluated were the number of contigs ≥ 500 bp and ≥ 1000 bp, the length of the longest contig, and the N50 value. For genome annotation, the PROKKA, RAST, and KAAS tools were used. The best assembly for both genomes was obtained using Velvet. The B. altitudinis 19RS3 genome was assembled into 15 contigs with an N50 value of 1,943,801 bp. The B. altitudinis T5S-T4 genome was assembled into 24 contigs with an N50 of 344,151 bp. Both genomes comprise several genes related to PGP mechanisms, such as those for nitrogen fixation, iron metabolism, phosphate metabolism, and auxin biosynthesis. The results obtained offer the basis for a better understanding of B. altitudinis 19RS3 and T5S-T4 and make them promising candidates for bioinoculant development.
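The evaluation parameters listed in this abstract (contig counts above a length threshold, longest contig, and N50) can be illustrated with a minimal sketch. It assumes the common definition of N50 as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function names and the toy contig lengths are hypothetical, and this is not a reimplementation of QUAST.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


def summarize(lengths, min_len=500):
    """Contig count, longest contig and N50 for contigs of at least `min_len` bp."""
    kept = [n for n in lengths if n >= min_len]
    return {"contigs": len(kept), "longest": max(kept), "n50": n50(kept)}


# Hypothetical contig lengths; not taken from the study.
toy_lengths = [800, 600, 1500, 400, 3000, 700]
print(summarize(toy_lengths, min_len=500))
```

Because N50 depends on which contigs are counted, the length threshold should be reported alongside the value whenever assemblies are compared.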
Affiliation(s)
- Iliana Julieta Cortese
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| | - María Lorena Castrillo
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| | - Andrea Liliana Onetto
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| | - Gustavo Ángel Bich
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| | - Pedro Darío Zapata
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| | - Margarita Ester Laczeski
- Laboratorio de Biotecnología Molecular, Instituto de Biotecnología Misiones “Dra. María Ebe Reca” (InBioMis), CONICET, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
- Cátedra de Bacteriología, Dpto. de Microbiología, Facultad de Ciencias Exactas, Químicas y Naturales/FCEQyN, Universidad Nacional de Misiones/UNaM, Posadas, Misiones, Argentina
| |
|
16
|
Brandies PA, Hogg CJ. Ten simple rules for getting started with command-line bioinformatics. PLoS Comput Biol 2021; 17:e1008645. [PMID: 33600404 PMCID: PMC7891784 DOI: 10.1371/journal.pcbi.1008645] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 12/25/2022] Open
Affiliation(s)
- Parice A. Brandies
- School of Life and Environmental Sciences, Faculty of Science, The University of Sydney, Sydney, New South Wales, Australia
| | - Carolyn J. Hogg
- School of Life and Environmental Sciences, Faculty of Science, The University of Sydney, Sydney, New South Wales, Australia
| |
|
17
|
Evolutionary Dynamics of the Pericentromeric Heterochromatin in Drosophila virilis and Related Species. Genes (Basel) 2021; 12:genes12020175. [PMID: 33513919 PMCID: PMC7911463 DOI: 10.3390/genes12020175] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/30/2020] [Revised: 01/21/2021] [Accepted: 01/23/2021] [Indexed: 12/19/2022] Open
Abstract
Pericentromeric heterochromatin in Drosophila generally consists of repetitive DNA, forming the environment associated with gene silencing. Despite the expanding knowledge of the impact of transposable elements (TEs) on the host genome, little is known about the evolution of pericentromeric heterochromatin, its structural composition, and age. During the evolution of the Drosophilidae, hundreds of genes have become embedded within pericentromeric regions yet retained activity. We investigated a pericentromeric heterochromatin fragment found in D. virilis and related species, describing the evolution of genes in this region and the age of TE invasion. Regardless of the heterochromatic environment, the amino acid composition of the genes is under purifying selection. However, the selective pressure affects parts of genes in varying degrees, resulting in expansion of gene introns due to TEs invasion. According to the divergence of TEs, the pericentromeric heterochromatin of the species of virilis group began to form more than 20 million years ago by invasions of retroelements, miniature inverted repeat transposable elements (MITEs), and Helitrons. Importantly, invasions into the heterochromatin continue to occur by TEs that fall under the scope of piRNA silencing. Thus, the pericentromeric heterochromatin, in spite of its ability to induce silencing, has the means for being dynamic, incorporating the regions of active transcription.
|
18
|
A Customizable Analysis Flow in Integrative Multi-Omics. Biomolecules 2020; 10:biom10121606. [PMID: 33260881 PMCID: PMC7760368 DOI: 10.3390/biom10121606] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/30/2020] [Revised: 11/20/2020] [Accepted: 11/23/2020] [Indexed: 12/21/2022] Open
Abstract
The number of researchers using multi-omics is growing. Though still expensive, every year it is cheaper to perform multi-omic studies, often exponentially so. In addition to its increasing accessibility, multi-omics reveals a view of systems biology to an unprecedented depth. Thus, multi-omics can be used to answer a broad range of biological questions in finer resolution than previous methods. We used six omic measurements—four nucleic acid (i.e., genomic, epigenomic, transcriptomics, and metagenomic) and two mass spectrometry (proteomics and metabolomics) based—to highlight an analysis workflow on this type of data, which is often vast. This workflow is not exhaustive of all the omic measurements or analysis methods, but it will provide an experienced or even a novice multi-omic researcher with the tools necessary to analyze their data. This review begins with analyzing a single ome and study design, and then synthesizes best practices in data integration techniques that include machine learning. Furthermore, we delineate methods to validate findings from multi-omic integration. Ultimately, multi-omic integration offers a window into the complexity of molecular interactions and a comprehensive view of systems biology.
|
19
|
Castro CJ, Marine RL, Ramos E, Ng TFF. The effect of variant interference on de novo assembly for viral deep sequencing. BMC Genomics 2020; 21:421. [PMID: 32571214 PMCID: PMC7306937 DOI: 10.1186/s12864-020-06801-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/02/2020] [Accepted: 06/02/2020] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND: Viruses have high mutation rates and generally exist as a mixture of variants in biological samples. Next-generation sequencing (NGS) approaches have surpassed Sanger for generating long viral sequences, yet how variants affect NGS de novo assembly remains largely unexplored. RESULTS: Our results from > 15,000 simulated experiments showed that the presence of variants can turn an assembly of one genome into tens to thousands of contigs. This "variant interference" (VI) is highly consistent and reproducible by ten commonly used de novo assemblers, and occurs over a range of genome length, read length, and GC content. The main driver of VI is pairwise identities between viral variants. These findings were further supported by in silico simulations, where selective removal of minor variant reads from clinical datasets allows the "rescue" of full viral genomes from fragmented contigs. CONCLUSIONS: These results call for careful interpretation of contigs and contig numbers from de novo assembly in viral deep sequencing.
Affiliation(s)
- Christina J Castro
- Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, 30329, USA.,Oak Ridge Institute for Science and Education, Oak Ridge, TN, USA
| | - Rachel L Marine
- Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, 30329, USA
| | - Edward Ramos
- General Dynamics Information Technology, Inc., contracting agency to the Office of Informatics, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Falls Church, VA, USA
| | - Terry Fei Fan Ng
- Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, 30329, USA.
| |
|
20
|
instaGRAAL: chromosome-level quality scaffolding of genomes using a proximity ligation-based scaffolder. Genome Biol 2020; 21:148. [PMID: 32552806 PMCID: PMC7386250 DOI: 10.1186/s13059-020-02041-z] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 07/31/2019] [Accepted: 05/11/2020] [Indexed: 02/06/2023] Open
Abstract
Hi-C exploits contact frequencies between pairs of loci to bridge and order contigs during genome assembly, resulting in chromosome-level assemblies. Because few robust programs are available for this type of data, we developed instaGRAAL, a complete overhaul of the GRAAL program, which has adapted the latter to allow efficient assembly of large genomes. instaGRAAL features a number of improvements over GRAAL, including a modular correction approach that optionally integrates independent data. We validate the program using data for two brown algae, and human, to generate near-complete assemblies with minimal human intervention.
|
21
|
Pasquali F, Do Valle I, Palma F, Remondini D, Manfreda G, Castellani G, Hendriksen RS, De Cesare A. Application of different DNA extraction procedures, library preparation protocols and sequencing platforms: impact on sequencing results. Heliyon 2019; 5:e02745. [PMID: 31720479 PMCID: PMC6838873 DOI: 10.1016/j.heliyon.2019.e02745] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/09/2018] [Revised: 04/01/2019] [Accepted: 10/25/2019] [Indexed: 01/22/2023] Open
Abstract
In this study, three DNA extraction procedures, two library preparation protocols and two sequencing platforms were applied to analyse six bacterial cultures and their corresponding DNA obtained as part of a proficiency test. The impact of each variable on sequencing results was assessed using the following parameters: read quality, assembly and alignment statistics; number of single nucleotide polymorphisms (SNPs), detected by applying assembly- and alignment-based strategies; antimicrobial resistance genes (ARGs), identified on de novo assemblies of all sequenced genomes. The investigated nucleic acid extraction procedures, library preparation kits and sequencing platforms do not significantly affect de novo assembly statistics or the number of SNPs and ARGs. The only exception was observed for two duplicates, which were associated with one PCR-based library preparation kit. Results from this comparative study can support researchers in choosing among the available pre-sequencing and sequencing options, and might suggest further comparisons to be performed.
Affiliation(s)
- F Pasquali
- Department of Food and Agricultural Sciences, Alma Mater Studiorum-University of Bologna, via del Florio 2, Ozzano dell'Emilia, 40064 Italy
| | - I Do Valle
- Department of Physics, Northeastern University, 360 Huntington Avenue, Boston, MA, 02115-5000, USA
| | - F Palma
- Department of Food and Agricultural Sciences, Alma Mater Studiorum-University of Bologna, via del Florio 2, Ozzano dell'Emilia, 40064 Italy
| | - D Remondini
- Department of Physics and Astronomy, Alma Mater Studiorum-University of Bologna, viale Berti Pichat 6/2, 40127, Bologna, Italy
| | - G Manfreda
- Department of Food and Agricultural Sciences, Alma Mater Studiorum-University of Bologna, via del Florio 2, Ozzano dell'Emilia, 40064 Italy
| | - G Castellani
- Department of Physics and Astronomy, Alma Mater Studiorum-University of Bologna, viale Berti Pichat 6/2, 40127, Bologna, Italy
| | - R S Hendriksen
- Technical University of Denmark, Kemitorvet, Kgs. Lyngby, 2800, Denmark
| | - A De Cesare
- Department of Food and Agricultural Sciences, Alma Mater Studiorum-University of Bologna, via del Florio 2, Ozzano dell'Emilia, 40064 Italy
| |
|
22
|
Shajii A, Numanagić I, Baghdadi R, Berger B, Amarasinghe S. Seq: A High-Performance Language for Bioinformatics. PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES 2019; 3:125. [PMID: 35775031 PMCID: PMC9241673 DOI: 10.1145/3360551] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 06/15/2023]
Abstract
The scope and scale of biological data are increasing at an exponential rate, as technologies like next-generation sequencing are becoming radically cheaper and more prevalent. Over the last two decades, the cost of sequencing a genome has dropped from $100 million to nearly $100 (a factor of over 10^6), and the amount of data to be analyzed has increased proportionally. Yet, as Moore's Law continues to slow, computational biologists can no longer rely on computing hardware to compensate for the ever-increasing size of biological datasets. In a field where many researchers are primarily focused on biological analysis over computational optimization, the unfortunate solution to this problem is often to simply buy larger and faster machines. Here, we introduce Seq, the first language tailored specifically to bioinformatics, which marries the ease and productivity of Python with C-like performance. Seq starts with a subset of Python (and is in many cases a drop-in replacement), yet also incorporates novel bioinformatics- and computational genomics-oriented data types, language constructs and optimizations. Seq enables users to write high-level, Pythonic code without having to worry about low-level or domain-specific optimizations, and allows for the seamless expression of the algorithms, idioms and patterns found in many genomics or bioinformatics applications. We evaluated Seq on several standard computational genomics tasks like reverse complementation, k-mer manipulation, sequence pattern matching and large genomic index queries. On equivalent CPython code, Seq attains a performance improvement of up to two orders of magnitude, and a 160× improvement once domain-specific language features and optimizations are used. With parallelism, we demonstrate up to a 650× improvement. Compared to optimized C++ code, which is already difficult for most biologists to produce, Seq frequently attains up to a 2× improvement, and with shorter, cleaner code. Thus, Seq opens the door to an age of democratization of highly-optimized bioinformatics software.
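Two of the benchmark tasks named in this abstract, reverse complementation and k-mer manipulation, are easy to state in ordinary Python. The sketch below is plain CPython written for illustration; it is not Seq code and does not use any Seq-specific types or syntax. The function names are hypothetical, and only unambiguous A/C/G/T bases are handled.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Yield every overlapping substring of length k, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def canonical_kmers(seq, k):
    """Map each k-mer to the lexicographically smaller of itself and its
    reverse complement, so both strands produce the same key."""
    return [min(kmer, reverse_complement(kmer)) for kmer in kmers(seq, k)]


print(reverse_complement("ATGC"))
print(list(kmers("ATGCA", 3)))
print(canonical_kmers("ATGCA", 3))
```

Loops of this kind run once per base or per k-mer over very large inputs, which is why the abstract treats them as representative workloads for measuring speedups.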
Affiliation(s)
- Ariya Shajii
- MIT CSAIL, 77 Massachusetts Ave, Cambridge, MA, 02139, USA
| | | | | | - Bonnie Berger
- MIT CSAIL, 77 Massachusetts Ave, Cambridge, MA, 02139, USA
| | | |
|
23
|
Mostajo NF, Lataretu M, Krautwurst S, Mock F, Desirò D, Lamkiewicz K, Collatz M, Schoen A, Weber F, Marz M, Hölzer M. A comprehensive annotation and differential expression analysis of short and long non-coding RNAs in 16 bat genomes. NAR Genom Bioinform 2019; 2:lqz006. [PMID: 32289119 PMCID: PMC7108008 DOI: 10.1093/nargab/lqz006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/03/2019] [Revised: 08/21/2019] [Accepted: 09/10/2019] [Indexed: 12/25/2022] Open
Abstract
Although bats are increasingly becoming the focus of scientific studies due to their unique properties, these exceptional animals are still among the least studied mammals. Assembly quality and completeness of bat genomes vary a lot and especially non-coding RNA (ncRNA) annotations are incomplete or simply missing. Accordingly, standard bioinformatics pipelines for gene expression analysis often ignore ncRNAs such as microRNAs or long antisense RNAs. The main cause of this problem is the use of incomplete genome annotations. We present a complete screening for ncRNAs within 16 bat genomes. NcRNAs affect a remarkable variety of vital biological functions, including gene expression regulation, RNA processing, RNA interference and, as recently described, regulatory processes in viral infections. Within all investigated bat assemblies, we annotated 667 ncRNA families including 162 snoRNAs and 193 miRNAs as well as rRNAs, tRNAs, several snRNAs and lncRNAs, and other structural ncRNA elements. We validated our ncRNA candidates by six RNA-Seq data sets and show significant expression patterns that have never been described before in a bat species on such a large scale. Our annotations will be usable as a resource (rna.uni-jena.de/supplements/bats) for deeper studying of bat evolution, ncRNAs repertoire, gene expression and regulation, ecology and important host–virus interactions.
Affiliation(s)
- Nelly F Mostajo
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,Institute of Virology, Philipps-University Marburg, Hans-Meerwein-Straße 2, 35043 Marburg, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
| | - Marie Lataretu
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
| | - Sebastian Krautwurst
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
| | - Florian Mock
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
| | - Daniel Desirò
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
| | - Kevin Lamkiewicz
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
| | - Maximilian Collatz
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
| | - Andreas Schoen
- Institute for Virology, FB10-Veterinary Medicine, Justus-Liebig University, 35392 Gießen, Germany.,German Center for Infection Research (DZIF), partner sites 35043 Marburg and 35392 Gießen, Germany
| | - Friedemann Weber
- Institute of Virology, Philipps-University Marburg, Hans-Meerwein-Straße 2, 35043 Marburg, Germany.,Institute for Virology, FB10-Veterinary Medicine, Justus-Liebig University, 35392 Gießen, Germany.,German Center for Infection Research (DZIF), partner sites 35043 Marburg and 35392 Gießen, Germany
| | - Manja Marz
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,FLI Leibniz Institute for Age Research, Beutenbergstraße 11, 07745 Jena, Germany
| | - Martin Hölzer
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,European Virus Bioinformatics Center, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
| |
|
24
|
GAAP: A Genome Assembly + Annotation Pipeline. BIOMED RESEARCH INTERNATIONAL 2019; 2019:4767354. [PMID: 31346518 PMCID: PMC6617929 DOI: 10.1155/2019/4767354] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/24/2019] [Revised: 05/20/2019] [Accepted: 05/26/2019] [Indexed: 12/24/2022]
Abstract
Genomic analysis begins with de novo assembly of short-read fragments in order to reconstruct full-length base sequences without exploiting a reference genome sequence. Then, in the annotation step, gene locations are identified within the base sequences, and the structures and functions of these genes are determined. Recently, a wide range of powerful tools have been developed and published for whole-genome analysis, enabling even individual researchers in small laboratories to perform whole-genome analyses on their objects of interest. However, these analytical tools are generally complex and use diverse algorithms, parameter setting methods, and input formats; thus, it remains difficult for individual researchers to select, utilize, and combine these tools to obtain their final results. To resolve these issues, we have developed a genome analysis pipeline (GAAP) for semiautomated, iterative, and high-throughput analysis of whole-genome data. This pipeline is designed to perform read correction, de novo genome (transcriptome) assembly, gene prediction, and functional annotation using a range of proven tools and databases. We aim to assist non-IT researchers by describing each stage of analysis in detail and discussing current approaches. We also provide practical advice on how to access and use the bioinformatics tools and databases and how to implement the provided suggestions. Whole-genome analysis of Toxocara canis is used as case study to show intermediate results at each stage, demonstrating the practicality of the proposed method.
|
25
|
Vanderlinde T, Dupim EG, Nazario-Yepiz NO, Carvalho AB. An Improved Genome Assembly for Drosophila navojoa, the Basal Species in the mojavensis Cluster. J Hered 2019; 110:118-123. [PMID: 30423125 PMCID: PMC6321958 DOI: 10.1093/jhered/esy059] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/01/2018] [Accepted: 11/12/2018] [Indexed: 12/30/2022] Open
Abstract
Three North American cactophilic Drosophila species, D. mojavensis, D. arizonae, and D. navojoa, are of considerable evolutionary interest owing to the shift from breeding in Opuntia cacti to columnar species. The 3 species form the "mojavensis cluster" of Drosophila. The genome of D. mojavensis was sequenced in 2007 and the genomes of D. navojoa and D. arizonae were sequenced together in 2016 using the same technology (Illumina) and assembly software (AllPaths-LG). Yet, unfortunately, the D. navojoa genome was considerably more fragmented and incomplete than those of its sister species, rendering it less useful for evolutionary genetic studies. The D. navojoa read dataset does not fully meet the strict insert size required by the assembler used (AllPaths-LG) and this incompatibility might explain its assembly problems. Accordingly, when we re-assembled the genome of D. navojoa with the SPAdes assembler, which does not have the strict AllPaths-LG requirements, we obtained a substantial improvement in all quality indicators such as N50 (from 84 kb to 389 kb) and BUSCO coverage (from 77% to 97%). Here we share a new, improved reference assembly for the D. navojoa genome, along with an RNA-seq transcriptome. Given the basal relationship of the Opuntia breeding D. navojoa to the columnar breeding D. arizonae and D. mojavensis, the improved assembly and annotation will allow researchers to address a range of questions associated with the genomics of host shifts, chromosomal rearrangements and speciation in this group.
Affiliation(s)
- Thyago Vanderlinde
- Departamento de Genética, Instituto de Biologia, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
- Eduardo Guimarães Dupim
- Departamento de Genética, Instituto de Biologia, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
- Nestor O Nazario-Yepiz
- Laboratorio Nacional de la Genómica para la Biodiversidad, Centro de Investigación y Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV), Irapuato, Guanajuato, México
- Antonio Bernardo Carvalho
- Departamento de Genética, Instituto de Biologia, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
|