1
Vello F, Filippini F, Righetto I. Bioinformatics Goes Viral: I. Databases, Phylogenetics and Phylodynamics Tools for Boosting Virus Research. Viruses 2024; 16:1425. [PMID: 39339901 PMCID: PMC11437414 DOI: 10.3390/v16091425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Revised: 08/21/2024] [Accepted: 09/03/2024] [Indexed: 09/30/2024] Open
Abstract
Computer-aided analysis of proteins or nucleic acids seems like a matter of course nowadays; however, the history of Bioinformatics and Computational Biology is quite recent. The advent of high-throughput sequencing has led to the production of "big data", which has also affected the field of virology. The collaboration between the communities of bioinformaticians and virologists started a few decades ago and was strongly enhanced by the recent SARS-CoV-2 pandemic. In this article, the first in a series on how bioinformatics can enhance virus research, we show that highly useful information is retrievable from selected general and dedicated databases. Indeed, an enormous amount of information, both in terms of nucleotide/protein sequences and their annotation, is deposited in the general databases of international organisations participating in the International Nucleotide Sequence Database Collaboration (INSDC). However, more and more virus-specific databases have been established and are progressively enriched with the contents and features reported in this article. Since viruses are obligate intracellular parasites, a special focus is given to host-pathogen protein-protein interaction databases. Finally, we illustrate several phylogenetic and phylodynamic tools, combining information on algorithms and features with practical information on how to use them and case studies that validate their usefulness. Databases and tools for functional inference will be covered in the next article of this series: Bioinformatics goes viral: II. Sequence-based and structure-based functional analyses for boosting virus research.
Affiliation(s)
- Francesco Filippini
- Synthetic Biology and Biotechnology Unit, Department of Biology, University of Padua, 35131 Padua, Italy; (F.V.); (I.R.)
2
Tournayre J, Polonais V, Wawrzyniak I, Akossi RF, Parisot N, Lerat E, Delbac F, Souvignet P, Reichstadt M, Peyretaillade E. MicroAnnot: A Dedicated Workflow for Accurate Microsporidian Genome Annotation. Int J Mol Sci 2024; 25:880. [PMID: 38255958 PMCID: PMC10815200 DOI: 10.3390/ijms25020880] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 12/29/2023] [Accepted: 01/04/2024] [Indexed: 01/24/2024] Open
Abstract
With nearly 1700 species, Microsporidia represent a group of obligate intracellular eukaryotes with veterinary, economic and medical impacts. To help understand the biological functions of these microorganisms, complete genome sequencing is routinely used. Nevertheless, the proper prediction of their gene catalogue is challenging due to their taxon-specific evolutionary features. As innovative genome annotation strategies are needed to obtain a representative snapshot of the overall lifestyle of these parasites, the MicroAnnot tool, a dedicated workflow for microsporidian sequence annotation using data from curated databases of accurately annotated microsporidian genes, has been developed. Furthermore, specific modules have been implemented to perform small gene (<300 bp) and transposable element identification. Finally, functional annotation was performed using the signature-based InterProScan software. MicroAnnot's accuracy has been verified by the re-annotation of four microsporidian genomes for which structural annotation had previously been validated. With its comparative approach and transcriptional signal identification method, MicroAnnot provides an accurate prediction of translation initiation sites, an efficient identification of transposable elements, as well as high specificity and sensitivity for microsporidian genes, including those under 300 bp.
Affiliation(s)
- Jérémy Tournayre
- INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, 63122 Saint-Genès-Champanelle, France; (J.T.); (P.S.); (M.R.)
- Valérie Polonais
- LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France; (V.P.); (I.W.); (R.F.A.); (F.D.)
- Ivan Wawrzyniak
- LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France; (V.P.); (I.W.); (R.F.A.); (F.D.)
- Reginald Florian Akossi
- LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France; (V.P.); (I.W.); (R.F.A.); (F.D.)
- Nicolas Parisot
- UMR 203, BF2I, INRAE, INSA Lyon, Université de Lyon, 69621 Villeurbanne, France
- Emmanuelle Lerat
- VAS, CNRS, UMR5558, LBBE, Université Claude Bernard Lyon 1, 69622 Villeurbanne, France
- Frédéric Delbac
- LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France; (V.P.); (I.W.); (R.F.A.); (F.D.)
- Pierre Souvignet
- INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, 63122 Saint-Genès-Champanelle, France; (J.T.); (P.S.); (M.R.)
- Matthieu Reichstadt
- INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, 63122 Saint-Genès-Champanelle, France; (J.T.); (P.S.); (M.R.)
- Eric Peyretaillade
- LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France; (V.P.); (I.W.); (R.F.A.); (F.D.)
3
Brůna T, Li H, Guhlin J, Honsel D, Herbold S, Stanke M, Nenasheva N, Ebel M, Gabriel L, Hoff KJ. Galba: genome annotation with miniprot and AUGUSTUS. BMC Bioinformatics 2023; 24:327. [PMID: 37653395 PMCID: PMC10472564 DOI: 10.1186/s12859-023-05449-z] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 04/18/2023] [Accepted: 08/21/2023] [Indexed: 09/02/2023] Open
Abstract
BACKGROUND: The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes.
RESULTS: Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments.
CONCLUSIONS: Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.
Affiliation(s)
- Tomáš Brůna
- U.S. Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, 02215, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, 02215, MA, USA
- Joseph Guhlin
- Genomics Aotearoa and Laboratory for Evolution and Development, Department of Biochemistry, University of Otago, Dunedin, 9016, New Zealand
- Daniel Honsel
- Institute of Computer Science, University of Göttingen, 37077, Göttingen, Germany
- Steffen Herbold
- Faculty for Computer Science and Mathematics, University of Passau, 94032, Passau, Germany
- Mario Stanke
- Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489, Greifswald, Germany
- Natalia Nenasheva
- Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489, Greifswald, Germany
- Matthis Ebel
- Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489, Greifswald, Germany
- Lars Gabriel
- Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489, Greifswald, Germany
- Katharina J Hoff
- Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489, Greifswald, Germany
4
Brůna T, Li H, Guhlin J, Honsel D, Herbold S, Stanke M, Nenasheva N, Ebel M, Gabriel L, Hoff KJ. GALBA: Genome Annotation with Miniprot and AUGUSTUS. bioRxiv: the preprint server for biology 2023:2023.04.10.536199. [PMID: 37090650 PMCID: PMC10120627 DOI: 10.1101/2023.04.10.536199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/25/2023]
Abstract
The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a previously unannotated land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.
Affiliation(s)
- Tomáš Brůna
- US Department of Energy Joint Genome Institute, Berkeley, CA 94720, USA
- Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA & Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Joseph Guhlin
- Genomics Aotearoa and Laboratory for Evolution and Development, Department of Biochemistry, University of Otago, PO Box 56, Dunedin 9016, New Zealand
- Daniel Honsel
- Institute of Computer Science, University of Göttingen, 37077 Göttingen, Germany
- Steffen Herbold
- Faculty for Computer Science and Mathematics, University of Passau, 94032 Passau, Germany
- Mario Stanke
- Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Natalia Nenasheva
- Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Matthis Ebel
- Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Lars Gabriel
- Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Katharina J. Hoff
- Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
5
Lagarrigue S, Lorthiois M, Degalez F, Gilot D, Derrien T. LncRNAs in domesticated animals: from dog to livestock species. Mamm Genome 2021; 33:248-270. [PMID: 34773482 PMCID: PMC9114084 DOI: 10.1007/s00335-021-09928-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 05/28/2021] [Accepted: 10/19/2021] [Indexed: 11/29/2022]
Abstract
Animal genomes are pervasively transcribed into multiple RNA molecules, of which many will not be translated into proteins. One major component of this transcribed non-coding genome is the long non-coding RNAs (lncRNAs), which are defined as transcripts longer than 200 nucleotides with low coding-potential capabilities. Domestic animals constitute a unique resource for studying the genetic and epigenetic basis of phenotypic variations involving protein-coding and non-coding RNAs, such as lncRNAs. This review presents the current knowledge regarding transcriptome-based catalogues of lncRNAs in major domesticated animals (pets and livestock species), covering a broad phylogenetic scale (from dogs to chicken), and in comparison with human and mouse lncRNA catalogues. Furthermore, we describe different methods to extract known or discover novel lncRNAs and explore comparative genomics approaches to strengthen the annotation of lncRNAs. We then detail different strategies contributing to a better understanding of lncRNA functions, from genetic studies such as GWAS to molecular biology experiments and give some case examples in domestic animals. Finally, we discuss the limitations of current lncRNA annotations and suggest research directions to improve them and their functional characterisation.
Affiliation(s)
- Matthias Lorthiois
- Univ Rennes, CNRS, IGDR (Institut de Génétique et Développement de Rennes) - UMR 6290, 2 av Prof Leon Bernard, F-35000, Rennes, France
- Fabien Degalez
- INRAE, INSTITUT AGRO, PEGASE UMR 1348, 35590, Saint-Gilles, France
- David Gilot
- CLCC Eugène Marquis, INSERM, Université Rennes, UMR_S 1242, 35000, Rennes, France
- Thomas Derrien
- Univ Rennes, CNRS, IGDR (Institut de Génétique et Développement de Rennes) - UMR 6290, 2 av Prof Leon Bernard, F-35000, Rennes, France
6
Jung H, Ventura T, Chung JS, Kim WJ, Nam BH, Kong HJ, Kim YO, Jeon MS, Eyun SI. Twelve quick steps for genome assembly and annotation in the classroom. PLoS Comput Biol 2020; 16:e1008325. [PMID: 33180771 PMCID: PMC7660529 DOI: 10.1371/journal.pcbi.1008325] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Indexed: 12/13/2022] Open
Abstract
Eukaryotic genome sequencing and de novo assembly, once the exclusive domain of well-funded international consortia, have become increasingly affordable, thus fitting the budgets of individual research groups. Third-generation long-read DNA sequencing technologies are increasingly used, providing extensive genomic toolkits that were once reserved for a few select model organisms. Generating high-quality genome assemblies and annotations for many aquatic species still presents significant challenges due to their large genome sizes, complexity, and high chromosome numbers. Indeed, selecting the most appropriate sequencing and software platforms and annotation pipelines for a new genome project can be daunting because tools often only work in limited contexts. In genomics, generating a high-quality genome assembly/annotation has become an indispensable tool for better understanding the biology of any species. Herein, we state 12 steps to help researchers get started in genome projects by presenting guidelines that are broadly applicable (to any species), sustainable over time, and cover all aspects of genome assembly and annotation projects from start to finish. We review some commonly used approaches, including practical methods to extract high-quality DNA and choices for the best sequencing platforms and library preparations. In addition, we discuss the range of potential bioinformatics pipelines, including structural and functional annotations (e.g., transposable elements and repetitive sequences). This paper also includes information on how to build a wide community for a genome project, the importance of data management, and how to make the data and results Findable, Accessible, Interoperable, and Reusable (FAIR) by submitting them to a public repository and sharing them with the research community.
Affiliation(s)
- Hyungtaek Jung
- School of Biological Sciences, The University of Queensland, St Lucia, Queensland, Australia
- Centre for Agriculture and Bioeconomy, Queensland University of Technology, Brisbane, Queensland, Australia
- Tomer Ventura
- Genecology Research Centre, School of Science and Engineering, University of the Sunshine Coast, Sippy Downs, Queensland, Australia
- J. Sook Chung
- Institute of Marine and Environmental Technology, University of Maryland Center for Environmental Science, Baltimore, Maryland, United States of America
- Woo-Jin Kim
- Genetics and Breeding Research Center, National Institute of Fisheries Science, Geoje, Korea
- Bo-Hye Nam
- Biotechnology Research Division, National Institute of Fisheries Science, Busan, Korea
- Hee Jeong Kong
- Biotechnology Research Division, National Institute of Fisheries Science, Busan, Korea
- Young-Ok Kim
- Biotechnology Research Division, National Institute of Fisheries Science, Busan, Korea
- Min-Seung Jeon
- Department of Life Science, Chung-Ang University, Seoul, Korea
- Seong-il Eyun
- Department of Life Science, Chung-Ang University, Seoul, Korea
7
Song B, Sang Q, Wang H, Pei H, Gan X, Wang F. Complement Genome Annotation Lift Over Using a Weighted Sequence Alignment Strategy. Front Genet 2019; 10:1046. [PMID: 31850053 PMCID: PMC6902276 DOI: 10.3389/fgene.2019.01046] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 06/28/2019] [Accepted: 09/30/2019] [Indexed: 12/14/2022] Open
Abstract
With the broad application of high-throughput sequencing, more whole-genome resequencing data and de novo assemblies of natural populations are becoming available. For a particular species, in general, only the reference genome is well established and annotated. Computational tools based on sequence alignment have been developed to investigate the gene models of individuals belonging to the same or closely related species. During this process, inconsistent alignment often obscures genome annotation lift over and leads to improper functional impact prediction for a genomic variant, especially in plant species. Here, we propose the zebraic striped dynamic programming algorithm, which assigns different weights to genetic features to refine genome annotation lift over. Testing of the zebraic striped dynamic programming algorithm on both plant and animal genomic data showed that it complements the standard sequence alignment approach for highly diverse individuals. Using the lifted-over genome annotation as anchors, a base-pair resolution genome-wide sequence alignment and variant calling pipeline for de novo assemblies has been implemented in the GEAN software. GEAN can be used to compare haplotype diversity, refine the functional annotation of genetic variants, annotate de novo assembled genome sequences, detect homologous syntenic blocks, improve the quantification of gene expression levels using RNA-seq data, and unify genomic variants for population genetic analysis. We expect that GEAN will be a standard tool for the coming of age of de novo assembly population genetics.
Affiliation(s)
- Baoxing Song
- The Department of Life Science, Qiannan Normal College for Nationalities, Duyun, China
- Department of Comparative Development and Genetics, Max Planck Institute for Plant Breeding Research, Köln, Germany
- Institute for Genomic Diversity, Cornell University, Ithaca, NY, United States
- Qing Sang
- Department of Plant Developmental Biology, Max Planck Institute for Plant Breeding Research, Köln, Germany
- Hai Wang
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, China
- Huimin Pei
- The Department of Life Science, Qiannan Normal College for Nationalities, Duyun, China
- XiangChao Gan
- Department of Comparative Development and Genetics, Max Planck Institute for Plant Breeding Research, Köln, Germany
- Fen Wang
- The Department of Life Science, Qiannan Normal College for Nationalities, Duyun, China