1
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA.
2
Ludwig J, Mrázek J. OrthoRefine: automated enhancement of prior ortholog identification via synteny. BMC Bioinformatics 2024; 25:163. [PMID: 38664637 PMCID: PMC11044567 DOI: 10.1186/s12859-024-05786-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2023] [Accepted: 04/15/2024] [Indexed: 04/29/2024] [Open access]
Abstract
BACKGROUND Identifying orthologs continues to be an early and imperative step in genome analysis but remains a challenging problem. While synteny (conservation of gene order) has previously been used independently and in combination with other methods to identify orthologs, applying synteny in ortholog identification has yet to be automated in a user-friendly manner. This desire for automation and ease-of-use led us to develop OrthoRefine, a standalone program that uses synteny to refine ortholog identification. RESULTS We developed OrthoRefine to improve the detection of orthologous genes by implementing a look-around window approach to detect synteny. We tested OrthoRefine in tandem with OrthoFinder, one of the most widely used programs for ortholog identification in recent years. We evaluated improvements provided by OrthoRefine on several bacterial datasets and one eukaryotic dataset. OrthoRefine efficiently eliminates paralogs from orthologous groups detected by OrthoFinder. Using synteny increased specificity and functional ortholog identification; additionally, analysis of BLAST e-value, phylogenetics, and operon occurrence further supported using synteny for ortholog identification. A comparison of several window sizes suggested that smaller window sizes (eight genes) were generally the most suitable for identifying orthologs via synteny. However, larger windows (30 genes) performed better in datasets containing less closely related genomes. A typical run of OrthoRefine with ~10 bacterial genomes can be completed in a few minutes on a regular desktop PC. CONCLUSION OrthoRefine is a simple-to-use, standalone tool that automates the application of synteny to improve ortholog detection. OrthoRefine is particularly efficient in eliminating paralogs from orthologous groups delineated by standard methods.
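The abstract does not spell out how the look-around window is scored, so the following is only a minimal sketch of the general idea and not OrthoRefine's implementation. It assumes each genome is an ordered list of gene identifiers and that a prior tool has already assigned genes to orthologous groups. The names `window`, `synteny_ratio`, `group_of` and the toy gene identifiers are hypothetical.

```python
from typing import Dict, List


def window(genes: List[str], idx: int, size: int) -> List[str]:
    """Neighbours of the gene at idx: up to size // 2 genes on each side."""
    half = size // 2
    lo = max(0, idx - half)
    hi = min(len(genes), idx + half + 1)
    return genes[lo:idx] + genes[idx + 1:hi]


def synteny_ratio(genome_a: List[str], genome_b: List[str],
                  gene_a: str, gene_b: str,
                  group_of: Dict[str, str], size: int = 8) -> float:
    """Fraction of gene_a's neighbours whose orthologous group also
    occurs among gene_b's neighbours (illustrative score only)."""
    neigh_a = window(genome_a, genome_a.index(gene_a), size)
    neigh_b = window(genome_b, genome_b.index(gene_b), size)
    groups_b = {group_of[g] for g in neigh_b if g in group_of}
    matched = sum(1 for g in neigh_a
                  if g in group_of and group_of[g] in groups_b)
    return matched / len(neigh_a) if neigh_a else 0.0


# Toy data: b3 sits in a conserved neighbourhood, b3p is a paralog elsewhere.
genome_a = ["a1", "a2", "a3", "a4", "a5"]
genome_b = ["b1", "b2", "b3", "b4", "b5", "x1", "b3p", "x2"]
group_of = {"a1": "OG1", "b1": "OG1", "a2": "OG2", "b2": "OG2",
            "a3": "OG3", "b3": "OG3", "b3p": "OG3",
            "a4": "OG4", "b4": "OG4", "a5": "OG5", "b5": "OG5"}

ratio_ortholog = synteny_ratio(genome_a, genome_b, "a3", "b3", group_of, size=4)
ratio_paralog = synteny_ratio(genome_a, genome_b, "a3", "b3p", group_of, size=4)
```

In this toy case the candidate with the conserved neighbourhood scores higher than the paralog, which is the kind of signal a synteny-based refinement step could use to drop paralogs from a group. Any threshold on the ratio would be a further assumption.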
Affiliation(s)
- J Ludwig
- Institute of Bioinformatics, The University of Georgia, Athens, GA, 30602, USA.
- J Mrázek
- Department of Microbiology and Institute of Bioinformatics, The University of Georgia, Athens, GA, 30602, USA
3
Choquet M, Lenner F, Cocco A, Toullec G, Corre E, Toullec JY, Wallberg A. Comparative Population Transcriptomics Provide New Insight into the Evolutionary History and Adaptive Potential of World Ocean Krill. Mol Biol Evol 2023; 40:msad225. [PMID: 37816123 PMCID: PMC10642690 DOI: 10.1093/molbev/msad225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Revised: 08/31/2023] [Accepted: 09/25/2023] [Indexed: 10/12/2023] [Open access]
Abstract
Genetic variation is instrumental for adaptation to changing environments but it is unclear how it is structured and contributes to adaptation in pelagic species lacking clear barriers to gene flow. Here, we applied comparative genomics to extensive transcriptome datasets from 20 krill species collected across the Atlantic, Indian, Pacific, and Southern Oceans. We compared genetic variation both within and between species to elucidate their evolutionary history and genomic bases of adaptation. We resolved phylogenetic interrelationships and uncovered genomic evidence to elevate the cryptic Euphausia similis var. armata into species. Levels of genetic variation and rates of adaptive protein evolution vary widely. Species endemic to the cold Southern Ocean, such as the Antarctic krill Euphausia superba, showed less genetic variation and lower evolutionary rates than other species. This could suggest a low adaptive potential to rapid climate change. We uncovered hundreds of candidate genes with signatures of adaptive evolution among Antarctic Euphausia but did not observe strong evidence of adaptive convergence with the predominantly Arctic Thysanoessa. We instead identified candidates for cold-adaptation that have also been detected in Antarctic fish, including genes that govern thermal reception such as TrpA1. Our results suggest parallel genetic responses to similar selection pressures across Antarctic taxa and provide new insights into the adaptive potential of important zooplankton already affected by climate change.
Affiliation(s)
- Marvin Choquet
- Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
- Natural History Museum, University of Oslo, Oslo, Norway
- Felix Lenner
- Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
- Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Arianna Cocco
- Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
- Gaëlle Toullec
- Laboratory for Biological Geochemistry, School of Architecture, Civil and Environmental Engineering, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Erwan Corre
- CNRS, Sorbonne Université, FR 2424, ABiMS Platform, Station Biologique de Roscoff, Roscoff, France
- Jean-Yves Toullec
- CNRS, UMR 7144, AD2M, Sorbonne Université, Station Biologique de Roscoff, Roscoff, France
- Andreas Wallberg
- Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
4
Dosch J, Bergmann H, Tran V, Ebersberger I. FAS: assessing the similarity between proteins using multi-layered feature architectures. Bioinformatics 2023; 39:btad226. [PMID: 37084276 PMCID: PMC10185405 DOI: 10.1093/bioinformatics/btad226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Revised: 02/23/2023] [Accepted: 04/13/2023] [Indexed: 04/23/2023] [Open access]
Abstract
MOTIVATION Protein sequence comparison is a fundamental element in the bioinformatics toolkit. When sequences are annotated with features such as functional domains, transmembrane domains, low complexity regions or secondary structure elements, the resulting feature architectures allow better informed comparisons. However, many existing schemes for scoring architecture similarities cannot cope with features arising from multiple annotation sources. Those that do fall short in the resolution of overlapping and redundant feature annotations. RESULTS Here, we introduce FAS, a scoring method that integrates features from multiple annotation sources in a directed acyclic architecture graph. Redundancies are resolved as part of the architecture comparison by finding the paths through the graphs that maximize the pair-wise architecture similarity. In a large-scale evaluation on more than 10 000 human-yeast ortholog pairs, architecture similarities assessed with FAS are consistently more plausible than those obtained using e-values to resolve overlaps or leaving overlaps unresolved. Three case studies demonstrate the utility of FAS on architecture comparison tasks: benchmarking of orthology assignment software, identification of functionally diverged orthologs, and diagnosing protein architecture changes stemming from faulty gene predictions. With the help of FAS, feature architecture comparisons can now be routinely integrated into these and many other applications. AVAILABILITY AND IMPLEMENTATION FAS is available as a Python package: https://pypi.org/project/greedyFAS/.
Affiliation(s)
- Julian Dosch
- Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
- Holger Bergmann
- Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
- Vinh Tran
- Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
- Ingo Ebersberger
- Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
- Senckenberg Biodiversity and Climate Research Centre (S-BIKF), Frankfurt, 60325, Germany
- LOEWE Centre for Translational Biodiversity Genomics (TBG), Frankfurt, 60325, Germany
5
Watanabe T, Kure A, Horiike T. OrthoPhy: A Program to Construct Ortholog Data Sets Using Taxonomic Information. Genome Biol Evol 2023; 15:7044703. [PMID: 36799928 PMCID: PMC9991595 DOI: 10.1093/gbe/evad026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Revised: 01/30/2023] [Accepted: 02/13/2023] [Indexed: 02/18/2023] [Open access]
Abstract
Species phylogenetic trees represent the evolutionary processes of organisms, and they are fundamental in evolutionary research. Therefore, new methods have been developed to obtain more reliable species phylogenetic trees. A highly reliable method is the construction of an ortholog data set based on sequence information of genes, which is then used to infer the species phylogenetic tree. However, although methods for constructing an ortholog data set for species phylogenetic analysis have been developed, they cannot remove some paralogs, which is necessary for reliable species phylogenetic inference. To address the limitations of current methods, we developed OrthoPhy, a program that excludes paralogs and constructs highly accurate ortholog data sets using taxonomic information dividing analyzed species into monophyletic groups. OrthoPhy can remove paralogs, detecting inconsistencies between taxonomic information and phylogenetic trees of candidate ortholog groups clustered by sequence similarity. Performance tests using evolutionary simulated sequences and real sequences of 40 bacteria revealed that the precision of ortholog inference by OrthoPhy is higher than that of existing programs. Additionally, the phylogenetic analysis of species was more accurate when performed using ortholog data sets constructed by OrthoPhy than that performed using data sets constructed by existing programs. Furthermore, we performed a benchmark test of the Quest for Orthologs using real sequence data and found that the concordance rate between the phylogenetic trees of orthologs inferred by OrthoPhy and those of species was higher than the rates obtained by other ortholog inference programs. 
Therefore, ortholog data sets constructed using OrthoPhy enabled a more accurate phylogenetic analysis of species than those constructed using the existing programs, and OrthoPhy can be used for the phylogenetic analysis of species even for distantly related species that have experienced many evolutionary events.
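The abstract says that paralogs are detected as inconsistencies between taxonomic information and the trees of candidate ortholog groups, but gives no algorithmic detail. One simple way to picture such a check is a monophyly test. The sketch below is an illustration under stated assumptions, not OrthoPhy's method: the tree is rooted, written as nested tuples, and every leaf carries a unique label. The names `clades` and `is_monophyletic` and the toy labels are hypothetical.

```python
def clades(tree):
    """Return (leaf set of this subtree, leaf sets of all subtrees within it)."""
    if isinstance(tree, str):          # a leaf
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves = leaves | child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def is_monophyletic(tree, group):
    """True if the members of `group` present in the tree form one clade."""
    all_leaves, found = clades(tree)
    present = frozenset(group) & all_leaves
    if len(present) <= 1:
        return True
    return present in found


group_a = {"A1", "A2", "A3"}           # one taxonomic group (assumed monophyletic)

# Consistent tree: the group forms a single clade.
consistent = ((("A1", "A2"), "A3"), ("B1", "B2"))
# Inconsistent tree: A2 groups with B2, as a hidden paralog might.
inconsistent = ((("A1", "B1"), "A3"), ("A2", "B2"))

ok = is_monophyletic(consistent, group_a)
conflict = not is_monophyletic(inconsistent, group_a)
```

A candidate group whose tree breaks the expected monophyly would then be flagged for closer inspection. Real gene trees are often unrooted and contain several genes per species, which this sketch deliberately ignores.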
Affiliation(s)
- Tomoaki Watanabe
- United Graduate School of Agricultural Science, Gifu University, Gifu, Japan
- Akinori Kure
- Graduate School of Integrated Science and Technology, Shizuoka University, Shizuoka, Japan
- Tokumasa Horiike
- Department of Bioresource Sciences, Shizuoka University, Shizuoka, Japan
6
Kress A, Poch O, Lecompte O, Thompson JD. Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events. Front Bioinform 2023; 3:1178926. [PMID: 37151482 PMCID: PMC10158824 DOI: 10.3389/fbinf.2023.1178926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Accepted: 04/05/2023] [Indexed: 05/09/2023] [Open access]
Abstract
Protein annotation errors can have significant consequences in a wide range of fields, ranging from protein structure and function prediction to biomedical research, drug discovery, and biotechnology. By comparing the domains of different proteins, scientists can identify common domains, classify proteins based on their domain architecture, and highlight proteins that have evolved differently in one or more species or clades. However, genome-wide identification of different protein domain architectures involves a complex error-prone pipeline that includes genome sequencing, prediction of gene exon/intron structures, and inference of protein sequences and domain annotations. Here we developed an automated fact-checking approach to distinguish true domain loss/gain events from false events caused by errors that occur during the annotation process. Using genome-wide ortholog sets and taking advantage of the high-quality human and Saccharomyces cerevisiae genome annotations, we analyzed the domain gain and loss events in the predicted proteomes of 9 non-human primates (NHP) and 20 non-S. cerevisiae fungi (NSF) as annotated in the UniProt and InterPro databases. Our approach allowed us to quantify the impact of errors on estimates of protein domain gains and losses, and we show that domain losses are over-estimated ten-fold and three-fold in the NHP and NSF proteins respectively. This is in line with previous studies of gene-level losses, where issues with genome sequencing or gene annotation led to genes being falsely inferred as absent. In addition, we show that inconsistent protein domain annotations are a major factor contributing to the false events. For the first time, to our knowledge, we show that domain gains are also over-estimated by three-fold and two-fold respectively in NHP and NSF proteins.
Based on our more accurate estimates, we infer that true domain losses and gains in NHP with respect to humans are observed at similar rates, while domain gains in the more divergent NSF are observed twice as frequently as domain losses with respect to S. cerevisiae. This study highlights the need to critically examine the scientific validity of protein annotations, and represents a significant step toward scalable computational fact-checking methods that may one day mitigate the propagation of wrong information in protein databases.
7
Kapral TH, Farnhammer F, Zhao W, Lu ZJ, Zagrovic B. Widespread autogenous mRNA-protein interactions detected by CLIP-seq. Nucleic Acids Res 2022; 50:9984-9999. [PMID: 36107779 PMCID: PMC9508846 DOI: 10.1093/nar/gkac756] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/12/2021] [Revised: 07/12/2022] [Accepted: 08/24/2022] [Indexed: 02/02/2023] [Open access]
Abstract
Autogenous interactions between mRNAs and the proteins they encode are implicated in cellular feedback-loop regulation, but their extent and mechanistic foundation are unclear. It was recently hypothesized that such interactions may be common, reflecting the role of intrinsic nucleobase-amino acid affinities in shaping the genetic code's structure. Here we analyze a comprehensive set of CLIP-seq experiments involving multiple protocols and report on widespread autogenous interactions across different organisms. Specifically, 230 of 341 (67%) studied RNA-binding proteins (RBPs) interact with their own mRNAs, with a heavy enrichment among high-confidence hits and a preference for coding sequence binding. We account for different confounding variables, including physical (overexpression and proximity during translation), methodological (difference in CLIP protocols, peak callers and cell types) and statistical (treatment of null backgrounds). In particular, we demonstrate a high statistical significance of autogenous interactions by sampling null distributions of fixed-margin interaction matrices. Furthermore, we study the dependence of autogenous binding on the presence of RNA-binding motifs and structured domains in RBPs. Finally, we show that intrinsic nucleobase-amino acid affinities favor co-aligned binding between mRNA coding regions and the proteins they encode. Our results suggest a central role for autogenous interactions in RBP regulation and support the possibility of a fundamental connection between coding and binding.
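The abstract mentions sampling null distributions of fixed-margin interaction matrices without describing the sampler. As a rough illustration only, the sketch below uses the standard checkerboard-swap randomization, a swapped-in technique that may differ from the authors' procedure. It assumes a square binary matrix whose rows are RBPs and whose columns are the corresponding mRNAs, so that diagonal entries mark autogenous interactions. All function names and the toy matrix are hypothetical.

```python
import random


def swap_randomize(matrix, n_swaps, rng):
    """Copy `matrix` and attempt n_swaps checkerboard swaps.
    Each accepted swap keeps every row sum and column sum unchanged."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_swaps):
        r1, r2 = rng.randrange(n_rows), rng.randrange(n_rows)
        c1, c2 = rng.randrange(n_cols), rng.randrange(n_cols)
        # Pattern [[1, 0], [0, 1]] becomes [[0, 1], [1, 0]].
        if m[r1][c1] and m[r2][c2] and not m[r1][c2] and not m[r2][c1]:
            m[r1][c1] = m[r2][c2] = 0
            m[r1][c2] = m[r2][c1] = 1
    return m


def diagonal_hits(matrix):
    """Number of autogenous (row i, column i) interactions."""
    return sum(matrix[i][i] for i in range(min(len(matrix), len(matrix[0]))))


def empirical_p_value(matrix, n_null=200, n_swaps=1000, seed=0):
    """Share of null matrices with at least as many diagonal hits as observed,
    with the usual +1 correction so the estimate is never exactly zero."""
    rng = random.Random(seed)
    observed = diagonal_hits(matrix)
    at_least = sum(
        diagonal_hits(swap_randomize(matrix, n_swaps, rng)) >= observed
        for _ in range(n_null)
    )
    return (1 + at_least) / (1 + n_null)


# Toy interaction matrix (made-up values, not data from the study).
toy = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]
null_example = swap_randomize(toy, 500, random.Random(1))
p_value = empirical_p_value(toy, n_null=50, n_swaps=200)
```

Fixing the margins matters because some RBPs bind many transcripts and some transcripts are bound by many RBPs. A null that ignored this would overstate how surprising the diagonal hits are.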
Affiliation(s)
- Thomas H Kapral
- Department of Structural and Computational Biology, Max Perutz Labs, University of Vienna, Vienna, A-1030, Austria; Vienna BioCenter PhD Program, Doctoral School of the University of Vienna and Medical University of Vienna, Vienna, A-1030, Austria
- Fiona Farnhammer
- Department of Structural and Computational Biology, Max Perutz Labs, University of Vienna, Vienna, A-1030, Austria; Division of Metabolism, University Children's Hospital Zurich and Children's Research Center, University of Zurich, Zurich, 8032, Switzerland; Division of Oncology, University Children's Hospital Zurich and Children's Research Center, University of Zurich, Zurich, 8032, Switzerland
- Weihao Zhao
- MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Zhi J Lu
- MOE Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Bojan Zagrovic
- To whom correspondence should be addressed. Tel: +43 1 4277 52271; Fax: +43 1 4277 9522;
8
Foley S, Vlasova A, Marcet-Houben M, Gabaldón T, Hinman VF. Evolutionary analyses of genes in Echinodermata offer insights towards the origin of metazoan phyla. Genomics 2022; 114:110431. [PMID: 35835427 PMCID: PMC9552553 DOI: 10.1016/j.ygeno.2022.110431] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2021] [Revised: 05/10/2022] [Accepted: 07/06/2022] [Indexed: 11/24/2022]
Abstract
Despite recent studies discussing the evolutionary impacts of gene duplications and losses among metazoans, the genomic basis for the evolution of phyla remains enigmatic. Here, we employ phylogenomic approaches to search for orthologous genes without known functions among echinoderms, and subsequently use them to guide the identification of their homologs across other metazoans. Our final set of 14 genes was obtained via a suite of homology prediction tools, gene expression data, gene ontology, and generating the Strongylocentrotus purpuratus phylome. The gene set was subjected to selection pressure analyses, which indicated that they are highly conserved and under negative selection. Their presence across broad taxonomic depths suggests that genes required to form a phylum are ancestral to that phylum. Therefore, rather than de novo gene genesis, we posit that evolutionary forces such as selection on existing genomic elements over large timescales may drive divergence and contribute to the emergence of phyla.
Affiliation(s)
- Saoirse Foley
- Department of Biological Sciences, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA; Echinobase #6-46, Mellon Institute, 4400 Fifth Ave, Pittsburgh, PA 15213, USA.
- Anna Vlasova
- Barcelona Supercomputing Centre (BSC-CNS), Jordi Girona, 29, 08034 Barcelona, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, 08028 Barcelona, Spain
- Marina Marcet-Houben
- Barcelona Supercomputing Centre (BSC-CNS), Jordi Girona, 29, 08034 Barcelona, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, 08028 Barcelona, Spain
- Toni Gabaldón
- Barcelona Supercomputing Centre (BSC-CNS), Jordi Girona, 29, 08034 Barcelona, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, 08028 Barcelona, Spain; Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain
- Veronica F Hinman
- Department of Biological Sciences, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA; Echinobase #6-46, Mellon Institute, 4400 Fifth Ave, Pittsburgh, PA 15213, USA
9
Seçilmiş D, Hillerton T, Sonnhammer ELL. GRNbenchmark - a web server for benchmarking directed gene regulatory network inference methods. Nucleic Acids Res 2022; 50:W398-W404. [PMID: 35609981 PMCID: PMC9252735 DOI: 10.1093/nar/gkac377] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/28/2022] [Revised: 04/20/2022] [Accepted: 05/19/2022] [Indexed: 11/30/2022] [Open access]
Abstract
Accurate inference of gene regulatory networks (GRN) is an essential component of systems biology, and there is a constant development of new inference methods. The most common approach to assess accuracy for publications is to benchmark the new method against a selection of existing algorithms. This often leads to a very limited comparison, potentially biasing the results, which may stem from tuning the benchmark's properties or incorrect application of other methods. These issues can be avoided by a web server with a broad range of data properties and inference algorithms, that makes it easy to perform comprehensive benchmarking of new methods, and provides a more objective assessment. Here we present https://GRNbenchmark.org/ - a new web server for benchmarking GRN inference methods, which provides the user with a set of benchmarks with several datasets, each spanning a range of properties including multiple noise levels. As soon as the web server has performed the benchmarking, the accuracy results are made privately available to the user via interactive summary plots and underlying curves. The user can then download these results for any purpose, and decide whether or not to make them public to share with the community.
Affiliation(s)
- Deniz Seçilmiş
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Thomas Hillerton
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
10
Persson E, Sonnhammer ELL. InParanoid-DIAMOND: faster orthology analysis with the InParanoid algorithm. Bioinformatics 2022; 38:2918-2919. [PMID: 35561192 PMCID: PMC9113356 DOI: 10.1093/bioinformatics/btac194] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/17/2021] [Revised: 03/14/2022] [Accepted: 03/29/2022] [Indexed: 02/03/2023] [Open access]
Abstract
SUMMARY Predicting orthologs, genes in different species having shared ancestry, is an important task in bioinformatics. Orthology prediction tools are required to make accurate and fast predictions, in order to analyze large amounts of data within a feasible time frame. InParanoid is a well-known algorithm for orthology analysis, shown to perform well in benchmarks, but having the major limitation of long runtimes on large datasets. Here, we present an update to the InParanoid algorithm that can use the faster tool DIAMOND instead of BLAST for the homolog search step. We show that it reduces the runtime by 94%, while still obtaining similar performance in the Quest for Orthologs benchmark. AVAILABILITY AND IMPLEMENTATION The source code is available at (https://bitbucket.org/sonnhammergroup/inparanoid). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Emma Persson
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, 17121 Solna, Sweden
11
Nevers Y, Jones TEM, Jyothi D, Yates B, Ferret M, Portell-Silva L, Codo L, Cosentino S, Marcet-Houben M, Vlasova A, Poidevin L, Kress A, Hickman M, Persson E, Piližota I, Guijarro-Clarke C, Iwasaki W, Lecompte O, Sonnhammer E, Roos DS, Gabaldón T, Thybert D, Thomas PD, Hu Y, Emms DM, Bruford E, Capella-Gutierrez S, Martin MJ, Dessimoz C, Altenhoff A. The Quest for Orthologs orthology benchmark service in 2022. Nucleic Acids Res 2022; 50:W623-W632. [PMID: 35552456 PMCID: PMC9252809 DOI: 10.1093/nar/gkac330] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 02/23/2022] [Revised: 04/07/2022] [Accepted: 04/30/2022] [Indexed: 11/15/2022] [Open access]
Abstract
The Orthology Benchmark Service (https://orthology.benchmarkservice.org) is the gold standard for orthology inference evaluation, supported and maintained by the Quest for Orthologs consortium. It is an essential resource to compare existing and new methods of orthology inference (the bedrock for many comparative genomics and phylogenetic analysis) over a standard dataset and through common procedures. The Quest for Orthologs Consortium is dedicated to maintaining the resource up to date, through regular updates of the Reference Proteomes and increasingly accessible data through the OpenEBench platform. For this update, we have added a new benchmark based on curated orthology assertion from the Vertebrate Gene Nomenclature Committee, and provided an example meta-analysis of the public predictions present on the platform.
Collapse
Affiliation(s)
- Yannis Nevers
- To whom correspondence should be addressed. Tel: +41 21 692 5449;
| | - Tamsin E M Jones
- HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Dushyanth Jyothi
- Protein Function development, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Bethan Yates
- HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Meritxell Ferret
- Barcelona Supercomputing Centre (BSC-CNS). Plaça Eusebi Güell, 1-3 08034 Barcelona, Spain
- Laura Portell-Silva
- Barcelona Supercomputing Centre (BSC-CNS). Plaça Eusebi Güell, 1-3 08034 Barcelona, Spain
- Laia Codo
- Barcelona Supercomputing Centre (BSC-CNS). Plaça Eusebi Güell, 1-3 08034 Barcelona, Spain
- Salvatore Cosentino
- Department of Biological Sciences, Graduate School of Science, the University of Tokyo, Tokyo, Japan
- Marina Marcet-Houben
- Barcelona Supercomputing Centre (BSC-CNS). Plaça Eusebi Güell, 1-3 08034 Barcelona, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, 08028 Barcelona, Spain
- Anna Vlasova
- Barcelona Supercomputing Centre (BSC-CNS). Plaça Eusebi Güell, 1-3 08034 Barcelona, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, 08028 Barcelona, Spain
- Laetitia Poidevin
- Department of Computer Science, ICube, UMR 7357, Centre de Recherche en Biomédecine de Strasbourg, University of Strasbourg, CNRS, Strasbourg, France; BiGEst-ICube Platform, ICube, UMR 7357, Centre de Recherche en Biomédecine de Strasbourg, University of Strasbourg, CNRS, Strasbourg, France
- Arnaud Kress
- Department of Computer Science, ICube, UMR 7357, Centre de Recherche en Biomédecine de Strasbourg, University of Strasbourg, CNRS, Strasbourg, France; BiGEst-ICube Platform, ICube, UMR 7357, Centre de Recherche en Biomédecine de Strasbourg, University of Strasbourg, CNRS, Strasbourg, France
- Mark Hickman
- Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
- Emma Persson
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
- Ivana Piližota
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Cristina Guijarro-Clarke
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Wataru Iwasaki
- Department of Biological Sciences, Graduate School of Science, the University of Tokyo, Tokyo, Japan; Department of Integrated Biosciences, Graduate School of Frontier Sciences, the University of Tokyo, Kashiwa, Japan
- Odile Lecompte
- Department of Computer Science, ICube, UMR 7357, Centre de Recherche en Biomédecine de Strasbourg, University of Strasbourg, CNRS, Strasbourg, France
- Erik Sonnhammer
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
- David S Roos
- Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
- Toni Gabaldón
- Barcelona Supercomputing Centre (BSC-CNS). Plaça Eusebi Güell, 1-3 08034 Barcelona, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, 08028 Barcelona, Spain; Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain; Centro de Investigaciones Biomédicas en Red de Enfermedades Infecciosas, Barcelona, Spain
- David Thybert
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Paul D Thomas
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90032, USA
- Yanhui Hu
- Department of Genetics, Blavatnik Institute, Harvard Medical School, Harvard University, Boston, MA 02115, USA
- David M Emms
- Department of Plant Sciences, University of Oxford, Oxford OX1 3RB, UK
- Elspeth Bruford
- HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK; Department of Haematology, University of Cambridge School of Clinical Medicine, Cambridge, UK
- Maria J Martin
- Protein Function development, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Christophe Dessimoz
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland; Swiss Institute for Bioinformatics, University of Lausanne, Lausanne, Switzerland; Department of Computer Science, University College London, London, UK; Centre for Life's Origins and Evolution, Department of Genetics, Evolution and Environment, University College London, London, UK
- Adrian Altenhoff
- Swiss Institute for Bioinformatics, University of Lausanne, Lausanne, Switzerland; Computer Science Department, ETH Zurich, Zurich, Switzerland
12
Crow M, Suresh H, Lee J, Gillis J. Coexpression reveals conserved gene programs that co-vary with cell type across kingdoms. Nucleic Acids Res 2022; 50:4302-4314. [PMID: 35451481 PMCID: PMC9071420 DOI: 10.1093/nar/gkac276] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/29/2021] [Revised: 03/30/2022] [Accepted: 04/08/2022] [Indexed: 12/24/2022] Open
Abstract
What makes a mouse a mouse, and not a hamster? Differences in gene regulation between the two organisms play a critical role. Comparative analysis of gene coexpression networks provides a general framework for investigating the evolution of gene regulation across species. Here, we compare coexpression networks from 37 species and quantify the conservation of gene activity 1) as a function of evolutionary time, 2) across orthology prediction algorithms, and 3) with reference to cell- and tissue-specificity. We find that ancient genes are expressed in multiple cell types and have well-conserved coexpression patterns; however, they are expressed at different levels across cell types. Thus, differential regulation of ancient gene programs contributes to transcriptional cell identity. We propose that this differential regulation may play a role in cell diversification in both the animal and plant kingdoms.
Affiliation(s)
- Megan Crow
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor NY, USA
- Hamsini Suresh
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor NY, USA
- John Lee
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor NY, USA
- Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor NY, USA
13
Cohen N, Kahana A, Schuldiner M. A Similarity-Based Method for Predicting Enzymatic Functions in Yeast Uncovers a New AMP Hydrolase. J Mol Biol 2022; 434:167478. [PMID: 35123996 PMCID: PMC9005783 DOI: 10.1016/j.jmb.2022.167478] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2021] [Revised: 01/22/2022] [Accepted: 01/25/2022] [Indexed: 11/01/2022]
Abstract
Despite decades of research and the availability of the full genomic sequence of the baker's yeast Saccharomyces cerevisiae, a large fraction of its genome is still not functionally annotated. This hinders our ability to fully understand cellular activity and suggests that many additional processes await discovery. Recent years have seen an explosion of high-quality genomic and structural data from multiple organisms, ranging from bacteria to mammals. New computational methods now allow us to integrate these data and extract meaningful insights into the functional identity of uncharacterized proteins in yeast. Here, we created a database of sensitive sequence similarity predictions for all yeast proteins. We use this information to identify candidate enzymes for known biochemical reactions whose enzymes are unidentified, and show how this provides a powerful basis for experimental validation. Using one pathway as a test case, we assign a new function to the previously uncharacterized enzyme Yhr202w as an extracellular AMP hydrolase in the NAD degradation pathway. Yhr202w, which we now term Smn1 for Scavenger MonoNucleotidase 1, is a highly conserved protein that is similar to the human protein E5NT/CD73, which is associated with multiple cancers. Hence, our new methodology provides a paradigm that can be adapted to other organisms for uncovering new enzymatic functions of uncharacterized proteins.
Affiliation(s)
- Nir Cohen
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel
- Amit Kahana
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel. https://twitter.com/AmitKahana
- Maya Schuldiner
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel.
14
Abstract
Determining the evolutionary relationships between genes is fundamental to comparative biological research. Here, we present SHOOT. SHOOT searches a user query sequence against a database of phylogenetic trees and returns a tree with the query sequence correctly placed within it. We show that SHOOT performs this analysis with comparable speed to a BLAST search. We demonstrate that SHOOT phylogenetic placements are as accurate as conventional tree inference, and it can identify orthologs with high accuracy. In summary, SHOOT is a fast and accurate tool for phylogenetic analyses of novel query sequences. It is available online at www.shoot.bio.
15
Stewart PS, Williamson KS, Boegli L, Hamerly T, White B, Scott L, Hu X, Mumey BM, Franklin MJ, Bothner B, Vital-Lopez FG, Wallqvist A, James GA. Search for a Shared Genetic or Biochemical Basis for Biofilm Tolerance to Antibiotics across Bacterial Species. Antimicrob Agents Chemother 2022:e0002122. [PMID: 35266829 DOI: 10.1128/aac.00021-22] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/19/2022] Open
Abstract
Is there a universal genetically programmed defense providing tolerance to antibiotics when bacteria grow as biofilms? A comparison between biofilms of three different bacterial species by transcriptomic and metabolomic approaches uncovered no evidence of one. Single-species biofilms of three bacterial species (Pseudomonas aeruginosa, Staphylococcus aureus, and Acinetobacter baumannii) were grown in vitro for 3 days and then challenged with respective antibiotics (ciprofloxacin, daptomycin, and tigecycline) for an additional 24 h. All three microorganisms displayed reduced susceptibility in biofilms compared to planktonic cultures. Global transcriptomic profiling of gene expression comparing biofilm to planktonic and antibiotic-treated biofilm to untreated biofilm was performed. Extracellular metabolites were measured to characterize the utilization of carbon sources between biofilms, treated biofilms, and planktonic cells. While all three bacteria exhibited a species-specific signature of stationary phase, no conserved gene, gene set, or common functional pathway could be identified that changed consistently across the three microorganisms. Across the three species, glucose consumption was increased in biofilms compared to planktonic cells, and alanine and aspartic acid utilization were decreased in biofilms compared to planktonic cells. The reasons for these changes were not readily apparent in the transcriptomes. No common shift in the utilization pattern of carbon sources was discerned when comparing untreated to antibiotic-exposed biofilms. Overall, our measurements do not support the existence of a common genetic or biochemical basis for biofilm tolerance against antibiotics. Rather, there are likely myriad genes, proteins, and metabolic pathways that influence the physiological state of individual microorganisms in biofilms and contribute to antibiotic tolerance.
16
Thomas PD, Ebert D, Muruganujan A, Mushayahama T, Albou L, Mi H. PANTHER: Making genome-scale phylogenetics accessible to all. Protein Sci 2022; 31:8-22. [PMID: 34717010 PMCID: PMC8740835 DOI: 10.1002/pro.4218] [Citation(s) in RCA: 372] [Impact Index Per Article: 186.0] [Received: 08/26/2021] [Revised: 10/24/2021] [Accepted: 10/26/2021] [Indexed: 02/03/2023]
Abstract
Phylogenetics is a powerful tool for analyzing protein sequences, by inferring their evolutionary relationships to other proteins. However, phylogenetics analyses can be challenging: they are computationally expensive and must be performed carefully in order to avoid systematic errors and artifacts. Protein Analysis THrough Evolutionary Relationships (PANTHER; http://pantherdb.org) is a publicly available, user-focused knowledgebase that stores the results of an extensive phylogenetic reconstruction pipeline that includes computational and manual processes and quality control steps. First, fully reconciled phylogenetic trees (including ancestral protein sequences) are reconstructed for a set of "reference" protein sequences obtained from fully sequenced genomes of organisms across the tree of life. Second, the resulting phylogenetic trees are manually reviewed and annotated with function evolution events: inferred gains and losses of protein function along branches of the phylogenetic tree. Here, we describe in detail the current contents of PANTHER, how those contents are generated, and how they can be used in a variety of applications. The PANTHER knowledgebase can be downloaded or accessed via an extensive API. In addition, PANTHER provides software tools to facilitate the application of the knowledgebase to common protein sequence analysis tasks: exploring an annotated genome by gene function; performing "enrichment analysis" of lists of genes; annotating a single sequence or large batch of sequences by homology; and assessing the likelihood that a genetic variant at a particular site in a protein will have deleterious effects.
Affiliation(s)
- Paul D. Thomas
- Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
- Dustin Ebert
- Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
- Anushya Muruganujan
- Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
- Tremayne Mushayahama
- Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
- Laurent-Philippe Albou
- Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
- Huaiyu Mi
- Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
17
Fuentes D, Molina M, Chorostecki U, Capella-Gutiérrez S, Marcet-Houben M, Gabaldón T. PhylomeDB V5: an expanding repository for genome-wide catalogues of annotated gene phylogenies. Nucleic Acids Res 2021; 50:D1062-D1068. [PMID: 34718760 PMCID: PMC8728271 DOI: 10.1093/nar/gkab966] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 09/20/2021] [Revised: 10/02/2021] [Accepted: 10/05/2021] [Indexed: 12/20/2022] Open
Abstract
PhylomeDB is a unique knowledge base providing public access to minable and browsable catalogues of pre-computed genome-wide collections of annotated sequences, alignments and phylogenies (i.e. phylomes) of homologous genes, as well as to their corresponding phylogeny-based orthology and paralogy relationships. In addition, PhylomeDB trees and alignments can be downloaded for further processing to detect and date gene duplication events, infer past events of inter-species hybridization and horizontal gene transfer, as well as to uncover footprints of selection, introgression, gene conversion, or other relevant evolutionary processes in the genes and organisms of interest. Here, we describe the latest evolution of PhylomeDB (version 5). This new version includes a newly implemented web interface and several new functionalities such as optimized searching procedures, the possibility to create user-defined phylome collections, and a fully redesigned data structure. This release also represents a significant core data expansion, with the database providing access to 534 phylomes, comprising over 8 million trees, and homology relationships for genes in over 6000 species. This makes PhylomeDB the largest and most comprehensive public repository of gene phylogenies. PhylomeDB is available at http://www.phylomedb.org.
Affiliation(s)
- Diego Fuentes
- Barcelona Supercomputing Centre (BSC-CNS). Jordi Girona 29, 08034 Barcelona, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac 10, 08028 Barcelona, Spain
- Manuel Molina
- Barcelona Supercomputing Centre (BSC-CNS). Jordi Girona 29, 08034 Barcelona, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac 10, 08028 Barcelona, Spain
- Uciel Chorostecki
- Barcelona Supercomputing Centre (BSC-CNS). Jordi Girona 29, 08034 Barcelona, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac 10, 08028 Barcelona, Spain
- Marina Marcet-Houben
- Barcelona Supercomputing Centre (BSC-CNS). Jordi Girona 29, 08034 Barcelona, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac 10, 08028 Barcelona, Spain
- Toni Gabaldón
- Barcelona Supercomputing Centre (BSC-CNS). Jordi Girona 29, 08034 Barcelona, Spain; Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac 10, 08028 Barcelona, Spain; Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain
18
Castresana-Aguirre M, Persson E, Sonnhammer ELL. PathBIX-a web server for network-based pathway annotation with adaptive null models. Bioinform Adv 2021; 1:vbab010. [PMID: 36700096 PMCID: PMC9710673 DOI: 10.1093/bioadv/vbab010] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/08/2021] [Accepted: 06/30/2021] [Indexed: 01/28/2023]
Abstract
Motivation: Pathway annotation is a vital tool for interpreting and giving meaning to experimental data in life sciences. Numerous tools exist for this task, where the most recent generation of pathway enrichment analysis tools, network-based methods, utilize biological networks to gain a richer source of information as a basis of the analysis than merely the gene content. Network-based methods use the network crosstalk between the query gene set and the genes in known pathways, and compare this to a null model of random expectation. Results: We developed PathBIX, a novel web application for network-based pathway analysis, based on the recently published ANUBIX algorithm which has been shown to be more accurate than previous network-based methods. The PathBIX website performs pathway annotation for 21 species, and utilizes prefetched and preprocessed network data from FunCoup 5.0 networks and pathway data from three databases: KEGG, Reactome, and WikiPathways. Availability: https://pathbix.sbc.su.se/. Contact: erik.sonnhammer@scilifelab.se. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Miguel Castresana-Aguirre
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm 17121, Sweden
- Emma Persson
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm 17121, Sweden
- Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm 17121, Sweden. To whom correspondence should be addressed.
19
Tong YB, Shi MW, Qian SH, Chen YJ, Luo ZH, Tu YX, Xiong YL, Geng YJ, Chen C, Chen ZX. GenOrigin: A comprehensive protein-coding gene origination database on the evolutionary timescale of life. J Genet Genomics 2021:S1673-8527(21)00165-X. [PMID: 34538772 DOI: 10.1016/j.jgg.2021.03.018] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/02/2021] [Revised: 03/21/2021] [Accepted: 03/29/2021] [Indexed: 11/20/2022]
Abstract
The origination of new genes contributes to the biological diversity of life. New genes may quickly build their network, exert important functions, and generate novel phenotypes. Dating gene age and inferring the origination mechanisms of new genes, like primate-specific genes, is the basis for the functional study of the genes. However, no comprehensive resource of gene age estimates across species is available. Here, we systematically date the age of 9,102,113 protein-coding genes from 565 species in the Ensembl and Ensembl Genomes databases, including 82 bacteria, 57 protists, 134 fungi, 58 plants, 56 metazoa, and 178 vertebrates, using a protein-family-based pipeline with the Wagner parsimony algorithm. We also collect gene age estimate data from other studies and uniformly distribute the gene age estimates to time ranges in a million years for comparison across studies. All the data are cataloged into GenOrigin (http://genorigin.chenzxlab.cn/), a user-friendly new database of gene age estimates, where users can browse gene age estimates by species, age, and gene ontology. In GenOrigin, information such as gene age estimates, annotation, gene ontology, orthologs, and paralogs, as well as detailed gene presence/absence views for gene age inference based on the species tree with evolutionary timescale, is provided to researchers for exploring gene functions.
20
Yates B, Gray KA, Jones TEM, Bruford EA. Updates to HCOP: the HGNC comparison of orthology predictions tool. Brief Bioinform 2021; 22:6265175. [PMID: 33959747 PMCID: PMC8574622 DOI: 10.1093/bib/bbab155] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 12/09/2020] [Revised: 03/19/2021] [Accepted: 04/02/2021] [Indexed: 11/15/2022] Open
Abstract
Multiple resources currently exist that predict orthologous relationships between genes. These resources differ both in the methodologies used and in the species they make predictions for. The HGNC Comparison of Orthology Predictions (HCOP) search tool integrates and displays data from multiple ortholog prediction resources for a specified human gene or set of genes. An indication of the reliability of a prediction is provided by the number of resources that support it. HCOP was originally designed to show orthology predictions between human and mouse but has been expanded to include data from a current total of 20 selected vertebrate and model organism species. The HCOP pipeline used to fetch and integrate the information from the disparate ortholog and nomenclature data resources has recently been rewritten, both to enable the inclusion of new data and to take advantage of modern web technologies. Data from HCOP are used extensively in our work naming genes as the Vertebrate Gene Nomenclature Committee (https://vertebrate.genenames.org).
Affiliation(s)
- Bethan Yates
- HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Kristian A Gray
- HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Tamsin E M Jones
- HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Elspeth A Bruford
- HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Department of Haematology, University of Cambridge School of Clinical Medicine, Cambridge Biomedical Campus, Cambridge CB2 0AW, UK
21
Minot SS, Barry KC, Kasman C, Golob JL, Willis AD. geneshot: gene-level metagenomics identifies genome islands associated with immunotherapy response. Genome Biol 2021; 22:135. [PMID: 33952321 PMCID: PMC8097837 DOI: 10.1186/s13059-021-02355-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/09/2020] [Accepted: 04/16/2021] [Indexed: 01/28/2023] Open
Abstract
Researchers must be able to generate experimentally testable hypotheses from sequencing-based observational microbiome experiments to discover the mechanisms underlying the influence of gut microbes on human health. We describe geneshot, a novel bioinformatics tool for identifying testable hypotheses based on gene-level metagenomic analysis of WGS microbiome data. By applying geneshot to two independent previously published cohorts, we identify microbial genomic islands consistently associated with response to immune checkpoint inhibitor (ICI)-based cancer treatment in culturable type strains. The identified genomic islands are within operons involved in type II secretion, TonB-dependent transport, and bacteriophage growth.
Affiliation(s)
- Samuel S Minot
- Microbiome Research Initiative, Fred Hutch Cancer Research Center, Mail Stop E4-100, 1100 Fairview Ave. North, Seattle, WA, 98109, USA.
- Kevin C Barry
- Public Health Sciences Division and Immunotherapy Integrated Research Center, Fred Hutch Cancer Research Center, Seattle, WA, USA
- Jonathan L Golob
- Division of Infectious Diseases, University of Michigan Medical School, Ann Arbor, MI, USA
- Amy D Willis
- Department of Biostatistics, University of Washington, Seattle, WA, USA
22
Glover N, Sheppard S, Dessimoz C. Homoeolog Inference Methods Requiring Bidirectional Best Hits or Synteny Miss Many Pairs. Genome Biol Evol 2021; 13:6237894. [PMID: 33871639 PMCID: PMC8214411 DOI: 10.1093/gbe/evab077] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Accepted: 04/12/2021] [Indexed: 12/22/2022] Open
Abstract
Homoeologs are pairs of genes or chromosomes in the same species that originated by speciation and were brought back together in the same genome by allopolyploidization. Bioinformatic methods for accurate homoeology inference are crucial for studying the evolutionary consequences of polyploidization, and homoeology is typically inferred on the basis of bidirectional best hit (BBH) and/or positional conservation (synteny). However, these methods neglect the fact that genes can duplicate and move, both prior to and after the allopolyploidization event. These duplications and movements can result in many-to-many and/or nonsyntenic homoeologs, which thus remain undetected and unstudied. Here, using the allotetraploid upland cotton (Gossypium hirsutum) as a case study, we show that conventional approaches indeed miss a substantial proportion of homoeologs. Additionally, we found that many of the missed pairs of homoeologs are broadly and highly expressed. A gene ontology analysis revealed that a high proportion of the nonsyntenic and non-BBH homoeologs are involved in protein translation and are likely to contribute to the functional repertoire of cotton. Thus, from an evolutionary and functional genomics standpoint, choosing a homoeolog inference method which does not solely rely on 1:1 relationship cardinality or synteny is crucial for not missing these potentially important homoeolog pairs.
Affiliation(s)
- Natasha Glover
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland; Center for Integrative Genomics, University of Lausanne, Switzerland; Department of Computational Biology, University of Lausanne, Switzerland
- Christophe Dessimoz
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland; Center for Integrative Genomics, University of Lausanne, Switzerland; Department of Computational Biology, University of Lausanne, Switzerland; Department of Genetics, Evolution, and Environment, University College London, United Kingdom; Department of Computer Science, University College London, United Kingdom
23
Linard B, Ebersberger I, McGlynn SE, Glover N, Mochizuki T, Patricio M, Lecompte O, Nevers Y, Thomas PD, Gabaldón T, Sonnhammer E, Dessimoz C, Uchiyama I. Ten Years of Collaborative Progress in the Quest for Orthologs. Mol Biol Evol 2021; 38:3033-3045. [PMID: 33822172 PMCID: PMC8321534 DOI: 10.1093/molbev/msab098] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 10/27/2020] [Revised: 02/07/2021] [Accepted: 04/01/2021] [Indexed: 12/19/2022] Open
Abstract
Accurate determination of the evolutionary relationships between genes is a foundational challenge in biology. Homology (evolutionary relatedness) is in many cases readily determined based on sequence similarity analysis. By contrast, determining whether two genes directly descended from a common ancestor by a speciation event (orthologs) or a duplication event (paralogs) is more challenging, yet provides critical information on the history of a gene. Since 2009, this task has been the focus of the Quest for Orthologs (QFO) Consortium. The sixth QFO meeting took place in Okazaki, Japan in conjunction with the 67th National Institute for Basic Biology conference. Here, we report recent advances, applications, and oncoming challenges that were discussed during the conference. Steady progress has been made toward standardization and scalability of new and existing tools. A feature of the conference was the presentation of a panel of accessible tools for phylogenetic profiling and several developments to bring orthology beyond the gene unit, from domains to networks. This meeting brought to light several challenges to come: leveraging orthology computations to get the most out of the incoming avalanche of genomic data, integrating orthology from domain to biological network levels, building better gene models, and adapting orthology approaches to the broad evolutionary and genomic diversity recognized in different forms of life and viruses.
Affiliation(s)
- Benjamin Linard
  - LIRMM, University of Montpellier, CNRS, Montpellier, France
  - SPYGEN, Le Bourget-du-Lac, France
- Ingo Ebersberger
  - Institute of Cell Biology and Neuroscience, Goethe University Frankfurt, Frankfurt, Germany
  - Senckenberg Biodiversity and Climate Research Centre (S-BIKF), Frankfurt, Germany
  - LOEWE Center for Translational Biodiversity Genomics (TBG), Frankfurt, Germany
- Shawn E McGlynn
  - Earth-Life Science Institute, Tokyo Institute of Technology, Meguro, Tokyo, Japan
  - Blue Marble Space Institute of Science, Seattle, WA, USA
- Natasha Glover
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Tomohiro Mochizuki
  - Earth-Life Science Institute, Tokyo Institute of Technology, Meguro, Tokyo, Japan
- Mateus Patricio
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
- Odile Lecompte
  - Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de Médecine Translationnelle de Strasbourg, Strasbourg, France
- Yannis Nevers
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Paul D Thomas
  - Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
- Toni Gabaldón
  - Barcelona Supercomputing Centre (BSC-CNS), Jordi Girona, Barcelona, Spain
  - Institute for Research in Biomedicine (IRB), The Barcelona Institute of Science and Technology (BIST), Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
- Erik Sonnhammer
  - Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
- Christophe Dessimoz
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Department of Computer Science, University College London, London, United Kingdom
  - Department of Genetics, Evolution and Environment, University College London, London, United Kingdom
- Ikuo Uchiyama
  - Department of Theoretical Biology, National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Aichi, Japan
|
24
|
Edwards RJ, Field MA, Ferguson JM, Dudchenko O, Keilwagen J, Rosen BD, Johnson GS, Rice ES, Hillier LD, Hammond JM, Towarnicki SG, Omer A, Khan R, Skvortsova K, Bogdanovic O, Zammit RA, Aiden EL, Warren WC, Ballard JWO. Chromosome-length genome assembly and structural variations of the primal Basenji dog (Canis lupus familiaris) genome. BMC Genomics 2021; 22:188. [PMID: 33726677 PMCID: PMC7962210 DOI: 10.1186/s12864-021-07493-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 12/09/2020] [Accepted: 02/28/2021] [Indexed: 12/21/2022] [Open access]
Abstract
BACKGROUND: Basenjis are considered an ancient dog breed of central African origins that still live and hunt with tribesmen in the African Congo. Nicknamed the barkless dog, Basenjis possess unique phylogeny, geographical origins and traits, making their genome structure of great interest. The increasing number of available canid reference genomes allows us to examine how the choice of reference genome, in terms of both assembly quality and breed relatedness, affects downstream analyses.
RESULTS: Here, we report two high-quality de novo Basenji genome assemblies: a female, China (CanFam_Bas), and a male, Wags. We conduct pairwise comparisons and report structural variations between assembled genomes of three dog breeds: Basenji (CanFam_Bas), Boxer (CanFam3.1) and German Shepherd Dog (GSD) (CanFam_GSD). CanFam_Bas is superior to CanFam3.1 in terms of genome contiguity and comparable overall to the high-quality CanFam_GSD assembly. By aligning short-read data from 58 representative dog breeds to three reference genomes, we demonstrate how the choice of reference genome significantly impacts both read mapping and variant detection.
CONCLUSIONS: The growing number of high-quality canid reference genomes means the choice of reference genome is an increasingly critical decision in subsequent canid variant analyses. The basal position of the Basenji makes it suitable for variant analysis for targeted applications of specific dog breeds. However, we believe more comprehensive analyses across the entire family of canids are better suited to a pangenome approach. Collectively, this work highlights the importance of the choice of reference genome in all variation studies.
Affiliation(s)
- Richard J. Edwards
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052 Australia
- Matt A. Field
  - Centre for Tropical Bioinformatics and Molecular Biology, Australian Institute of Tropical Health and Medicine, James Cook University, Cairns, QLD 4878 Australia
  - John Curtin School of Medical Research, Australian National University, Canberra, ACT 2600 Australia
- James M. Ferguson
  - Kinghorn Center for Clinical Genomics, Garvan Institute of Medical Research, Victoria Street, Darlinghurst, NSW 2010 Australia
- Olga Dudchenko
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX USA
  - Department of Computer Science, Rice University, Houston, TX USA
  - Center for Theoretical and Biological Physics, Rice University, Houston, TX USA
- Jens Keilwagen
  - Julius Kühn-Institut, Erwin-Baur-Str. 27, 06484 Quedlinburg, Germany
- Benjamin D. Rosen
  - Animal Genomics and Improvement Laboratory, Agricultural Research Service USDA, Beltsville, MD 20705 USA
- Gary S. Johnson
  - Department of Veterinary Pathobiology, University of Missouri, Columbia, MO 65211 USA
- Edward S. Rice
  - Department of Surgery, University of Missouri, Columbia, MO 65211 USA
- Jillian M. Hammond
  - Kinghorn Center for Clinical Genomics, Garvan Institute of Medical Research, Victoria Street, Darlinghurst, NSW 2010 Australia
- Samuel G. Towarnicki
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052 Australia
- Arina Omer
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX USA
  - Department of Computer Science, Rice University, Houston, TX USA
- Ruqayya Khan
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX USA
  - Department of Computer Science, Rice University, Houston, TX USA
- Ksenia Skvortsova
  - Genomics and Epigenetics Division, Garvan Institute of Medical Research, Victoria Street, Darlinghurst, NSW 2010 Australia
  - St Vincent’s Clinical School, Faculty of Medicine, University of New South Wales, Sydney, NSW 2010 Australia
- Ozren Bogdanovic
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052 Australia
  - Genomics and Epigenetics Division, Garvan Institute of Medical Research, Victoria Street, Darlinghurst, NSW 2010 Australia
- Robert A. Zammit
  - Vineyard Veterinary Hospital, 703 Windsor Rd, Vineyard, NSW 2765 Australia
- Erez Lieberman Aiden
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX USA
  - Department of Computer Science, Rice University, Houston, TX USA
  - Center for Theoretical and Biological Physics, Rice University, Houston, TX USA
  - Faculty of Science, UWA School of Agriculture and Environment, University of Western Australia, Perth, WA 6009 Australia
  - Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, Shanghai, China
- Wesley C. Warren
  - Department of Animal Sciences, University of Missouri, Columbia, MO 65211 USA
- J. William O. Ballard
  - Department of Ecology, Environment and Evolution, La Trobe University, Melbourne, Victoria 3086 Australia
  - School of Biosciences, University of Melbourne, Parkville, Victoria 3052 Australia
|
25
|
Persson E, Castresana-Aguirre M, Buzzao D, Guala D, Sonnhammer ELL. FunCoup 5: Functional Association Networks in All Domains of Life, Supporting Directed Links and Tissue-Specificity. J Mol Biol 2021; 433:166835. [PMID: 33539890 DOI: 10.1016/j.jmb.2021.166835] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/01/2020] [Revised: 12/18/2020] [Accepted: 01/15/2021] [Indexed: 02/07/2023]
Abstract
FunCoup (https://funcoup.sbc.su.se) is one of the most comprehensive functional association networks of genes/proteins available. Functional associations are inferred by integrating different types of evidence using a redundancy-weighted naïve Bayesian approach, combined with orthology transfer. FunCoup's high coverage comes from using eleven different types of evidence and extensive transfer of information between species. Since the latest update of the database, the availability of source data has improved drastically, and user expectations of a tool for functional associations have grown. To meet these requirements, we have made a new release of FunCoup with updated source data and improved functionality. FunCoup 5 now includes 22 species from all domains of life, and the source data for evidence, gold standards, and genomes have been updated to the latest available versions. In this new release, directed regulatory links inferred from transcription factor binding can be visualized in the network viewer for the human interactome. Another new feature is the ability to filter by genes expressed in a certain tissue in the network viewer. FunCoup 5 further includes the SARS-CoV-2 proteome, allowing users to visualize and analyze interactions between SARS-CoV-2 and human proteins in order to better understand COVID-19. This new release of FunCoup constitutes a major advance for users, with updated sources, new species and improved functionality for analysis of the networks.
Affiliation(s)
- Emma Persson
  - Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Miguel Castresana-Aguirre
  - Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Davide Buzzao
  - Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Dimitri Guala
  - Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Erik L L Sonnhammer
  - Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
|
26
|
Abstract
Orthobench is the standard benchmark to assess the accuracy of orthogroup inference methods. It contains 70 expert-curated reference orthogroups (RefOGs) that span the Bilateria and cover a range of different challenges for orthogroup inference. Here, we leveraged improvements in tree inference algorithms and computational resources to reinterrogate these RefOGs and carry out an extensive phylogenetic delineation of their composition. This phylogenetic revision altered the membership of 31 of the 70 RefOGs, with 24 subject to extensive revision and 7 that required minor changes. We further used these revised and updated RefOGs to provide an assessment of the orthogroup inference accuracy of widely used orthogroup inference methods. Finally, we provide an open-source benchmarking suite to support the future development and use of the Orthobench benchmark.
Affiliation(s)
- David M Emms
  - Department of Plant Sciences, University of Oxford, United Kingdom
- Steven Kelly
  - Department of Plant Sciences, University of Oxford, United Kingdom
|