1
|
Percudani R, De Rito C. Predicting Protein Function in the AI and Big Data Era. Biochemistry 2025. [PMID: 40380914 DOI: 10.1021/acs.biochem.5c00186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/19/2025]
Abstract
It is an exciting time for researchers working to link proteins to their functions. Most techniques for extracting functional information from genomic sequences were developed several years ago, with major progress driven by the availability of big data. Now, groundbreaking advances in deep-learning and AI-based methods have enriched protein databases with three-dimensional information and offer the potential to predict biochemical properties and biomolecular interactions, providing key functional insights. This progress is expected to increase the proportion of functionally bright proteins in databases and deepen our understanding of life at the molecular level.
Collapse
Affiliation(s)
- Riccardo Percudani
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, 43124 Parma, Italy
| | - Carlo De Rito
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, 43124 Parma, Italy
| |
Collapse
|
2
|
Lakshman AH, Wright ES. EvoWeaver: large-scale prediction of gene functional associations from coevolutionary signals. Nat Commun 2025; 16:3878. [PMID: 40274827 PMCID: PMC12022180 DOI: 10.1038/s41467-025-59175-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2025] [Accepted: 04/09/2025] [Indexed: 04/26/2025] Open
Abstract
The known universe of uncharacterized proteins is expanding far faster than our ability to annotate their functions through laboratory study. Computational annotation approaches rely on similarity to previously studied proteins, thereby ignoring unstudied proteins. Coevolutionary approaches hold promise for injecting new information into our knowledge of the protein universe by linking proteins through 'guilt-by-association'. However, existing coevolutionary algorithms have insufficient accuracy and scalability to connect the entire universe of proteins. We present EvoWeaver, a method that weaves together 12 signals of coevolution to quantify the degree of shared evolution between genes. EvoWeaver accurately identifies proteins involved in protein complexes or separate steps of a biochemical pathway. We show the merits of EvoWeaver by partly reconstructing known biochemical pathways without any prior knowledge other than that available from genomic sequences. Applying EvoWeaver to 1545 gene groups from 8564 genomes reveals missing connections in popular databases and potentially undiscovered links between proteins.
Collapse
Affiliation(s)
- Aidan H Lakshman
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Erik S Wright
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.
- Center for Evolutionary Biology and Medicine, Pittsburgh, PA, USA.
| |
Collapse
|
3
|
Moi D, Dessimoz C. Phylogenetic profiling in eukaryotes comes of age. Proc Natl Acad Sci U S A 2023; 120:e2305013120. [PMID: 37126713 PMCID: PMC10175774 DOI: 10.1073/pnas.2305013120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/03/2023] Open
Affiliation(s)
- David Moi
- Department of Computational Biology, University of Lausanne, 1015Lausanne, Switzerland
- Swiss Institute of Bioinformatics, 1015Lausanne, Switzerland
| | - Christophe Dessimoz
- Department of Computational Biology, University of Lausanne, 1015Lausanne, Switzerland
- Swiss Institute of Bioinformatics, 1015Lausanne, Switzerland
| |
Collapse
|
4
|
Dembech E, Malatesta M, De Rito C, Mori G, Cavazzini D, Secchi A, Morandin F, Percudani R. Identification of hidden associations among eukaryotic genes through statistical analysis of coevolutionary transitions. Proc Natl Acad Sci U S A 2023; 120:e2218329120. [PMID: 37043529 PMCID: PMC10120013 DOI: 10.1073/pnas.2218329120] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2022] [Accepted: 03/10/2023] [Indexed: 04/13/2023] Open
Abstract
Coevolution at the gene level, as reflected by correlated events of gene loss or gain, can be revealed by phylogenetic profile analysis. The optimal method and metric for comparing phylogenetic profiles, especially in eukaryotic genomes, are not yet established. Here, we describe a procedure suitable for large-scale analysis, which can reveal coevolution based on the assessment of the statistical significance of correlated presence/absence transitions between gene pairs. This metric can identify coevolution in profiles with low overall similarities and is not affected by similarities lacking coevolutionary information. We applied the procedure to a large collection of 60,912 orthologous gene groups (orthogroups) in 1,264 eukaryotic genomes extracted from OrthoDB. We found significant cotransition scores for 7,825 orthogroups associated in 2,401 coevolving modules linking known and unknown genes in protein complexes and biological pathways. To demonstrate the ability of the method to predict hidden gene associations, we validated through experiments the involvement of vertebrate malate synthase-like genes in the conversion of (S)-ureidoglycolate into glyoxylate and urea, the last step of purine catabolism. This identification explains the presence of glyoxylate cycle genes in metazoa and suggests an anaplerotic role of purine degradation in early eukaryotes.
Collapse
Affiliation(s)
- Elena Dembech
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma43124, Italy
| | - Marco Malatesta
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma43124, Italy
| | - Carlo De Rito
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma43124, Italy
| | - Giulia Mori
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma43124, Italy
| | - Davide Cavazzini
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma43124, Italy
| | - Andrea Secchi
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma43124, Italy
| | - Francesco Morandin
- Department of Mathematical, Physical and Computer Sciences, University of Parma, Parma43124, Italy
| | - Riccardo Percudani
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma43124, Italy
| |
Collapse
|
5
|
Zaydman MA, Little AS, Haro F, Aksianiuk V, Buchser WJ, DiAntonio A, Gordon JI, Milbrandt J, Raman AS. Defining hierarchical protein interaction networks from spectral analysis of bacterial proteomes. eLife 2022; 11:e74104. [PMID: 35976223 PMCID: PMC9427106 DOI: 10.7554/elife.74104] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2021] [Accepted: 08/17/2022] [Indexed: 11/25/2022] Open
Abstract
Cellular behaviors emerge from layers of molecular interactions: proteins interact to form complexes, pathways, and phenotypes. We show that hierarchical networks of protein interactions can be defined from the statistical pattern of proteome variation measured across thousands of diverse bacteria and that these networks reflect the emergence of complex bacterial phenotypes. Our results are validated through gene-set enrichment analysis and comparison to existing experimentally derived databases. We demonstrate the biological utility of our approach by creating a model of motility in Pseudomonas aeruginosa and using it to identify a protein that affects pilus-mediated motility. Our method, SCALES (Spectral Correlation Analysis of Layered Evolutionary Signals), may be useful for interrogating genotype-phenotype relationships in bacteria.
Collapse
Affiliation(s)
- Mark A Zaydman
- Department of Pathology and Immunology, Washington University School of MedicineSt LouisUnited States
| | | | - Fidel Haro
- Duchossois Family Institute, University of ChicagoChicagoUnited States
| | | | - William J Buchser
- Department of Genetics, Washington University School of MedicineSt LouisUnited States
| | - Aaron DiAntonio
- Department of Developmental Biology, Washington University School of MedicineSt LouisUnited States
| | - Jeffrey I Gordon
- Department of Pathology and Immunology, Washington University School of MedicineSt LouisUnited States
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of MedicineSt LouisUnited States
| | - Jeffrey Milbrandt
- Department of Genetics, Washington University School of MedicineSt LouisUnited States
| | - Arjun S Raman
- Duchossois Family Institute, University of ChicagoChicagoUnited States
- Department of Pathology, University of Chicago, ChicagoChicagoUnited States
- Center for the Physics of Evolving Systems, University of Chicago, ChicagoChicagoUnited States
| |
Collapse
|
6
|
Liu C, Kenney T, Beiko RG, Gu H. The Community Coevolution Model with Application to the Study of Evolutionary Relationships between Genes based on Phylogenetic Profiles. Syst Biol 2022:6651862. [PMID: 35904761 DOI: 10.1093/sysbio/syac052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2021] [Revised: 07/15/2022] [Accepted: 07/19/2022] [Indexed: 11/13/2022] Open
Abstract
Organismal traits can evolve in a coordinated way, with correlated patterns of gains and losses reflecting important evolutionary associations. Discovering these associations can reveal important information about the functional and ecological linkages among traits. Phylogenetic profiles treat individual genes as traits distributed across sets of genomes and can provide a fine-grained view of the genetic underpinnings of evolutionary processes in a set of genomes. Phylogenetic profiling has been used to identify genes that are functionally linked, and to identify common patterns of lateral gene transfer in microorganisms. However, comparative analysis of phylogenetic profiles and other trait distributions should take into account the phylogenetic relationships among the organisms under consideration. Here we propose the Community Coevolution Model (CCM), a new coevolutionary model to analyze the evolutionary associations among traits, with a focus on phylogenetic profiles. In the CCM, traits are considered to evolve as a community with interactions, and the transition rate for each trait depends on the current states of other traits. Surpassing other comparative methods for pairwise trait analysis, CCM has the additional advantage of being able to examine multiple traits as a community to reveal more dependency relationships. We also develop a simulation procedure to generate phylogenetic profiles with correlated evolutionary patterns that can be used as benchmark data for evaluation purposes. A simulation study demonstrates that CCM is more accurate than other methods including the Jaccard Index and three tree-aware methods. The parameterization of CCM makes the interpretation of the relations between genes more direct, which leads to Darwin's scenario being identified easily based on the estimated parameters. We show that CCM is more efficient and fits real data better than other methods resulting in higher likelihood scores with fewer parameters. An examination of 3786 phylogenetic profiles across a set of 659 bacterial genomes highlights linkages between genes with common functions, including many patterns that would not have been identified under a non-phylogenetic model of common distribution. We also applied the CCM to 44 proteins in the well-studied Mitochondrial Respiratory Complex I and recovered associations that mapped well onto the structural associations that exist in the complex.
Collapse
Affiliation(s)
- Chaoyue Liu
- Department of Mathematics and Statistics, Dalhousie University, Halifax, B3H 4R2, Canada.,Faculty of Computer Science, Dalhousie University, Halifax, B3H 4R2, Canada
| | - Toby Kenney
- Department of Mathematics and Statistics, Dalhousie University, Halifax, B3H 4R2, Canada
| | - Robert G Beiko
- Faculty of Computer Science, Dalhousie University, Halifax, B3H 4R2, Canada
| | - Hong Gu
- Department of Mathematics and Statistics, Dalhousie University, Halifax, B3H 4R2, Canada
| |
Collapse
|
7
|
Gerardos A, Dietler N, Bitbol AF. Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences. PLoS Comput Biol 2022; 18:e1010147. [PMID: 35576238 PMCID: PMC9135348 DOI: 10.1371/journal.pcbi.1010147] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2021] [Revised: 05/26/2022] [Accepted: 04/27/2022] [Indexed: 11/19/2022] Open
Abstract
Inferring protein-protein interactions from sequences is an important task in computational biology. Recent methods based on Direct Coupling Analysis (DCA) or Mutual Information (MI) allow to find interaction partners among paralogs of two protein families. Does successful inference mainly rely on correlations from structural contacts or from phylogeny, or both? Do these two types of signal combine constructively or hinder each other? To address these questions, we generate and analyze synthetic data produced using a minimal model that allows us to control the amounts of structural constraints and phylogeny. We show that correlations from these two sources combine constructively to increase the performance of partner inference by DCA or MI. Furthermore, signal from phylogeny can rescue partner inference when signal from contacts becomes less informative, including in the realistic case where inter-protein contacts are restricted to a small subset of sites. We also demonstrate that DCA-inferred couplings between non-contact pairs of sites improve partner inference in the presence of strong phylogeny, while deteriorating it otherwise. Moreover, restricting to non-contact pairs of sites preserves inference performance in the presence of strong phylogeny. In a natural data set, as well as in realistic synthetic data based on it, we find that non-contact pairs of sites contribute positively to partner inference performance, and that restricting to them preserves performance, evidencing an important role of phylogeny.
Collapse
Affiliation(s)
- Andonis Gerardos
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Nicola Dietler
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| |
Collapse
|
8
|
Deutekom ES, van Dam TJP, Snel B. Phylogenetic profiling in eukaryotes: The effect of species, orthologous group, and interactome selection on protein interaction prediction. PLoS One 2022; 17:e0251833. [PMID: 35421089 PMCID: PMC9009711 DOI: 10.1371/journal.pone.0251833] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2021] [Accepted: 03/15/2022] [Indexed: 12/04/2022] Open
Abstract
Phylogenetic profiling in eukaryotes is of continued interest to study and predict the functional relationships between proteins. This interest is likely driven by the increased number of available diverse genomes and computational methods to infer orthologies. The evaluation of phylogenetic profiles has mainly focussed on reference genome selection in prokaryotes. However, it has been proven to be challenging to obtain high prediction accuracies in eukaryotes. As part of our recent comparison of orthology inference methods for eukaryotic genomes, we observed a surprisingly high performance for predicting interacting orthologous groups. This high performance, in turn, prompted the question of what factors influence the success of phylogenetic profiling when applied to eukaryotic genomes. Here we analyse the effect of species, orthologous group and interactome selection on protein interaction prediction using phylogenetic profiles. We select species based on the diversity and quality of the genomes and compare this supervised selection with randomly generated genome subsets. We also analyse the effect on the performance of orthologous groups defined to be in the last eukaryotic common ancestor of eukaryotes to that of orthologous groups that are not. Finally, we consider the effects of reference interactome set filtering and reference interactome species. In agreement with other studies, we find an effect of genome selection based on quality, less of an effect based on genome diversity, but a more notable effect based on the amount of information contained within the genomes. Most importantly, we find it is not merely selecting the correct genomes that is important for high prediction performance. Other choices in meta parameters such as orthologous group selection, the reference species of the interaction set, and the quality of the interaction set have a much larger impact on the performance when predicting protein interactions using phylogenetic profiles. These findings shed light on the differences in reported performance amongst phylogenetic profiles approaches, and reveal on a more fundamental level for which types of protein interactions this method has most promise when applied to eukaryotes.
Collapse
Affiliation(s)
- Eva S. Deutekom
- Department of Biology, Utrecht University, Utrecht, The Netherlands
| | | | - Berend Snel
- Department of Biology, Utrecht University, Utrecht, The Netherlands
- * E-mail:
| |
Collapse
|
9
|
Fukunaga T, Iwasaki W. Inverse Potts model improves accuracy of phylogenetic profiling. Bioinformatics 2022; 38:1794-1800. [PMID: 35060594 PMCID: PMC8963296 DOI: 10.1093/bioinformatics/btac034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2021] [Revised: 01/11/2022] [Accepted: 01/13/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Phylogenetic profiling is a powerful computational method for revealing the functions of function-unknown genes. Although conventional similarity metrics in phylogenetic profiling achieved high prediction accuracy, they have two estimation biases: an evolutionary bias and a spurious correlation bias. While previous studies reduced the evolutionary bias by considering a phylogenetic tree, few studies have analyzed the spurious correlation bias. RESULTS To reduce the spurious correlation bias, we developed metrics based on the inverse Potts model (IPM) for phylogenetic profiling. We also developed a metric based on both the IPM and a phylogenetic tree. In an empirical dataset analysis, we demonstrated that these IPM-based metrics improved the prediction performance of phylogenetic profiling. In addition, we found that the integration of several metrics, including the IPM-based metrics, had superior performance to a single metric. AVAILABILITY AND IMPLEMENTATION The source code is freely available at https://github.com/fukunagatsu/Ipm. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Wataru Iwasaki
- Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 2770882, Japan,Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo 1130032, Japan,Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 2770882, Japan,Atmosphere and Ocean Research Institute, The University of Tokyo, Chiba 2770882, Japan,Institute for Quantitative Biosciences, The University of Tokyo, Tokyo 1130032, Japan,Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Tokyo 1130032, Japan
| |
Collapse
|
10
|
Fukunaga T, Iwasaki W. Mirage: estimation of ancestral gene-copy numbers by considering different evolutionary patterns among gene families. BIOINFORMATICS ADVANCES 2021; 1:vbab014. [PMID: 36700099 PMCID: PMC9710636 DOI: 10.1093/bioadv/vbab014] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/28/2021] [Revised: 07/22/2021] [Accepted: 07/28/2021] [Indexed: 01/28/2023]
Abstract
Motivation Reconstruction of gene copy number evolution is an essential approach for understanding how complex biological systems have been organized. Although various models have been proposed for gene copy number evolution, existing evolutionary models have not appropriately addressed the fact that different gene families can have very different gene gain/loss rates. Results In this study, we developed Mirage (MIxtuRe model for Ancestral Genome Estimation), which allows different gene families to have flexible gene gain/loss rates. Mirage can use three models for formulating heterogeneous evolution among gene families: the discretized Γ model, probability distribution-free model and pattern mixture (PM) model. Simulation analysis showed that Mirage can accurately estimate heterogeneous gene gain/loss rates and reconstruct gene-content evolutionary history. Application to empirical datasets demonstrated that the PM model fits genome data from various taxonomic groups better than the other heterogeneous models. Using Mirage, we revealed that metabolic function-related gene families displayed frequent gene gains and losses in all taxa investigated. Availability and implementation The source code of Mirage is freely available at https://github.com/fukunagatsu/Mirage. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
- Tsukasa Fukunaga
- Waseda Institute for Advanced Study, Waseda University, Tokyo 1690051, Japan,Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 1130032, Japan,To whom correspondence should be addressed. or
| | - Wataru Iwasaki
- Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 2770882, Japan,Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo 1130032, Japan,Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 2770882, Japan,Atmosphere and Ocean Research Institute, The University of Tokyo, Chiba 2770882, Japan,Institute for Quantitative Biosciences, The University of Tokyo, Tokyo 1130032, Japan,Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Tokyo 1130032, Japan,To whom correspondence should be addressed. or
| |
Collapse
|
11
|
Pollak S, Gralka M, Sato Y, Schwartzman J, Lu L, Cordero OX. Public good exploitation in natural bacterioplankton communities. SCIENCE ADVANCES 2021; 7:eabi4717. [PMID: 34321201 PMCID: PMC8318375 DOI: 10.1126/sciadv.abi4717] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/11/2021] [Accepted: 06/10/2021] [Indexed: 05/19/2023]
Abstract
Bacteria often interact with their environment through extracellular molecules that increase access to limiting resources. These secretions can act as public goods, creating incentives for exploiters to invade and "steal" public goods away from producers. This phenomenon has been studied extensively in vitro, but little is known about the occurrence and impact of public good exploiters in the environment. Here, we develop a genomic approach to systematically identify bacteria that can exploit public goods produced during the degradation of polysaccharides. Focusing on chitin, a highly abundant marine biopolymer, we show that public good exploiters are active in natural chitin degrading microbial communities, invading early during colonization, and potentially hindering degradation. In contrast to in vitro studies, we find that exploiters and degraders belong to distant lineages, facilitating their coexistence. Our approach opens novel avenues to use the wealth of genomic data available to infer ecological roles and interactions among microbes.
Collapse
Affiliation(s)
- Shaul Pollak
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Matti Gralka
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Yuya Sato
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Environmental Management Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), 16-1 Onogawa, Tsukuba, Ibaraki 305-8569, Japan
| | - Julia Schwartzman
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Lu Lu
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Otto X Cordero
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
| |
Collapse
|
12
|
Linard B, Ebersberger I, McGlynn SE, Glover N, Mochizuki T, Patricio M, Lecompte O, Nevers Y, Thomas PD, Gabaldón T, Sonnhammer E, Dessimoz C, Uchiyama I. Ten Years of Collaborative Progress in the Quest for Orthologs. Mol Biol Evol 2021; 38:3033-3045. [PMID: 33822172 PMCID: PMC8321534 DOI: 10.1093/molbev/msab098] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2020] [Revised: 02/07/2021] [Accepted: 04/01/2021] [Indexed: 12/19/2022] Open
Abstract
Accurate determination of the evolutionary relationships between genes is a foundational challenge in biology. Homology-evolutionary relatedness-is in many cases readily determined based on sequence similarity analysis. By contrast, whether or not two genes directly descended from a common ancestor by a speciation event (orthologs) or duplication event (paralogs) is more challenging, yet provides critical information on the history of a gene. Since 2009, this task has been the focus of the Quest for Orthologs (QFO) Consortium. The sixth QFO meeting took place in Okazaki, Japan in conjunction with the 67th National Institute for Basic Biology conference. Here, we report recent advances, applications, and oncoming challenges that were discussed during the conference. Steady progress has been made toward standardization and scalability of new and existing tools. A feature of the conference was the presentation of a panel of accessible tools for phylogenetic profiling and several developments to bring orthology beyond the gene unit-from domains to networks. This meeting brought into light several challenges to come: leveraging orthology computations to get the most of the incoming avalanche of genomic data, integrating orthology from domain to biological network levels, building better gene models, and adapting orthology approaches to the broad evolutionary and genomic diversity recognized in different forms of life and viruses.
Collapse
Affiliation(s)
- Benjamin Linard
- LIRMM, University of Montpellier, CNRS, Montpellier, France.,SPYGEN, Le Bourget-du-Lac, France
| | - Ingo Ebersberger
- Institute of Cell Biology and Neuroscience, Goethe University Frankfurt, Frankfurt, Germany.,Senckenberg Biodiversity and Climate Research Centre (S-BIKF), Frankfurt, Germany.,LOEWE Center for Translational Biodiversity Genomics (TBG), Frankfurt, Germany
| | - Shawn E McGlynn
- Earth-Life Science Institute, Tokyo Institute of Technology, Meguro, Tokyo, Japan.,Blue Marble Space Institute of Science, Seattle, WA, USA
| | - Natasha Glover
- Swiss Institute of Bioinformatics, Lausanne, Switzerland.,Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland.,Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
| | - Tomohiro Mochizuki
- Earth-Life Science Institute, Tokyo Institute of Technology, Meguro, Tokyo, Japan
| | - Mateus Patricio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
| | - Odile Lecompte
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de Médecine Translationnelle de Strasbourg, Strasbourg, France
| | - Yannis Nevers
- Swiss Institute of Bioinformatics, Lausanne, Switzerland.,Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland.,Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
| | - Paul D Thomas
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
| | - Toni Gabaldón
- Barcelona Supercomputing Centre (BCS-CNS), Jordi Girona, Barcelona, Spain.,Institute for Research in Biomedicine (IRB), The Barcelona Institute of Science and Technology (BIST), Barcelona, Spain.,Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
| | - Erik Sonnhammer
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
| | - Christophe Dessimoz
- Swiss Institute of Bioinformatics, Lausanne, Switzerland.,Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland.,Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.,Department of Computer Science, University College London, London, United Kingdom.,Department of Genetics, Evolution and Environment, University College London, London, United Kingdom
| | - Ikuo Uchiyama
- Department of Theoretical Biology, National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Aichi, Japan
| | | |
Collapse
|
13
|
Altenhoff AM, Train CM, Gilbert KJ, Mediratta I, Mendes de Farias T, Moi D, Nevers Y, Radoykova HS, Rossier V, Warwick Vesztrocy A, Glover NM, Dessimoz C. OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res 2021; 49:D373-D379. [PMID: 33174605 PMCID: PMC7779010 DOI: 10.1093/nar/gkaa1007] [Citation(s) in RCA: 121] [Impact Index Per Article: 30.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2020] [Revised: 10/10/2020] [Accepted: 10/14/2020] [Indexed: 01/11/2023] Open
Abstract
OMA is an established resource to elucidate evolutionary relationships among genes from currently 2326 genomes covering all domains of life. OMA provides pairwise and groupwise orthologs, functional annotations, local and global gene order conservation (synteny) information, among many other functions. This update paper describes the reorganisation of the database into gene-, group- and genome-centric pages. Other new and improved features are detailed, such as reporting of the evolutionarily best conserved isoforms of alternatively spliced genes, the inferred local order of ancestral genes, phylogenetic profiling, better cross-references, fast genome mapping, semantic data sharing via RDF, as well as a special coronavirus OMA with 119 viruses from the Nidovirales order, including SARS-CoV-2, the agent of the COVID-19 pandemic. We conclude with improvements to the documentation of the resource through primers, tutorials and short videos. OMA is accessible at https://omabrowser.org.
Collapse
Affiliation(s)
- Adrian M Altenhoff
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- ETH Zurich, Computer Science, Universitätstr. 6, 8092 Zurich, Switzerland
| | - Clément-Marie Train
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
| | - Kimberly J Gilbert
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Ishita Mediratta
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Department of Computer Science and Information Systems, BITS Pilani K.K. Birla Goa Campus, India
| | | | - David Moi
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Yannis Nevers
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Hale-Seda Radoykova
- Centre for Life's Origins and Evolution, Department of Genetics, Evolution and Environment, University College London, Gower St, London WC1E 6BT, United Kingdom
- Department of Computer Science, University College London, Gower St, London WC1E 6BT, United Kingdom
| | - Victor Rossier
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Alex Warwick Vesztrocy
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Natasha M Glover
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Christophe Dessimoz
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
- Centre for Life's Origins and Evolution, Department of Genetics, Evolution and Environment, University College London, Gower St, London WC1E 6BT, United Kingdom
- Department of Computer Science, University College London, Gower St, London WC1E 6BT, United Kingdom
| |
Collapse
|
14
|
Deutekom ES, Snel B, van Dam TJP. Benchmarking orthology methods using phylogenetic patterns defined at the base of Eukaryotes. Brief Bioinform 2020; 22:5906198. [PMID: 32935832 PMCID: PMC8138875 DOI: 10.1093/bib/bbaa206] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2020] [Revised: 08/10/2020] [Accepted: 08/11/2020] [Indexed: 12/26/2022] Open
Abstract
Insights into the evolution of ancestral complexes and pathways are generally achieved through careful and time-intensive manual analysis often using phylogenetic profiles of the constituent proteins. This manual analysis limits the possibility of including more protein-complex components, repeating the analyses for updated genome sets or expanding the analyses to larger scales. Automated orthology inference should allow such large-scale analyses, but substantial differences between orthologous groups generated by different approaches are observed. We evaluate orthology methods for their ability to recapitulate a number of observations that have been made with regard to genome evolution in eukaryotes. Specifically, we investigate phylogenetic profile similarity (co-occurrence of complexes), the last eukaryotic common ancestor’s gene content, pervasiveness of gene loss and the overlap with manually determined orthologous groups. Moreover, we compare the inferred orthologies to each other. We find that most orthology methods reconstruct a large last eukaryotic common ancestor, with substantial gene loss, and can predict interacting proteins reasonably well when applying phylogenetic co-occurrence. At the same time, derived orthologous groups show imperfect overlap with manually curated orthologous groups. There is no strong indication of which orthology method performs better than another on individual or all of these aspects. Counterintuitively, despite the orthology methods behaving similarly regarding large-scale evaluation, the obtained orthologous groups differ vastly from one another. Availability and implementation The data and code underlying this article are available in github and/or upon reasonable request to the corresponding author: https://github.com/ESDeutekom/ComparingOrthologies.
Collapse
Affiliation(s)
| | - Berend Snel
- Corresponding author: Berend Snel, Padualaan 8, 358CH Utrecht, The Netherlands. Tel.: +31(0)30 253 8102; E-mail:
| | | |
Collapse
|