1. Podda M, Bonechi S, Palladino A, Scaramuzzino M, Brozzi A, Roma G, Muzzi A, Priami C, Sîrbu A, Bodini M. Classification of Neisseria meningitidis genomes with a bag-of-words approach and machine learning. iScience 2024; 27:109257. [PMID: 38439962 PMCID: PMC10910294 DOI: 10.1016/j.isci.2024.109257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Revised: 12/13/2023] [Accepted: 02/13/2024] [Indexed: 03/06/2024]
Abstract
Whole genome sequencing of bacteria is important to enable strain classification. Using entire genomes as an input to machine learning (ML) models would allow rapid classification of strains while using information from multiple genetic elements. We developed a "bag-of-words" approach that encodes entire bacterial genomes using SentencePiece or k-mer tokenization and analyzes them with ML. Initial model selection identified SentencePiece with 8,000 and 32,000 words as the best approach for genome tokenization. We then classified the capsule B group genotype in Neisseria meningitidis genomes with 99.6% accuracy and the multifactor invasive phenotype with 90.2% accuracy on an independent test set. Subsequently, in silico knockouts of 2,808 genes confirmed that the ML model predictions aligned with our current understanding of the underlying biology. To our knowledge, this is the first ML method using entire bacterial genomes to classify strains and identify genes considered relevant by the classifier.
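The k-mer variant of the bag-of-words encoding mentioned in this abstract can be illustrated with a minimal sketch. It assumes overlapping k-mers, a toy sequence, and hypothetical helper names (`kmer_bag_of_words`, `to_vector`); the paper's actual tokenization settings and its SentencePiece models are not reproduced here.

```python
from collections import Counter


def kmer_bag_of_words(genome: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a sequence; word order is discarded."""
    genome = genome.upper()
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def to_vector(bag: Counter, vocabulary: list) -> list:
    """Turn a k-mer bag into a fixed-length count vector over a vocabulary."""
    return [bag.get(word, 0) for word in vocabulary]


# Toy sequence (an assumption for illustration, not real genome data).
toy_genome = "ATGCGATGCA"
bag = kmer_bag_of_words(toy_genome, k=3)

# A shared, sorted vocabulary gives every genome the same feature order.
vocab = sorted(bag)
vector = to_vector(bag, vocab)
```

Count vectors of this kind could then be passed to any standard classifier; in practice the vocabulary would be built over all training genomes rather than a single sequence.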
Affiliation(s)
- Marco Podda
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Simone Bonechi
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
  - Department of Computer Science, University of Pisa, 56127 Pisa, Italy
- Andrea Palladino
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Alessandro Brozzi
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Guglielmo Roma
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Alessandro Muzzi
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
- Corrado Priami
  - Department of Computer Science, University of Pisa, 56127 Pisa, Italy
- Alina Sîrbu
  - Department of Computer Science, University of Pisa, 56127 Pisa, Italy
- Margherita Bodini
  - Vaccines Discovery Data Sciences, GSK Vaccines, GSK, 53100 Siena, Italy
2. Fang Y, Li M, Li X, Yang Y. GFICLEE: ultrafast tree-based phylogenetic profile method inferring gene function at the genomic-wide level. BMC Genomics 2021; 22:774. [PMID: 34715785 PMCID: PMC8557005 DOI: 10.1186/s12864-021-08070-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/16/2021] [Accepted: 10/10/2021] [Indexed: 11/25/2022]
Abstract
Background: Phylogenetic profiling is widely used to predict novel members of large protein complexes and biological pathways. Although methods combined with phylogenetic trees have significantly improved prediction accuracy, computational efficiency is still an issue that limits their genome-wide application. Results: Here we introduce a new tree-based phylogenetic profiling algorithm named GFICLEE, which infers common single and continuous loss (SCL) events in the evolutionary patterns. We validated our algorithm with human pathways from three databases and compared its computational efficiency with that of current tree-based methods on genome datasets of 10 different scales. Our algorithm has better predictive performance with high computational efficiency. Conclusions: GFICLEE is a new method for inferring gene function genome-wide. The accuracy and computational efficiency of GFICLEE make it possible to explore gene functions at the genome-wide level on a personal computer. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-08070-7.
Affiliation(s)
- Yang Fang
  - Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, People's Republic of China
- Menglong Li
  - College of Chemistry, Sichuan University, Chengdu, People's Republic of China
- Xufeng Li
  - Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, People's Republic of China
- Yi Yang
  - Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, People's Republic of China
3. Psomopoulos FE, van Helden J, Médigue C, Chasapi A, Ouzounis CA. Ancestral state reconstruction of metabolic pathways across pangenome ensembles. Microb Genom 2021; 6. [PMID: 32924924 PMCID: PMC7725326 DOI: 10.1099/mgen.0.000429] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
As genome sequencing efforts are unveiling the genetic diversity of the biosphere with an unprecedented speed, there is a need to accurately describe the structural and functional properties of groups of extant species whose genomes have been sequenced, as well as their inferred ancestors, at any given taxonomic level of their phylogeny. Elaborate approaches for the reconstruction of ancestral states at the sequence level have been developed, subsequently augmented by methods based on gene content. While these approaches of sequence or gene-content reconstruction have been successfully deployed, there has been less progress on the explicit inference of functional properties of ancestral genomes, in terms of metabolic pathways and other cellular processes. Herein, we describe PathTrace, an efficient algorithm for parsimony-based reconstructions of the evolutionary history of individual metabolic pathways, pivotal representations of key functional modules of cellular function. The algorithm is implemented as a five-step process through which pathways are represented as fuzzy vectors, where each enzyme is associated with a taxonomic conservation value derived from the phylogenetic profile of its protein sequence. The method is evaluated with a selected benchmark set of pathways against collections of genome sequences from key data resources. By deploying a pangenome-driven approach for pathway sets, we demonstrate that the inferred patterns are largely insensitive to noise, as opposed to gene-content reconstruction methods. In addition, the resulting reconstructions are closely correlated with the evolutionary distance of the taxa under study, suggesting that a diligent selection of target pangenomes is essential for maintaining cohesiveness of the method and consistency of the inference, serving as an internal control for an arbitrary selection of queries. 
The PathTrace method is a first step towards the large-scale analysis of metabolic pathway evolution and our deeper understanding of functional relationships reflected in emerging pangenome collections.
Affiliation(s)
- Fotis E Psomopoulos
  - Institute of Applied Biosciences (INAB), Center for Research & Technology Hellas (CERTH), GR-57001 Thessalonica, Greece
- Jacques van Helden
  - Lab. Technological Advances for Genomics & Clinics (TAGC), Université d'Aix-Marseille (AMU), INSERM Unit U1090, 163, Avenue de Luminy, 13288 Marseille cedex 09, France
- Claudine Médigue
  - UMR 8030, CNRS, Université Evry-Val-d'Essonne, CEA, Institut de Biologie François Jacob - Genoscope, Laboratoire d'Analyses Bioinformatiques pour la Génomique et le Métabolisme, Evry, France
- Anastasia Chasapi
  - Biological Computation & Process Laboratory (BCPL), Chemical Process & Energy Resources Institute (CPERI), Center for Research & Technology Hellas (CERTH), GR-57001 Thessalonica, Greece
- Christos A Ouzounis
  - Biological Computation & Process Laboratory (BCPL), Chemical Process & Energy Resources Institute (CPERI), Center for Research & Technology Hellas (CERTH), GR-57001 Thessalonica, Greece
4. Linard B, Ebersberger I, McGlynn SE, Glover N, Mochizuki T, Patricio M, Lecompte O, Nevers Y, Thomas PD, Gabaldón T, Sonnhammer E, Dessimoz C, Uchiyama I. Ten Years of Collaborative Progress in the Quest for Orthologs. Mol Biol Evol 2021; 38:3033-3045. [PMID: 33822172 PMCID: PMC8321534 DOI: 10.1093/molbev/msab098] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 10/27/2020] [Revised: 02/07/2021] [Accepted: 04/01/2021] [Indexed: 12/19/2022]
Abstract
Accurate determination of the evolutionary relationships between genes is a foundational challenge in biology. Homology (evolutionary relatedness) is in many cases readily determined based on sequence similarity analysis. By contrast, determining whether two genes descended directly from a common ancestor by a speciation event (orthologs) or a duplication event (paralogs) is more challenging, yet provides critical information on the history of a gene. Since 2009, this task has been the focus of the Quest for Orthologs (QFO) Consortium. The sixth QFO meeting took place in Okazaki, Japan in conjunction with the 67th National Institute for Basic Biology conference. Here, we report recent advances, applications, and oncoming challenges that were discussed during the conference. Steady progress has been made toward standardization and scalability of new and existing tools. A feature of the conference was the presentation of a panel of accessible tools for phylogenetic profiling and several developments to bring orthology beyond the gene unit, from domains to networks. This meeting brought to light several challenges to come: leveraging orthology computations to get the most out of the incoming avalanche of genomic data, integrating orthology from domain to biological network levels, building better gene models, and adapting orthology approaches to the broad evolutionary and genomic diversity recognized in different forms of life and viruses.
Affiliation(s)
- Benjamin Linard
  - LIRMM, University of Montpellier, CNRS, Montpellier, France
  - SPYGEN, Le Bourget-du-Lac, France
- Ingo Ebersberger
  - Institute of Cell Biology and Neuroscience, Goethe University Frankfurt, Frankfurt, Germany
  - Senckenberg Biodiversity and Climate Research Centre (S-BIKF), Frankfurt, Germany
  - LOEWE Center for Translational Biodiversity Genomics (TBG), Frankfurt, Germany
- Shawn E McGlynn
  - Earth-Life Science Institute, Tokyo Institute of Technology, Meguro, Tokyo, Japan
  - Blue Marble Space Institute of Science, Seattle, WA, USA
- Natasha Glover
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Tomohiro Mochizuki
  - Earth-Life Science Institute, Tokyo Institute of Technology, Meguro, Tokyo, Japan
- Mateus Patricio
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
- Odile Lecompte
  - Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de Médecine Translationnelle de Strasbourg, Strasbourg, France
- Yannis Nevers
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Paul D Thomas
  - Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
- Toni Gabaldón
  - Barcelona Supercomputing Centre (BSC-CNS), Jordi Girona, Barcelona, Spain
  - Institute for Research in Biomedicine (IRB), The Barcelona Institute of Science and Technology (BIST), Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
- Erik Sonnhammer
  - Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
- Christophe Dessimoz
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Department of Computer Science, University College London, London, United Kingdom
  - Department of Genetics, Evolution and Environment, University College London, London, United Kingdom
- Ikuo Uchiyama
  - Department of Theoretical Biology, National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Aichi, Japan
5. Sangphukieo A, Laomettachit T, Ruengjitchatchawalya M. PhotoModPlus: A web server for photosynthetic protein prediction from genome neighborhood features. PLoS One 2021; 16:e0248682. [PMID: 33730083 PMCID: PMC7968678 DOI: 10.1371/journal.pone.0248682] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2020] [Accepted: 03/03/2021] [Indexed: 11/20/2022]
Abstract
A new web server called PhotoModPlus is presented as a platform for predicting photosynthetic proteins via genome neighborhood networks (GNN) and genome neighborhood-based machine learning. GNN enables users to visualize the overview of the conserved neighboring genes from multiple photosynthetic prokaryotic genomes and provides functional guidance on the query input. In the platform, we also present a new machine learning model utilizing genome neighborhood features for predicting photosynthesis-specific functions based on 24 prokaryotic photosynthesis-related GO terms, namely PhotoModGO. The new model performed better than the sequence-based approaches with an F1 measure of 0.872, based on nested five-fold cross-validation. Finally, we demonstrated the applications of the webserver and the new model in the identification of novel photosynthetic proteins. The server is user-friendly, compatible with all devices, and available at bicep.kmutt.ac.th/photomod.
Affiliation(s)
- Apiwat Sangphukieo
  - Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi (KMUTT), Bang Khun Thian, Bangkok, Thailand
  - School of Information Technology, KMUTT, Thung Khru, Bangkok, Thailand
- Teeraphan Laomettachit
  - Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi (KMUTT), Bang Khun Thian, Bangkok, Thailand
- Marasri Ruengjitchatchawalya
  - Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi (KMUTT), Bang Khun Thian, Bangkok, Thailand
  - Biotechnology Program, School of Bioresources and Technology, KMUTT, Bang Khun Thian, Bangkok, Thailand
  - Algal Biotechnology Research Group, Pilot Plant Development and Training Institute, KMUTT, Bang Khun Thian, Bangkok, Thailand
6. Ouzounis CA. Developing computational biology at meridian 23° E, and a little eastwards. J Biol Res (Thessalon) 2018; 25:18. [PMID: 30460210 PMCID: PMC6237004 DOI: 10.1186/s40709-018-0091-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/30/2018] [Accepted: 11/09/2018] [Indexed: 11/23/2022]
Abstract
Modern biology is experiencing a deep transformation through the expansion of molecular-level measurements at all scales, using omics technologies. A key element in this transformation is the field of bioinformatics, which has meanwhile permeated pretty much all of biological and biomedical research and is now emerging as a key interdisciplinary area that connects the natural sciences, chemical and electrical engineering, science education and science policy, on a number of science and technology fronts. The strong tradition of open access for large volumes of raw data, collections of complex results and high-quality algorithm implementations in bioinformatics makes the field a unique, special case of open science. We report on our recent research activities, the development of training initiatives in the wider region during the past years, and the lessons learned regarding our efforts away from major epicenters, within the general context of open science.
Affiliation(s)
- Christos A Ouzounis
  - Biological Computation & Process Laboratory (BCPL), Chemical Process & Energy Resources Institute (CPERI), Centre for Research & Technology Hellas (CERTH), PO Box 361, 57001 Thessaloníki, Greece
7. Psomopoulos FE, Vitsios DM, Baichoo S, Ouzounis CA. BioPAXViz: a cytoscape application for the visual exploration of metabolic pathway evolution. Bioinformatics 2018; 33:1418-1420. [PMID: 28453679 DOI: 10.1093/bioinformatics/btw813] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 07/19/2016] [Accepted: 01/06/2017] [Indexed: 11/12/2022]
Abstract
Summary: BioPAXViz is a Cytoscape (version 3) application, providing a comprehensive framework for metabolic pathway visualization. Beyond the basic parsing, viewing and browsing roles, the main novel function that BioPAXViz provides is a visual comparative analysis of metabolic pathway topologies across pre-computed pathway phylogenomic profiles given a species phylogeny. Furthermore, BioPAXViz supports the display of hierarchical trees that allow efficient navigation through sets of variants of a single reference pathway. Thus, BioPAXViz can significantly facilitate, and contribute to, the study of metabolic pathway evolution and engineering. Availability and Implementation: BioPAXViz has been developed as a Cytoscape app and is available at: https://github.com/CGU-CERTH/BioPAX.Viz. The software is distributed under the MIT License and is accompanied by example files and data. Additional documentation is available at the aforementioned GitHub repository. Contact: ouzounis@certh.gr.
Affiliation(s)
- Fotis E Psomopoulos
  - Computational Genomics Unit, Institute of Applied Biosciences, Center for Research & Technology Hellas (CERTH), GR-57001 Thessalonica, Greece
- Dimitrios M Vitsios
  - Computational Genomics Unit, Institute of Applied Biosciences, Center for Research & Technology Hellas (CERTH), GR-57001 Thessalonica, Greece
  - The European Bioinformatics Institute, EMBL Cambridge Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Shakuntala Baichoo
  - Department of Computer Science & Engineering, Faculty of Engineering, University of Mauritius, Reduit 80837, Mauritius
- Christos A Ouzounis
  - Computational Genomics Unit, Institute of Applied Biosciences, Center for Research & Technology Hellas (CERTH), GR-57001 Thessalonica, Greece
  - Donnelly Centre for Cellular & Biomolecular Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada
8. Niu Y, Liu C, Moghimyfiroozabad S, Yang Y, Alavian KN. PrePhyloPro: phylogenetic profile-based prediction of whole proteome linkages. PeerJ 2017; 5:e3712. [PMID: 28875072 PMCID: PMC5578374 DOI: 10.7717/peerj.3712] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 04/13/2017] [Accepted: 07/28/2017] [Indexed: 02/05/2023]
Abstract
Direct and indirect functional links between proteins as well as their interactions as part of larger protein complexes or common signaling pathways may be predicted by analyzing the correlation of their evolutionary patterns. Based on phylogenetic profiling, here we present a highly scalable and time-efficient computational framework for predicting linkages within the whole human proteome. We have validated this method through analysis of 3,697 human pathways and molecular complexes and a comparison of our results with the prediction outcomes of previously published co-occurrency model-based and normalization methods. Here we also introduce PrePhyloPro, a web-based software that uses our method for accurately predicting proteome-wide linkages. We present data on interactions of human mitochondrial proteins, verifying the performance of this software. PrePhyloPro is freely available at http://prephylopro.org/phyloprofile/.
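The general idea of linking proteins by the correlation of their evolutionary patterns can be sketched with binary presence/absence profiles. The example below is only an illustration under stated assumptions: the gene names and profiles are invented, and plain Pearson correlation stands in for whatever scoring PrePhyloPro actually uses, which the abstract does not specify.

```python
import numpy as np

# Hypothetical presence (1) / absence (0) of four genes across six genomes.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1, 0],  # same pattern as geneA
    "geneC": [0, 0, 1, 1, 0, 1],  # exact complement of geneA
    "geneD": [1, 0, 1, 0, 1, 1],  # unrelated pattern
}

names = list(profiles)
matrix = np.array([profiles[name] for name in names], dtype=float)

# Rows are genes, so np.corrcoef returns a gene-by-gene Pearson matrix.
corr = np.corrcoef(matrix)


def linkage_score(gene_x: str, gene_y: str) -> float:
    """Look up the profile correlation between two genes."""
    return float(corr[names.index(gene_x), names.index(gene_y)])
```

Genes with identical profiles score close to 1 and are candidate functional partners, while complementary or unrelated profiles score at or below 0.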
Affiliation(s)
- Yulong Niu
  - Department of Medicine, Division of Brain Sciences, Imperial College London, London, United Kingdom
  - Key Lab of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan, China
  - School of Medicine, Department of Internal Medicine, Endocrinology, Yale University, New Haven, CT, United States of America
- Chengcheng Liu
  - Department of Periodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Yi Yang
  - Key Lab of Bio-resources and Eco-environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan, China
- Kambiz N Alavian
  - Department of Medicine, Division of Brain Sciences, Imperial College London, London, United Kingdom
  - School of Medicine, Department of Internal Medicine, Endocrinology, Yale University, New Haven, CT, United States of America
  - Department of Biology, The Bahá'í Institute for Higher Education (BIHE), Tehran, Iran
9. Nagy LG, Riley R, Bergmann PJ, Krizsán K, Martin FM, Grigoriev IV, Cullen D, Hibbett DS. Genetic Bases of Fungal White Rot Wood Decay Predicted by Phylogenomic Analysis of Correlated Gene-Phenotype Evolution. Mol Biol Evol 2016; 34:35-44. [DOI: 10.1093/molbev/msw238] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Indexed: 12/26/2022]
10. Franceschini A, Lin J, von Mering C, Jensen LJ. SVD-phy: improved prediction of protein functional associations through singular value decomposition of phylogenetic profiles. Bioinformatics 2015; 32:1085-7. [PMID: 26614125 PMCID: PMC4896368 DOI: 10.1093/bioinformatics/btv696] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.1] [Received: 06/05/2015] [Accepted: 11/24/2015] [Indexed: 11/15/2022]
Abstract
Summary: A successful approach for predicting functional associations between non-homologous genes is to compare their phylogenetic distributions. We have devised a phylogenetic profiling algorithm, SVD-Phy, which uses truncated singular value decomposition to address the problem of uninformative profiles giving rise to false positive predictions. Benchmarking the algorithm against the KEGG pathway database, we found that it has substantially improved performance over existing phylogenetic profiling methods. Availability and implementation: The software is available under the open-source BSD license at https://bitbucket.org/andrea/svd-phy. Contact: lars.juhl.jensen@cpr.ku.dk. Supplementary information: Supplementary data are available at Bioinformatics online.
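Truncated singular value decomposition of a phylogenetic profile matrix can be illustrated with a small NumPy sketch. The toy matrix, the choice of two components, the function names, and the use of cosine similarity to compare genes are all assumptions made for illustration; this is not the published SVD-Phy procedure.

```python
import numpy as np


def truncated_profile_embedding(profiles: np.ndarray, n_components: int) -> np.ndarray:
    """Project gene profiles (rows) onto the leading singular components."""
    u, s, _ = np.linalg.svd(profiles, full_matrices=False)
    # Keep only the first n_components columns, scaled by their singular values.
    return u[:, :n_components] * s[:n_components]


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Hypothetical profiles: three genes (rows) across four genomes (columns).
toy_profiles = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
    ],
    dtype=float,
)

embedding = truncated_profile_embedding(toy_profiles, n_components=2)
sim_ab = cosine_similarity(embedding[0], embedding[1])
sim_ac = cosine_similarity(embedding[0], embedding[2])
```

On real data, dropping the trailing components is what discards low-variance, uninformative signal; the number of components to keep would have to be tuned against a benchmark.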
Affiliation(s)
- Andrea Franceschini
  - Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, Zurich, 8057, Switzerland
  - Swiss Institute of Bioinformatics, Quartier Sorge, Bâtiment Génopode, Lausanne, 1015, Switzerland
- Jianyi Lin
  - Department of Computer Science, University of Milan, via Comelico 39, Milan, 20135, Italy
- Christian von Mering
  - Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, Zurich, 8057, Switzerland
  - Swiss Institute of Bioinformatics, Quartier Sorge, Bâtiment Génopode, Lausanne, 1015, Switzerland
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen N, 2200, Denmark