1
|
Nesterenko L, Blassel L, Veber P, Boussau B, Jacob L. Phyloformer: Fast, Accurate, and Versatile Phylogenetic Reconstruction with Deep Neural Networks. Mol Biol Evol 2025; 42:msaf051. [PMID: 40066802 PMCID: PMC11965795 DOI: 10.1093/molbev/msaf051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2024] [Revised: 01/16/2025] [Accepted: 01/27/2025] [Indexed: 04/04/2025] Open
Abstract
Phylogenetic inference aims at reconstructing the tree describing the evolution of a set of sequences descending from a common ancestor. The high computational cost of state-of-the-art maximum likelihood and Bayesian inference methods limits their usability under realistic evolutionary models. Harnessing recent advances in likelihood-free inference and geometric deep learning, we introduce Phyloformer, a fast and accurate method for evolutionary distance estimation and phylogenetic reconstruction. Sampling many trees and sequences under an evolutionary model, we train the network to learn a function that enables predicting a tree from a multiple sequence alignment. On simulated data, we compare Phyloformer to FastME-a distance method-and two maximum likelihood methods: FastTree and IQTree. Under a commonly used model of protein sequence evolution and exploiting graphics processing unit (GPU) acceleration, Phyloformer outpaces all other approaches and exceeds their accuracy in the Kuhner-Felsenstein metric that accounts for both the topology and branch lengths. In terms of topological accuracy alone, Phyloformer outperforms FastME, but falls behind maximum likelihood approaches, especially as the number of sequences increases. When a model of sequence evolution that includes dependencies between sites is used, Phyloformer outperforms all other methods across all metrics on alignments with fewer than 80 sequences. On 3,801 empirical gene alignments from five different datasets, Phyloformer matches the topological accuracy of the two maximum likelihood implementations. Our results pave the way for the adoption of sophisticated realistic models for phylogenetic inference.
Collapse
Affiliation(s)
- Luca Nesterenko
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Luc Blassel
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Philippe Veber
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Bastien Boussau
- Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1, Villeurbanne, France
| | - Laurent Jacob
- Laboratory of Computational and Quantitative Biology, Sorbonne Université, Paris, France
| |
Collapse
|
2
|
Hermans P, Tsishyn M, Schwersensky M, Rooman M, Pucci F. Exploring Evolution to Uncover Insights Into Protein Mutational Stability. Mol Biol Evol 2025; 42:msae267. [PMID: 39786559 PMCID: PMC11721782 DOI: 10.1093/molbev/msae267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2024] [Revised: 11/27/2024] [Accepted: 11/28/2024] [Indexed: 01/12/2025] Open
Abstract
Determining the impact of mutations on the thermodynamic stability of proteins is essential for a wide range of applications such as rational protein design and genetic variant interpretation. Since protein stability is a major driver of evolution, evolutionary data are often used to guide stability predictions. Many state-of-the-art stability predictors extract evolutionary information from multiple sequence alignments of proteins homologous to a query protein, and leverage it to predict the effects of mutations on protein stability. To evaluate the power and the limitations of such methods, we used the massive amount of stability data recently obtained by deep mutational scanning to study how best to construct multiple sequence alignments and optimally extract evolutionary information from them. We tested different evolutionary models and found that, unexpectedly, independent-site models achieve similar accuracy to more complex epistatic models. A detailed analysis of the latter models suggests that their inference often results in noisy couplings, which do not appear to add predictive power over the independent-site contribution, at least in the context of stability prediction. Interestingly, by combining any of the evolutionary features with a simple structural feature, the relative solvent accessibility of the mutated residue, we achieved similar prediction accuracy to supervised, machine learning-based, protein stability change predictors. Our results provide new insights into the relationship between protein evolution and stability, and show how evolutionary information can be exploited to improve the performance of mutational stability prediction.
Collapse
Affiliation(s)
- Pauline Hermans
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
| | - Matsvei Tsishyn
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
| | - Martin Schwersensky
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
| | - Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
| |
Collapse
|
3
|
Roestel JA, Wiersema JH, Jansen RK, Borsch T, Gruenstaeudl M. On the importance of sequence alignment inspections in plastid phylogenomics - an example from revisiting the relationships of the water-lilies. Cladistics 2024; 40:469-495. [PMID: 38761095 DOI: 10.1111/cla.12584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2024] [Revised: 04/27/2024] [Accepted: 04/29/2024] [Indexed: 05/20/2024] Open
Abstract
The water-lily clade represents the second earliest-diverging branch of angiosperms. Most of its species belong to Nymphaeaceae, of which the "core Nymphaeaceae"-comprising the genera Euryale, Nymphaea and Victoria-is the most diverse clade. Despite previous molecular phylogenetic studies on the core Nymphaeaceae, various aspects of their evolutionary relationships have remained unresolved. The length-variable introns and intergenic spacers are known to contain most of the sequence variability within the water-lily plastomes. Despite the challenges with multiple sequence alignment, any new molecular phylogenetic investigation on the core Nymphaeaceae should focus on these noncoding plastome regions. For example, a new plastid phylogenomic study on the core Nymphaeaceae should generate DNA sequence alignments of all plastid introns and intergenic spacers based on the principle of conserved sequence motifs. In this investigation, we revisit the phylogenetic history of the core Nymphaeaceae by employing such an approach. Specifically, we use a plastid phylogenomic analysis strategy in which all coding and noncoding partitions are separated and then undergo software-driven DNA sequence alignment, followed by a motif-based alignment inspection and adjustment. This approach allows us to increase the reliability of the character base compared to the default practice of aligning complete plastomes through software algorithms alone. Our approach produces significantly different phylogenetic tree reconstructions for several of the plastome regions under study. The results of these reconstructions underscore that Nymphaea is paraphyletic in its current circumscription, that each of the five subgenera of Nymphaea is monophyletic, and that the subgenus Nymphaea is sister to all other subgenera of Nymphaea. Our results also clarify many evolutionary relationships within the Nymphaea subgenera Brachyceras, Hydrocallis and Nymphaea. In closing, we discuss whether the phylogenetic reconstructions obtained through our motif-based alignment adjustments are in line with morphological evidence on water-lily evolution.
Collapse
Affiliation(s)
- Jessica A Roestel
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
| | - John H Wiersema
- Department of Botany, National Museum of Natural History - Smithsonian Institution, Washington, DC, 37012, USA
| | - Robert K Jansen
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, 78712, USA
| | - Thomas Borsch
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
- Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, 14195, Berlin, Germany
| | - Michael Gruenstaeudl
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
- Department of Biological Sciences, Fort Hays State University, Hays, KS, 67601, USA
| |
Collapse
|
4
|
Farias JG, Herrera-Belén L, Jimenez L, Beltrán JF. PROTA: A Robust Tool for Protamine Prediction Using a Hybrid Approach of Machine Learning and Deep Learning. Int J Mol Sci 2024; 25:10267. [PMID: 39408595 PMCID: PMC11476296 DOI: 10.3390/ijms251910267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2024] [Revised: 09/18/2024] [Accepted: 09/23/2024] [Indexed: 10/20/2024] Open
Abstract
Protamines play a critical role in DNA compaction and stabilization in sperm cells, significantly influencing male fertility and various biotechnological applications. Traditionally, identifying these proteins is a challenging and time-consuming process due to their species-specific variability and complexity. Leveraging advancements in computational biology, we present PROTA, a novel tool that combines machine learning (ML) and deep learning (DL) techniques to predict protamines with high accuracy. For the first time, we integrate Generative Adversarial Networks (GANs) with supervised learning methods to enhance the accuracy and generalizability of protamine prediction. Our methodology evaluated multiple ML models, including Light Gradient-Boosting Machine (LIGHTGBM), Multilayer Perceptron (MLP), Random Forest (RF), eXtreme Gradient Boosting (XGBOOST), k-Nearest Neighbors (KNN), Logistic Regression (LR), Naive Bayes (NB), and Radial Basis Function-Support Vector Machine (RBF-SVM). During ten-fold cross-validation on our training dataset, the MLP model with GAN-augmented data demonstrated superior performance metrics: 0.997 accuracy, 0.997 F1 score, 0.998 precision, 0.997 sensitivity, and 1.0 AUC. In the independent testing phase, this model achieved 0.999 accuracy, 0.999 F1 score, 1.0 precision, 0.999 sensitivity, and 1.0 AUC. These results establish PROTA, accessible via a user-friendly web application. We anticipate that PROTA will be a crucial resource for researchers, enabling the rapid and reliable prediction of protamines, thereby advancing our understanding of their roles in reproductive biology, biotechnology, and medicine.
Collapse
Affiliation(s)
- Jorge G. Farias
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco 4811230, Chile; (J.G.F.); (L.J.)
| | - Lisandra Herrera-Belén
- Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Temuco 4780000, Chile;
| | - Luis Jimenez
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco 4811230, Chile; (J.G.F.); (L.J.)
| | - Jorge F. Beltrán
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco 4811230, Chile; (J.G.F.); (L.J.)
| |
Collapse
|
5
|
Becker F, Stanke M. learnMSA2: deep protein multiple alignments with large language and hidden Markov models. Bioinformatics 2024; 40:ii79-ii86. [PMID: 39230690 PMCID: PMC11373405 DOI: 10.1093/bioinformatics/btae381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/05/2024] Open
Abstract
MOTIVATION For the alignment of large numbers of protein sequences, tools are predominant that decide to align two residues using only simple prior knowledge, e.g. amino acid substitution matrices, and using only part of the available data. The accuracy of state-of-the-art programs declines with decreasing sequence identity and when increasingly large numbers of sequences are aligned. Recently, transformer-based deep-learning models started to harness the vast amount of protein sequence data, resulting in powerful pretrained language models with the main purpose of generating high-dimensional numerical representations, embeddings, for individual sites that agglomerate evolutionary, structural, and biophysical information. RESULTS We extend the traditional profile hidden Markov model so that it takes as inputs unaligned protein sequences and the corresponding embeddings. We fit the model with gradient descent using our existing differentiable hidden Markov layer. All sequences and their embeddings are jointly aligned to a model of the protein family. We report that our upgraded HMM-based aligner, learnMSA2, combined with the ProtT5-XL protein language model aligns on average almost 6% points more columns correctly than the best amino acid-based competitor and scales well with sequence number. The relative advantage of learnMSA2 over other programs tends to be greater when the sequence identity is lower and when the number of sequences is larger. Our results strengthen the evidence on the rich information contained in protein language models' embeddings and their potential downstream impact on the field of bioinformatics. Availability and implementation: https://github.com/Gaius-Augustus/learnMSA, PyPI and Bioconda, evaluation: https://github.com/felbecker/snakeMSA.
Collapse
Affiliation(s)
- Felix Becker
- Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany
| | - Mario Stanke
- Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany
| |
Collapse
|
6
|
Bryant P, Noé F. Improved protein complex prediction with AlphaFold-multimer by denoising the MSA profile. PLoS Comput Biol 2024; 20:e1012253. [PMID: 39052676 DOI: 10.1371/journal.pcbi.1012253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2023] [Revised: 08/06/2024] [Accepted: 06/14/2024] [Indexed: 07/27/2024] Open
Abstract
Structure prediction of protein complexes has improved significantly with AlphaFold2 and AlphaFold-multimer (AFM), but only 60% of dimers are accurately predicted. Here, we learn a bias to the MSA representation that improves the predictions by performing gradient descent through the AFM network. We demonstrate the performance on seven difficult targets from CASP15 and increase the average MMscore to 0.76 compared to 0.63 with AFM. We evaluate the procedure on 487 protein complexes where AFM fails and obtain an increased success rate (MMscore>0.75) of 33% on these difficult targets. Our protocol, AFProfile, provides a way to direct predictions towards a defined target function guided by the MSA. We expect gradient descent over the MSA to be useful for different tasks.
Collapse
Affiliation(s)
- Patrick Bryant
- Department of Mathematics and Informatics, Freie Universität Berlin, Germany
- The Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University, Stockholm, Sweden
- Science for Life Laboratory, Solna, Sweden
| | - Frank Noé
- Department of Mathematics and Informatics, Freie Universität Berlin, Germany
- Microsoft Research AI4Science, Berlin, Germany
| |
Collapse
|
7
|
Macaulay M, Fourment M. Differentiable phylogenetics via hyperbolic embeddings with Dodonaphy. BIOINFORMATICS ADVANCES 2024; 4:vbae082. [PMID: 39132286 PMCID: PMC11310108 DOI: 10.1093/bioadv/vbae082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/08/2024] [Revised: 05/23/2024] [Accepted: 06/05/2024] [Indexed: 08/13/2024]
Abstract
Motivation Navigating the high dimensional space of discrete trees for phylogenetics presents a challenging problem for tree optimization. To address this, hyperbolic embeddings of trees offer a promising approach to encoding trees efficiently in continuous spaces. However, they require a differentiable tree decoder to optimize the phylogenetic likelihood. We present soft-NJ, a differentiable version of neighbour joining that enables gradient-based optimization over the space of trees. Results We illustrate the potential for differentiable optimization over tree space for maximum likelihood inference. We then perform variational Bayesian phylogenetics by optimizing embedding distributions in hyperbolic space. We compare the performance of this approximation technique on eight benchmark datasets to state-of-the-art methods. Results indicate that, while this technique is not immune from local optima, it opens a plethora of powerful and parametrically efficient approach to phylogenetics via tree embeddings. Availability and implementation Dodonaphy is freely available on the web at https://www.github.com/mattapow/dodonaphy. It includes an implementation of soft-NJ.
Collapse
Affiliation(s)
- Matthew Macaulay
- Australian Institute for Microbiology & Infection, University of Technology Sydney, Ultimo, NSW 2007, Australia
| | - Mathieu Fourment
- Australian Institute for Microbiology & Infection, University of Technology Sydney, Ultimo, NSW 2007, Australia
| |
Collapse
|
8
|
Johnson SR, Peshwa M, Sun Z. Sensitive remote homology search by local alignment of small positional embeddings from protein language models. eLife 2024; 12:RP91415. [PMID: 38488154 PMCID: PMC10942778 DOI: 10.7554/elife.91415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/17/2024] Open
Abstract
Accurately detecting distant evolutionary relationships between proteins remains an ongoing challenge in bioinformatics. Search methods based on primary sequence struggle to accurately detect homology between sequences with less than 20% amino acid identity. Profile- and structure-based strategies extend sensitive search capabilities into this twilight zone of sequence similarity but require slow pre-processing steps. Recently, whole-protein and positional embeddings from deep neural networks have shown promise for providing sensitive sequence comparison and annotation at long evolutionary distances. Embeddings are generally faster to compute than profiles and predicted structures but still suffer several drawbacks related to the ability of whole-protein embeddings to discriminate domain-level homology, and the database size and search speed of methods using positional embeddings. In this work, we show that low-dimensionality positional embeddings can be used directly in speed-optimized local search algorithms. As a proof of concept, we use the ESM2 3B model to convert primary sequences directly into the 3D interaction (3Di) alphabet or amino acid profiles and use these embeddings as input to the highly optimized Foldseek, HMMER3, and HH-suite search algorithms. Our results suggest that positional embeddings as small as a single byte can provide sufficient information for dramatically improved sensitivity over amino acid sequence searches without sacrificing search speed.
Collapse
Affiliation(s)
| | | | - Zhiyi Sun
- New England Biolabs IncIpswichUnited States
| |
Collapse
|
9
|
Matthies MC, Krueger R, Torda AE, Ward M. Differentiable partition function calculation for RNA. Nucleic Acids Res 2024; 52:e14. [PMID: 38038257 PMCID: PMC10853804 DOI: 10.1093/nar/gkad1168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2023] [Revised: 10/24/2023] [Accepted: 11/28/2023] [Indexed: 12/02/2023] Open
Abstract
Ribonucleic acid (RNA) is an essential molecule in a wide range of biological functions. In 1990, McCaskill introduced a dynamic programming algorithm for computing the partition function of an RNA sequence. McCaskill's algorithm is widely used today for understanding the thermodynamic properties of RNA. In this work, we introduce a generalization of McCaskill's algorithm that is well-defined over continuous inputs. Crucially, this enables us to implement an end-to-end differentiable partition function calculation. The derivative can be computed with respect to the input, or to any other fixed values, such as the parameters of the energy model. This builds a bridge between RNA thermodynamics and the tools of differentiable programming including deep learning as it enables the partition function to be incorporated directly into any end-to-end differentiable pipeline. To demonstrate the effectiveness of our new approach, we tackle the inverse folding problem directly using gradient optimization. We find that using the gradient to optimize the sequence directly is sufficient to arrive at sequences with a high probability of folding into the desired structure. This indicates that the gradients we compute are meaningful.
Collapse
Affiliation(s)
- Marco C Matthies
- Centre for Bioinformatics, University of Hamburg, Bundesstr. 43, 20146 Hamburg, Germany
| | - Ryan Krueger
- Department of Applied Mathematics, Harvard University, 29 Oxford St, Cambridge, MA 02138, USA
| | - Andrew E Torda
- Centre for Bioinformatics, University of Hamburg, Bundesstr. 43, 20146 Hamburg, Germany
| | - Max Ward
- Department of Computer Science and Software Engineering, The University of Western Australia, 241, 35 Stirling Hwy, Crawley, WA 6009, Australia
| |
Collapse
|
10
|
Stein RA, Mchaourab HS. Rosetta Energy Analysis of AlphaFold2 models: Point Mutations and Conformational Ensembles. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.09.05.556364. [PMID: 37732281 PMCID: PMC10508732 DOI: 10.1101/2023.09.05.556364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/22/2023]
Abstract
There has been an explosive growth in the applications of AlphaFold2, and other structure prediction platforms, to accurately predict protein structures from a multiple sequence alignment (MSA) for downstream structural analysis. However, two outstanding questions persist in the field regarding the robustness of AlphaFold2 predictions of the consequences of point mutations and the completeness of its prediction of protein conformational ensembles. We combined our previously developed method SPEACH_AF with model relaxation and energetic analysis with Rosetta to address these questions. SPEACH_AF introduces residue substitutions across the MSA and not just within the input sequence. With respect to conformational ensembles, we combined SPEACH_AF and a new MSA subsampling method, AF_cluster, and for a benchmarked set of proteins, we found that the energetics of the conformational ensembles generated by AlphaFold2 correspond to those of experimental structures and explored by standard molecular dynamic methods. With respect to point mutations, we compared the structural and energetic consequences of having the mutation(s) in the input sequence versus in the whole MSA (SPEACH_AF). Both methods yielded models different from the wild-type sequence, with more robust changes when the mutation(s) were in the whole MSA. While our findings demonstrate the robustness of AlphaFold2 in analyzing point mutations and exploring conformational ensembles, they highlight the need for multi parameter structural and energetic analyses of these models to generate experimentally testable hypotheses.
Collapse
Affiliation(s)
- Richard A Stein
- Department of Molecular Physiology and Biophysics and Center for Applied AI in Protein Dynamics Vanderbilt University
| | - Hassane S Mchaourab
- Department of Molecular Physiology and Biophysics and Center for Applied AI in Protein Dynamics Vanderbilt University
| |
Collapse
|
11
|
Chandra A, Sharma A, Dehzangi I, Tsunoda T, Sattar A. PepCNN deep learning tool for predicting peptide binding residues in proteins using sequence, structural, and language model features. Sci Rep 2023; 13:20882. [PMID: 38016996 PMCID: PMC10684570 DOI: 10.1038/s41598-023-47624-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2023] [Accepted: 11/16/2023] [Indexed: 11/30/2023] Open
Abstract
Protein-peptide interactions play a crucial role in various cellular processes and are implicated in abnormal cellular behaviors leading to diseases such as cancer. Therefore, understanding these interactions is vital for both functional genomics and drug discovery efforts. Despite a significant increase in the availability of protein-peptide complexes, experimental methods for studying these interactions remain laborious, time-consuming, and expensive. Computational methods offer a complementary approach but often fall short in terms of prediction accuracy. To address these challenges, we introduce PepCNN, a deep learning-based prediction model that incorporates structural and sequence-based information from primary protein sequences. By utilizing a combination of half-sphere exposure, position specific scoring matrices from multiple-sequence alignment tool, and embedding from a pre-trained protein language model, PepCNN outperforms state-of-the-art methods in terms of specificity, precision, and AUC. The PepCNN software and datasets are publicly available at https://github.com/abelavit/PepCNN.git .
Collapse
Affiliation(s)
- Abel Chandra
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
| | - Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
| | - Iman Dehzangi
- Department of Computer Science, Rutgers University, Camden, NJ, USA
- Center for Computational and Integrative Biology, Rutgers University, Camden, USA
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
| | - Abdul Sattar
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
| |
Collapse
|
12
|
Abakarova M, Marquet C, Rera M, Rost B, Laine E. Alignment-based Protein Mutational Landscape Prediction: Doing More with Less. Genome Biol Evol 2023; 15:evad201. [PMID: 37936309 PMCID: PMC10653582 DOI: 10.1093/gbe/evad201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2023] [Revised: 10/27/2023] [Accepted: 11/01/2023] [Indexed: 11/09/2023] Open
Abstract
The wealth of genomic data has boosted the development of computational methods predicting the phenotypic outcomes of missense variants. The most accurate ones exploit multiple sequence alignments, which can be costly to generate. Recent efforts for democratizing protein structure prediction have overcome this bottleneck by leveraging the fast homology search of MMseqs2. Here, we show the usefulness of this strategy for mutational outcome prediction through a large-scale assessment of 1.5M missense variants across 72 protein families. Our study demonstrates the feasibility of producing alignment-based mutational landscape predictions that are both high-quality and compute-efficient for entire proteomes. We provide the community with the whole human proteome mutational landscape and simplified access to our predictive pipeline.
Collapse
Affiliation(s)
- Marina Abakarova
- CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), Sorbonne Université, UMR 7238, Paris 75005, France
- Université Paris Cité, INSERM UMR U1284, 75004 Paris, France
| | - Céline Marquet
- Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748 Munich, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
| | - Michael Rera
- Université Paris Cité, INSERM UMR U1284, 75004 Paris, France
| | - Burkhard Rost
- Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748 Munich, Germany
- Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, Garching, 85748 Munich, Germany
- TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
| | - Elodie Laine
- CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), Sorbonne Université, UMR 7238, Paris 75005, France
- Institut universitaire de France (IUF)
| |
Collapse
|
13
|
McWhite CD, Armour-Garb I, Singh M. Leveraging protein language models for accurate multiple sequence alignments. Genome Res 2023; 33:1145-1153. [PMID: 37414576 PMCID: PMC10538487 DOI: 10.1101/gr.277675.123] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2023] [Accepted: 06/29/2023] [Indexed: 07/08/2023]
Abstract
Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
Collapse
Affiliation(s)
- Claire D McWhite
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA;
| | - Isabel Armour-Garb
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA
- Department of Computer Science, Princeton University, Princeton, New Jersey 08544, USA
| | - Mona Singh
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA;
- Department of Computer Science, Princeton University, Princeton, New Jersey 08544, USA
| |
Collapse
|
14
|
Hou J, Wei H, Liu B. iPiDA-SWGCN: Identification of piRNA-disease associations based on Supplementarily Weighted Graph Convolutional Network. PLoS Comput Biol 2023; 19:e1011242. [PMID: 37339125 DOI: 10.1371/journal.pcbi.1011242] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2022] [Accepted: 06/05/2023] [Indexed: 06/22/2023] Open
Abstract
Accurately identifying potential piRNA-disease associations is of great importance in uncovering the pathogenesis of diseases. Recently, several machine-learning-based methods have been proposed for piRNA-disease association detection. However, they are suffering from the high sparsity of piRNA-disease association network and the Boolean representation of piRNA-disease associations ignoring the confidence coefficients. In this study, we propose a supplementarily weighted strategy to solve these disadvantages. Combined with Graph Convolutional Networks (GCNs), a novel predictor called iPiDA-SWGCN is proposed for piRNA-disease association prediction. There are three main contributions of iPiDA-SWGCN: (i) Potential piRNA-disease associations are preliminarily supplemented in the sparse piRNA-disease network by integrating various basic predictors to enrich network structure information. (ii) The original Boolean piRNA-disease associations are assigned with different relevance confidence to learn node representations from neighbour nodes in varying degrees. (iii) The experimental results show that iPiDA-SWGCN achieves the best performance compared with the other state-of-the-art methods, and can predict new piRNA-disease associations.
Collapse
Affiliation(s)
- Jialu Hou
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Hang Wei
- School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
15
|
Budzynski L, Pagnani A. Small-coupling expansion for multiple sequence alignment. Phys Rev E 2023; 107:044125. [PMID: 37198812 DOI: 10.1103/physreve.107.044125] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2022] [Accepted: 03/27/2023] [Indexed: 05/19/2023]
Abstract
The alignment of biological sequences such as DNA, RNA, and proteins, is one of the basic tools that allow to detect evolutionary patterns, as well as functional or structural characterizations between homologous sequences in different organisms. Typically, state-of-the-art bioinformatics tools are based on profile models that assume the statistical independence of the different sites of the sequences. Over the last years, it has become increasingly clear that homologous sequences show complex patterns of long-range correlations over the primary sequence as a consequence of the natural evolution process that selects genetic variants under the constraint of preserving the functional or structural determinants of the sequence. Here, we present an alignment algorithm based on message passing techniques that overcomes the limitations of profile models. Our method is based on a perturbative small-coupling expansion of the free energy of the model that assumes a linear chain approximation as the zeroth-order of the expansion. We test the potentiality of the algorithm against standard competing strategies on several biological sequences.
Collapse
Affiliation(s)
- Louise Budzynski
- DISAT, Politecnico di Torino, Corso Duca degli Abruzzi, 24, I-10129, Torino, Italy
- Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060, Candiolo, Italy
| | - Andrea Pagnani
- DISAT, Politecnico di Torino, Corso Duca degli Abruzzi, 24, I-10129, Torino, Italy
- Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060, Candiolo, Italy
- INFN, Sezione di Torino, Torino, Via Pietro Giuria, 1 10125 Torino Italy
| |
Collapse
|
16
|
Llinares-López F, Berthet Q, Blondel M, Teboul O, Vert JP. Deep embedding and alignment of protein sequences. Nat Methods 2023; 20:104-111. [PMID: 36522501 DOI: 10.1038/s41592-022-01700-2] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2021] [Accepted: 10/24/2022] [Indexed: 12/23/2022]
Abstract
Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, we show that DEDAL improves by up to two- or threefold the alignment correctness over existing methods on remote homologs and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.
Collapse
|