1
Asim MN, Asif T, Hassan F, Dengel A. Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models. Database (Oxford) 2025; 2025:baaf027. [PMID: 40448683 DOI: 10.1093/database/baaf027] [Received: 09/28/2024] [Revised: 02/06/2025] [Accepted: 03/26/2025] [Indexed: 06/02/2025]
Abstract
Protein sequence analysis examines the order of amino acids within protein sequences to unlock a wealth of knowledge about biological processes and genetic disorders. It helps in forecasting disease susceptibility by finding unique protein signatures, or biomarkers, that are linked to particular disease states. Protein sequence analysis through wet-lab experiments is expensive, time-consuming, and error-prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving to use AI to move from wet-lab to computer-aided applications. However, proteomics and AI are two distinct fields, and developing AI-driven protein sequence analysis applications requires knowledge of both domains. Various review articles have been written to bridge the gap between the two fields. However, these articles focus on a few individual tasks or specific applications rather than providing a comprehensive overview of a wide range of tasks and applications. To address the need for comprehensive literature that presents a holistic view of a wide array of tasks and applications, this manuscript makes several contributions. It bridges the gap between the proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers with the biological foundations of these 63 tasks. It supports the development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. It presents a rich data landscape, encompassing 627 benchmark datasets for the 63 tasks. It highlights the use of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. It accelerates the development of AI-driven applications by reporting current state-of-the-art performance across the 63 protein sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Tayyaba Asif
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Faiza Hassan
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
2
Kim S, Kim MA, Kim B, Lee J, Jung SK, Kim J, Chung HY, Lee CY, Jeong S. Machine learning assessment of zoonotic potential in avian influenza viruses using PB2 segment. BMC Genomics 2025; 26:395. [PMID: 40269678 PMCID: PMC12020041 DOI: 10.1186/s12864-025-11589-8] [Received: 01/16/2025] [Accepted: 04/09/2025] [Indexed: 04/25/2025] [Open Access]
Abstract
BACKGROUND Influenza A virus (IAV) is a major global health threat, causing seasonal epidemics and occasional pandemics. Particularly, Influenza A viruses from avian species pose significant zoonotic threats, with PB2 adaptation serving as a critical first step in cross-species transmission. A comprehensive risk assessment framework based on PB2 sequences is necessary, which should encompass detailed analyses of specific residues and mutations while maintaining sufficient generality for application to non-PB2 segments. RESULTS In this study, we developed two complementary approaches: a regression-based model for accurately distinguishing among risk groups, and a SHAP-based risk assessment model for more meaningful risk analyses. For the regression-based risk models, we compared various methodologies, including tree ensemble methods, conventional regression models, and deep learning architectures. The optimized regression model, combined with SHAP value analysis, identified and ranked individual residues contributing to zoonotic potential. The SHAP-based risk model enabled intra-class analyses within the zoonotic risk assessment framework and quantified risk yields from specific mutations. CONCLUSION Experimental analyses demonstrated that the Random Forest regression model outperformed other models in most cases, and we validated the target value settings for risk regression through ablation studies. Our SHAP-based analysis identified key residues (271A, 627K, 591R, 588A, 292I, 684S, 684A, 81M, 199S, and 368Q) and mutations (T271A, Q368R/K, E627K, Q591R, A588T/I/V, and I292V/T) critical for zoonotic risk assessment. Using the SHAP-based risk assessment model, we found that influenza A viruses from Phasianidae showed elevated zoonotic risk scores compared to those from other avian species. Additionally, mutations I292V/T, Q368R, A588T/I, V598A/I/T, and E/V627K were identified as significant mutations in the Phasianidae. 
These PB2-focused quantitative methods provide a robust and generalizable framework for both rapid screening of avians' zoonotic potential and analytical quantification of risks associated with specific residues or mutations.
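The mutation shorthand used in this abstract (for example `E627K`, `A588T/I/V`, or `E/V627K`) packs several substitutions into one token. A minimal sketch of how such strings could be expanded into individual (original residue, position, new residue) triples is shown below. The function name `expand_mutation` and the parsing rule are illustrative assumptions, not part of the authors' pipeline.

```python
import re

# Assumed shorthand: one or more original residues separated by "/",
# a position number, then one or more substituted residues separated by "/".
_MUTATION = re.compile(r"^([A-Z](?:/[A-Z])*)(\d+)([A-Z](?:/[A-Z])*)$")


def expand_mutation(token):
    """Expand shorthand such as 'A588T/I/V' into single substitutions."""
    match = _MUTATION.match(token)
    if match is None:
        raise ValueError(f"unrecognized mutation token: {token!r}")
    originals, position, substitutes = match.groups()
    # Every listed original residue is paired with every listed substitute.
    return [
        (old, int(position), new)
        for old in originals.split("/")
        for new in substitutes.split("/")
    ]
```

Under these assumptions, `expand_mutation("E/V627K")` yields one triple for E to K and one for V to K, both at position 627, which makes the shorthand easier to join against per-residue importance scores.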
Affiliation(s)
- Sangwook Kim
  - Bio-medical Research Institute, Kyungpook National University Hospital, Daegu, South Korea
- Min-Ah Kim
  - Department of Microbiology, School of Medicine, Kyungpook National University, Daegu, South Korea
- Bitgoeul Kim
  - Department of Microbiology, School of Medicine, Kyungpook National University, Daegu, South Korea
- Jisu Lee
  - Department of Microbiology, School of Medicine, Kyungpook National University, Daegu, South Korea
- Se-Kyung Jung
  - Department of Microbiology, School of Medicine, Kyungpook National University, Daegu, South Korea
- Jonghong Kim
  - Department of Neurology, Keimyung University Dongsan Medical Center, Daegu, South Korea
- Ho-Young Chung
  - Department of Medical Informatics, School of Medicine, Kyungpook National University, Daegu, South Korea
- Chung-Young Lee
  - Department of Microbiology, School of Medicine, Kyungpook National University, Daegu, South Korea
  - Untreatable Infectious Disease Institute, Kyungpook National University, Daegu, South Korea
- Sungmoon Jeong
  - Department of Medical Informatics, School of Medicine, Kyungpook National University, Daegu, South Korea
  - Research Center for Artificial Intelligence in Medicine, Kyungpook National University Hospital, Daegu, South Korea
3
Li J, Chen X, Huang H, Zeng M, Yu J, Gong X, Ye Q. $\mathcal{S}$able: bridging the gap in protein structure understanding with an empowering and versatile pre-training paradigm. Brief Bioinform 2025; 26:bbaf120. [PMID: 40163822 PMCID: PMC11957296 DOI: 10.1093/bib/bbaf120] [Received: 10/29/2024] [Revised: 01/23/2025] [Accepted: 02/23/2025] [Indexed: 04/02/2025] [Open Access]
Abstract
Protein pre-training has emerged as a transformative approach for solving diverse biological tasks. While many contemporary methods focus on sequence-based language models, recent findings highlight that protein sequences alone are insufficient to capture the extensive information inherent in protein structures. Recognizing the crucial role of protein structure in defining function and interactions, we introduce $\mathcal{S}$able, a versatile pre-training model designed to comprehensively understand protein structures. $\mathcal{S}$able incorporates a novel structural encoding mechanism that enhances inter-atomic information exchange and spatial awareness, combined with robust pre-training strategies and lightweight decoders optimized for specific downstream tasks. This approach enables $\mathcal{S}$able to consistently outperform existing methods in tasks such as generation, classification, and regression, demonstrating its superior capability in protein structure representation. The code and models can be accessed via GitHub repository at https://github.com/baaihealth/Sable.
Affiliation(s)
- Jiashan Li
  - Institute for Mathematical Sciences, Renmin University of China, 59 Zhongguancun Street, Beijing 100872, China
- Xi Chen
  - Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
- He Huang
  - Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
- Mingliang Zeng
  - Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
- Jingcheng Yu
  - Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
- Xinqi Gong
  - Institute for Mathematical Sciences, Renmin University of China, 59 Zhongguancun Street, Beijing 100872, China
- Qiwei Ye
  - Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
4
Yang D, Peng X, Zheng S, Peng S. Deep learning-based prediction of autoimmune diseases. Sci Rep 2025; 15:4576. [PMID: 39920178 PMCID: PMC11806040 DOI: 10.1038/s41598-025-88477-4] [Received: 04/01/2024] [Accepted: 01/28/2025] [Indexed: 02/09/2025] [Open Access]
Abstract
Autoimmune diseases are a complex group of diseases caused by the immune system mistakenly attacking body tissues. Their etiology involves multiple factors such as genetics, environmental factors, and abnormalities in immune cells, making prediction and treatment challenging. T cells, a core component of the human immune system, have a significant impact on the pathogenesis of autoimmune diseases. Several studies have demonstrated that T-cell receptors (TCRs) may be involved in the pathogenesis of various autoimmune diseases, which provides strong theoretical support and new therapeutic targets for the prediction and treatment of autoimmune diseases. This study focuses on the prediction of several autoimmune diseases mediated by T cells and proposes two models: the AutoY model, based on convolutional neural networks, and the LSTMY model, a bidirectional LSTM network that integrates the attention mechanism. Experimental results show that both models perform well in the prediction of the four autoimmune diseases, with the AutoY model performing slightly better. In particular, the average area under the ROC curve (AUC) of the AutoY model exceeded 0.93 for all the diseases, and the AUC reached 0.99 for two diseases, type 1 diabetes and multiple sclerosis. These results demonstrate the high accuracy, stability, and good generalization ability of the two models, which makes them promising tools for autoimmune disease prediction and supports the use of the TCR bank for the noninvasive detection of autoimmune diseases.
Affiliation(s)
- Donghong Yang
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Xin Peng
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Senlin Zheng
  - Third Institute of Oceanography, Ministry of Natural Resources, Xiamen, 361005, China
- Shenglan Peng
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
5
Wang D, Pourmirzaei M, Abbas UL, Zeng S, Manshour N, Esmaili F, Poudel B, Jiang Y, Shao Q, Chen J, Xu D. S-PLM: Structure-Aware Protein Language Model via Contrastive Learning Between Sequence and Structure. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2025; 12:e2404212. [PMID: 39665266 PMCID: PMC11791933 DOI: 10.1002/advs.202404212] [Received: 04/20/2024] [Revised: 08/21/2024] [Indexed: 12/13/2024]
Abstract
Proteins play an essential role in various biological and engineering processes. Large protein language models (PLMs) present excellent potential to reshape protein research by accelerating the determination of protein functions and the design of proteins with the desired functions. The prediction and design capacity of PLMs relies on the representation gained from the protein sequences. However, the lack of crucial 3D structure information in most PLMs restricts the prediction capacity of PLMs in various applications, especially those heavily dependent on 3D structures. To address this issue, S-PLM is introduced as a 3D structure-aware PLM that utilizes multi-view contrastive learning to align the sequence and 3D structure of a protein in a coordinated latent space. S-PLM applies Swin-Transformer on AlphaFold-predicted protein structures to embed the structural information and fuses it into sequence-based embedding from ESM2. Additionally, a library of lightweight tuning tools is provided to adapt S-PLM for diverse downstream protein prediction tasks. The results demonstrate S-PLM's superior performance over sequence-only PLMs on all protein clustering and classification tasks, achieving competitiveness comparable to state-of-the-art methods requiring both sequence and structure inputs. S-PLM and its lightweight tuning tools are available at https://github.com/duolinwang/S-PLM/.
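The abstract describes aligning sequence and structure embeddings with contrastive learning but does not state the loss. One common choice for this kind of paired alignment is a symmetric InfoNCE objective, in which each protein's sequence embedding should score highest against its own structure embedding and vice versa. The sketch below assumes that form, a temperature of 0.1, and plain Python lists as embeddings; none of these details are taken from S-PLM itself.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_info_nce(seq_embeddings, struct_embeddings, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings (assumed form)."""
    seq = [_normalize(v) for v in seq_embeddings]
    struct = [_normalize(v) for v in struct_embeddings]
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in struct]
        for s in seq
    ]
    transposed = [list(column) for column in zip(*logits)]
    # Average the sequence-to-structure and structure-to-sequence directions.
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))
```

With this objective the loss is small when matching pairs are the most similar entries in the batch and large when a protein's sequence embedding is closer to another protein's structure embedding.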
Affiliation(s)
- Duolin Wang
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Mahdi Pourmirzaei
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Usman L. Abbas
  - Chemical & Materials Engineering, University of Kentucky, Lexington, KY 40506, USA
- Shuai Zeng
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Negin Manshour
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Farzaneh Esmaili
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Biplab Poudel
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Yuexu Jiang
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Qing Shao
  - Chemical & Materials Engineering, University of Kentucky, Lexington, KY 40506, USA
- Jin Chen
  - Department of Medicine and Department of Biomedical Informatics and Data Science, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Dong Xu
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
6
Wang M, Wang J, Ji J, Ma C, Wang H, He J, Song Y, Zhang X, Cao Y, Dai Y, Hua M, Qin R, Li K, Cao L. Improving compound-protein interaction prediction by focusing on intra-modality and inter-modality dynamics with a multimodal tensor fusion strategy. Comput Struct Biotechnol J 2024; 23:3714-3729. [PMID: 39525082 PMCID: PMC11544084 DOI: 10.1016/j.csbj.2024.10.004] [Received: 06/13/2024] [Revised: 10/01/2024] [Accepted: 10/01/2024] [Indexed: 11/16/2024] [Open Access]
Abstract
Identifying novel compound-protein interactions (CPIs) plays a pivotal role in target identification and drug discovery. Although the recent multimodal methods have achieved outstanding advances in CPI prediction, they fail to effectively learn both intra-modality and inter-modality dynamics, which limits their prediction performance. To address the limitation, we propose a novel multimodal tensor fusion CPI prediction framework, named MMTF-CPI, which contains three unimodal learning modules for structure, heterogeneous network and transcriptional profiling modalities, a tensor fusion module and a prediction module. MMTF-CPI is capable of focusing on both intra-modality and inter-modality dynamics with the tensor fusion module. We demonstrated that MMTF-CPI is superior to multiple state-of-the-art multimodal methods across seven datasets. The prediction performance of MMTF-CPI is significantly improved with the tensor fusion module compared to other fusion methods. Moreover, our case studies confirmed the practical value of MMTF-CPI in target identification. Via MMTF-CPI, we also discovered several candidate compounds for the therapy of breast cancer and non-small cell lung cancer.
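The abstract does not spell out the fusion operator. A widely used form of tensor fusion takes the outer product of the unimodal embeddings after appending a constant 1 to each, so that the fused tensor contains unimodal terms (intra-modality) as well as pairwise and three-way products (inter-modality). The sketch below assumes that form; the function name and the toy vectors standing in for the structure, heterogeneous network, and transcriptional profile embeddings are hypothetical.

```python
from itertools import product


def tensor_fusion(*embeddings):
    """Flattened outer product of embeddings, each extended with a constant 1.

    Assumed formulation: the appended 1 keeps lower-order terms, so the result
    holds each modality alone as well as every cross-modality product.
    """
    extended = [list(embedding) + [1.0] for embedding in embeddings]
    fused = []
    for combination in product(*extended):
        value = 1.0
        for component in combination:
            value *= component
        fused.append(value)
    return fused
```

For embeddings of sizes 2, 3, and 4 the fused vector has (2+1)(3+1)(4+1) = 60 entries, which illustrates why this style of fusion is usually followed by a compact prediction module.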
Affiliation(s)
- Meng Wang
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Jianmin Wang
  - Department of Integrative Biotechnology, Yonsei University, Incheon 21983, South Korea
- Jianxin Ji
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Chenjing Ma
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Hesong Wang
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Jia He
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Yongzhen Song
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Xuan Zhang
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Yong Cao
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Yanyan Dai
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Menglei Hua
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Ruihao Qin
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Kang Li
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
- Lei Cao
  - Department of Biostatistics, Harbin Medical University, Harbin 150081, China
7
Qin X, Zhang L, Liu M, Liu G. PRFold-TNN: Protein Fold Recognition With an Ensemble Feature Selection Method Using PageRank Algorithm Based on Transformer. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1740-1751. [PMID: 38875077 DOI: 10.1109/tcbb.2024.3414497] [Indexed: 06/16/2024]
Abstract
Understanding the tertiary structures of proteins helps clarify their function in many aspects of human life. Protein fold recognition is a vital means of determining protein structure. Researchers have proposed a variety of methods for protein fold recognition, but novel and effective computational methods are still needed as protein structure databases continue to be updated. In this study, we develop a new protein structure dataset named AT and propose the PRFold-TNN model for protein fold recognition. First, different feature extraction methods, including AAC, HMM, HMM-Bigram, and ACC, are used to extract the corresponding features from protein sequences. Then an ensemble feature selection method based on the PageRank algorithm, integrating various tree-based algorithms, is used to screen the fused features. Finally, a classifier based on the Transformer model produces the final prediction. Experiments show that the prediction accuracy is 86.27% on the AT dataset and 88.91% on the independent test set, indicating that the model offers strong performance and generalization ability for protein fold recognition. Furthermore, on the DD, EDD, and TG benchmark datasets, the model achieves prediction accuracies of 88.41%, 97.91%, and 95.16%, which are at least 3.0%, 0.8%, and 2.5% higher than those of the state-of-the-art methods. We conclude that the PRFold-TNN model outperforms existing methods on these benchmarks.
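Of the feature extractors listed, AAC is the simplest to illustrate. It is commonly read as amino acid composition: the fraction of each of the 20 canonical residues in a sequence, giving a fixed-length 20-dimensional vector regardless of sequence length. A minimal sketch under that assumption follows; the function name and residue ordering are illustrative and not taken from the paper.

```python
# The 20 canonical amino acids in alphabetical one-letter order (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each canonical residue in the sequence."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no canonical amino acids")
    # Non-canonical symbols (for example X) are ignored, so fractions sum to 1.
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]
```

Vectors like this one can be concatenated with profile-based features before a feature selection step ranks and filters the combined columns.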
8
Satalkar V, Degaga GD, Li W, Pang YT, McShan AC, Gumbart JC, Mitchell JC, Torres MP. Generative β-hairpin design using a residue-based physicochemical property landscape. Biophys J 2024; 123:2790-2806. [PMID: 38297834 PMCID: PMC11393682 DOI: 10.1016/j.bpj.2024.01.029] [Received: 10/23/2023] [Revised: 12/20/2023] [Accepted: 01/25/2024] [Indexed: 02/02/2024] [Open Access]
Abstract
De novo peptide design is a new frontier that has broad application potential in the biological and biomedical fields. Most existing models for de novo peptide design are largely based on sequence homology that can be restricted based on evolutionarily derived protein sequences and lack the physicochemical context essential in protein folding. Generative machine learning for de novo peptide design is a promising way to synthesize theoretical data that are based on, but unique from, the observable universe. In this study, we created and tested a custom peptide generative adversarial network intended to design peptide sequences that can fold into the β-hairpin secondary structure. This deep neural network model is designed to establish a preliminary foundation of the generative approach based on physicochemical and conformational properties of 20 canonical amino acids, for example, hydrophobicity and residue volume, using extant structure-specific sequence data from the PDB. The beta generative adversarial network model robustly distinguishes secondary structures of β hairpin from α helix and intrinsically disordered peptides with an accuracy of up to 96% and generates artificial β-hairpin peptide sequences with minimum sequence identities around 31% and 50% when compared against the current NCBI PDB and nonredundant databases, respectively. These results highlight the potential of generative models specifically anchored by physicochemical and conformational property features of amino acids to expand the sequence-to-structure landscape of proteins beyond evolutionary limits.
Affiliation(s)
- Vardhan Satalkar
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia
- Gemechis D Degaga
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
- Wei Li
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia
- Yui Tik Pang
  - School of Physics, Georgia Institute of Technology, Atlanta, Georgia
- Andrew C McShan
  - School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
- James C Gumbart
  - School of Physics, Georgia Institute of Technology, Atlanta, Georgia; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
- Julie C Mitchell
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
- Matthew P Torres
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
9
Tan Y, Li M, Zhou Z, Tan P, Yu H, Fan G, Hong L. PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications. J Cheminform 2024; 16:92. [PMID: 39095917 PMCID: PMC11297785 DOI: 10.1186/s13321-024-00884-3] [Received: 03/21/2024] [Accepted: 07/13/2024] [Indexed: 08/04/2024] [Open Access]
Abstract
Protein language models (PLMs) play a dominant role in protein representation learning. Most existing PLMs regard proteins as sequences of 20 natural amino acids. The problem with this representation method is that it simply divides the protein sequence into sequences of individual amino acids, ignoring the fact that certain residues often occur together. Therefore, it is inappropriate to view amino acids as isolated tokens. Instead, the PLMs should recognize the frequently occurring combinations of amino acids as a single token. In this study, we use the byte-pair-encoding algorithm and unigram to construct advanced residue vocabularies for protein sequence tokenization, and we have shown that PLMs pre-trained using these advanced vocabularies exhibit superior performance on downstream tasks when compared to those trained with simple vocabularies. Furthermore, we introduce PETA, a comprehensive benchmark for systematically evaluating PLMs. We find that vocabularies comprising 50 and 200 elements achieve optimal performance. Our code, model weights, and datasets are available at https://github.com/ginnm/ProteinPretraining . SCIENTIFIC CONTRIBUTION: This study introduces advanced protein sequence tokenization analysis, leveraging the byte-pair-encoding algorithm and unigram. By recognizing frequently occurring combinations of amino acids as single tokens, our proposed method enhances the performance of PLMs on downstream tasks. Additionally, we present PETA, a new comprehensive benchmark for the systematic evaluation of PLMs, demonstrating that vocabularies of 50 and 200 elements offer optimal performance.
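To make the byte-pair-encoding idea concrete, the following is a minimal sketch of BPE applied to protein sequences: start from single residues, repeatedly merge the most frequent adjacent pair into a new token, and stop after a fixed number of merges. The function names, the tie-breaking rule, and the toy sequences are assumptions for illustration; this is not the PETA implementation, and it does not cover the unigram variant.

```python
from collections import Counter


def _apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single-residue tokens."""
    corpus = [list(sequence) for sequence in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice made for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [_apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = _apply_merge(tokens, pair)
    return tokens
```

The final vocabulary is the set of single residues plus one token per learned merge, so the number of merges is the knob that controls vocabulary size in this sketch.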
Affiliation(s)
- Yang Tan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
  - Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
- Mingchen Li
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
  - Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
- Ziyi Zhou
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
- Pan Tan
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
- Huiqun Yu
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Guisheng Fan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Liang Hong
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
  - Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
10
Jamasb AR, Morehead A, Joshi CK, Zhang Z, Didi K, Mathis S, Harris C, Tang J, Cheng J, Liò P, Blundell TL. Evaluating Representation Learning on the Protein Structure Universe. ARXIV 2024:arXiv:2406.13864v1. [PMID: 38947934 PMCID: PMC11213157] [Indexed: 07/02/2024]
Abstract
We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representation and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent compared to invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our open-source codebase reduces the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases including AlphaFoldDB and ESM Atlas, as well as (2) utilities for constructing new tasks from the entire PDB. ProteinWorkshop is available at: github.com/a-r-j/ProteinWorkshop.
11
Nguyen VTD, Hy TS. Multimodal pretraining for unsupervised protein representation learning. Biol Methods Protoc 2024; 9:bpae043. [PMID: 38983679 PMCID: PMC11233121 DOI: 10.1093/biomethods/bpae043] [Received: 04/15/2024] [Revised: 05/30/2024] [Accepted: 06/12/2024] [Indexed: 07/11/2024] [Open Access]
Abstract
Proteins are complex biomolecules essential for numerous biological processes, making them crucial targets for advancements in molecular biology, medical research, and drug design. Understanding their intricate, hierarchical structures, and functions is vital for progress in these fields. To capture this complexity, we introduce Multimodal Protein Representation Learning (MPRL), a novel framework for symmetry-preserving multimodal pretraining that learns unified, unsupervised protein representations by integrating primary and tertiary structures. MPRL employs Evolutionary Scale Modeling (ESM-2) for sequence analysis, Variational Graph Auto-Encoders (VGAE) for residue-level graphs, and PointNet Autoencoder (PAE) for 3D point clouds of atoms, each designed to capture the spatial and evolutionary intricacies of proteins while preserving critical symmetries. By leveraging Auto-Fusion to synthesize joint representations from these pretrained models, MPRL ensures robust and comprehensive protein representations. Our extensive evaluation demonstrates that MPRL significantly enhances performance in various tasks such as protein-ligand binding affinity prediction, protein fold classification, enzyme activity identification, and mutation stability prediction. This framework advances the understanding of protein dynamics and facilitates future research in the field. Our source code is publicly available at https://github.com/HySonLab/Protein_Pretrain.
Affiliation(s)
| | - Truong Son Hy
- FPT Software AI Center, HCMC, Hanoi, Vietnam
- Department of Mathematics and Computer Science, Indiana State University, Terre Haute, IN, 47809, United States
| |
|
12
|
Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics 2024; 40:btae196. [PMID: 38608190 PMCID: PMC11055402 DOI: 10.1093/bioinformatics/btae196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Revised: 02/20/2024] [Accepted: 04/11/2024] [Indexed: 04/14/2024] Open
Abstract
MOTIVATION Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. RESULTS We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. 
Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.
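To make the contrast between single-character tokens and learned sub-sequence tokens concrete, here is a minimal byte-pair-encoding style sketch on toy protein sequences. It is an assumption-laden illustration: the abstract does not say which tokenization algorithms were used, the toy sequences are invented, and the function names are hypothetical rather than taken from the BiologicalTokenizers repository.

```python
from collections import Counter

def apply_merge(tokens, pair):
    """Fuse every non-overlapping occurrence of `pair` into one token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(sequences, num_merges):
    """Learn merges by repeatedly fusing the most frequent adjacent token pair."""
    corpus = [list(seq) for seq in sequences]  # start from one token per residue
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Break ties deterministically so repeated runs give the same merges.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges

def tokenize(seq, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens

# Invented toy sequences, for illustration only.
training = ["MKTAYIAKQR", "MKTAYLAKQR", "MKVAYIAKQR"]
merges = train_bpe(training, num_merges=3)
tokens = tokenize("MKTAYIAKQR", merges)
print(len("MKTAYIAKQR"), "characters ->", len(tokens), "tokens:", tokens)
```

Each merge shortens the token sequence wherever the fused pair occurs, which is the mechanism behind the reduction in input length that the abstract reports for alternative tokenizers.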
Affiliation(s)
- Edo Dotan
- The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
| | - Gal Jaschek
- Department of Genetics, Yale University School of Medicine, New Haven, CT 06510, United States
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
| | - Yonatan Belinkov
- The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
| |
|
13
|
Ektefaie Y, Shen A, Bykova D, Marin M, Zitnik M, Farhat M. Evaluating generalizability of artificial intelligence models for molecular datasets. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.25.581982. [PMID: 38464295 PMCID: PMC10925170 DOI: 10.1101/2024.02.25.581982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/12/2024]
Abstract
Deep learning has made rapid advances in modeling molecular sequencing data. Despite achieving high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata based (MB) or sequence-similarity based (SB) train and test splits of input data before assessing model performance. Here, we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, i.e., similarity between train and test splits. We introduce Spectra, a spectral framework for comprehensive model evaluation. For a given model and input data, Spectra plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability. We apply Spectra to 18 sequencing datasets with associated phenotypes ranging from antibiotic resistance in tuberculosis to protein-ligand binding to evaluate the generalizability of 19 state-of-the-art deep learning models, including large language models, graph neural networks, diffusion models, and convolutional neural networks. We show that SB and MB splits provide an incomplete assessment of model generalizability. With Spectra, we find as cross-split overlap decreases, deep learning models consistently exhibit a reduction in performance in a task- and model-dependent manner. Although no model consistently achieved the highest performance across all tasks, we show that deep learning models can generalize to previously unseen sequences on specific tasks. Spectra paves the way toward a better understanding of how foundation models generalize in biology.
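The summary statistic described above, the area under a curve of performance against cross-split overlap, can be sketched with a simple trapezoidal rule. This is only an illustration under assumptions: overlap is taken to be a value in [0, 1], the scores are invented, and the function name is hypothetical; the abstract does not specify how Spectra defines overlap or normalizes the area.

```python
def area_under_curve(overlaps, scores):
    """Trapezoidal area under score as a function of cross-split overlap."""
    points = sorted(zip(overlaps, scores))  # integrate left to right
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Invented example: performance drops as train/test overlap decreases.
overlaps = [1.0, 0.75, 0.5, 0.25, 0.0]
scores = [0.92, 0.88, 0.80, 0.71, 0.60]
auc = area_under_curve(overlaps, scores)
print(f"area under the overlap-performance curve: {auc:.4f}")
```

Unlike a single train/test split, this number depends on performance at every overlap level, so a model that only does well when test sequences resemble training sequences receives a lower value.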
Affiliation(s)
- Yasha Ektefaie
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Andrew Shen
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Department of Computer Science, Northwestern University, Evanston, IL, USA
| | - Daria Bykova
- Department of Biological Sciences, Columbia University, New York, NY, USA
| | - Maximillian Marin
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Marinka Zitnik
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Harvard Data Science Initiative, Cambridge, MA, USA
| | - Maha Farhat
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Division of Pulmonary and Critical Care, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
| |
|
14
|
Valentini G, Malchiodi D, Gliozzo J, Mesiti M, Soto-Gomez M, Cabri A, Reese J, Casiraghi E, Robinson PN. The promises of large language models for protein design and modeling. FRONTIERS IN BIOINFORMATICS 2023; 3:1304099. [PMID: 38076030 PMCID: PMC10701588 DOI: 10.3389/fbinf.2023.1304099] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2023] [Accepted: 11/07/2023] [Indexed: 10/16/2024] Open
Abstract
The recent breakthroughs of Large Language Models (LLMs) in the context of natural language processing have opened the way to significant advances in protein research. Indeed, the relationships between human natural language and the "language of proteins" invite the application and adaptation of LLMs to protein modelling and design. Considering the impressive results of GPT-4 and other recently developed LLMs in processing, generating and translating human languages, we anticipate analogous results with the language of proteins. Indeed, protein language models have already been trained to accurately predict protein properties and to generate novel functionally characterized proteins, achieving state-of-the-art results. In this paper we discuss the promises and the open challenges raised by this novel and exciting research area, and we propose our perspective on how LLMs will affect protein modeling and design.
Affiliation(s)
- Giorgio Valentini
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- ELLIS, European Laboratory for Learning and Intelligent Systems, Milan, Italy
| | - Dario Malchiodi
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
| | - Jessica Gliozzo
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- European Commission, Joint Research Centre (JRC), Ispra, Italy
| | - Marco Mesiti
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
| | - Mauricio Soto-Gomez
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
| | - Alberto Cabri
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
| | - Justin Reese
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
| | - Elena Casiraghi
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- ELLIS, European Laboratory for Learning and Intelligent Systems, Milan, Italy
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
| | | |
|
15
|
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Affiliation(s)
- Tianwei Yue
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Yuanxin Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Longxiang Zhang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Chunming Gu
- Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Haoru Xue
- The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Wenping Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Qi Lyu
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA
| | - Yujie Dun
- School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
| |
|
16
|
Yang M, Chen S, Huang Z, Gao S, Yu T, Du T, Zhang H, Li X, Liu CM, Chen S, Li H. Deep learning-enabled discovery and characterization of HKT genes in Spartina alterniflora. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2023; 116:690-705. [PMID: 37494542 DOI: 10.1111/tpj.16397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/10/2023] [Revised: 07/03/2023] [Accepted: 07/11/2023] [Indexed: 07/28/2023]
Abstract
Spartina alterniflora is a halophyte that can survive in high-salinity environments, and it is phylogenetically close to important cereal crops, such as maize and rice. It is of scientific interest to understand why S. alterniflora can live under such extremely stressful conditions. The molecular mechanism underlying its high-saline tolerance is still largely unknown. Here we investigated the possibility that high-affinity K+ transporters (HKTs), which function in salt tolerance and maintenance of ion homeostasis in plants, are responsible for salt tolerance in S. alterniflora. To overcome the imprecision and instability of gene screening based on conventional sequence alignment, we used a deep learning method, DeepGOPlus, to automatically extract sequence and protein characteristics from our newly assembled S. alterniflora genome to identify SaHKTs. A total of 16 HKT genes were identified. The number of S. alterniflora HKTs (SaHKTs) is larger than that in all other investigated plant species except wheat. Phylogenetically related SaHKT members had similar gene structures, conserved protein domains and cis-elements. Expression profiling showed that most SaHKT genes are expressed in specific tissues and are differentially expressed under salt stress. Yeast complementation expression analysis showed that type I members SaHKT1;2, SaHKT1;3 and SaHKT1;8 and type II members SaHKT2;1, SaHKT2;3 and SaHKT2;4 had low-affinity K+ uptake ability, that type II members showed stronger K+ affinity than rice and Arabidopsis HKTs, and that most SaHKTs showed a preference for Na+ transport. We believe that deep learning-based methods are powerful approaches for uncovering new functional genes, and that the SaHKT genes identified are important resources for breeding new varieties of salt-tolerant crops.
Affiliation(s)
- Maogeng Yang
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
- Nanfan Research Institute, CAAS, Sanya, Hainan, China
- Key Laboratory of Plant Molecular & Developmental Biology, College of Life Sciences, Yantai University, Yantai, Shandong, China
| | - Shoukun Chen
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
- Nanfan Research Institute, CAAS, Sanya, Hainan, China
- Hainan Yazhou Bay Seed Laboratory, Sanya, Hainan, China
| | - Zhangping Huang
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
- Nanfan Research Institute, CAAS, Sanya, Hainan, China
| | - Shang Gao
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
- Nanfan Research Institute, CAAS, Sanya, Hainan, China
| | - Tingxi Yu
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
- Nanfan Research Institute, CAAS, Sanya, Hainan, China
| | - Tingting Du
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
- Nanfan Research Institute, CAAS, Sanya, Hainan, China
| | - Hao Zhang
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
- Nanfan Research Institute, CAAS, Sanya, Hainan, China
| | - Xiang Li
- State Key Laboratory of Plant Genomics and National Center for Plant Gene Research, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, Beijing, China
| | - Chun-Ming Liu
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
- Key Laboratory of Plant Molecular Physiology, Institute of Botany, Chinese Academy of Sciences, Beijing, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- School of Advanced Agricultural Sciences, Peking University, Beijing, China
| | - Shihua Chen
- Key Laboratory of Plant Molecular & Developmental Biology, College of Life Sciences, Yantai University, Yantai, Shandong, China
| | - Huihui Li
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
- Nanfan Research Institute, CAAS, Sanya, Hainan, China
| |
|
17
|
Hui T, Descoteaux ML, Miao J, Lin YS. Training Neural Network Models Using Molecular Dynamics Simulation Results to Efficiently Predict Cyclic Hexapeptide Structural Ensembles. J Chem Theory Comput 2023. [PMID: 37236147 PMCID: PMC10373485 DOI: 10.1021/acs.jctc.3c00154] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
Cyclic peptides have emerged as a promising class of therapeutics. However, their de novo design remains challenging, and many cyclic peptide drugs are simply natural products or their derivatives. Most cyclic peptides, including the current cyclic peptide drugs, adopt multiple conformations in water. The ability to characterize cyclic peptide structural ensembles would greatly aid their rational design. In a previous pioneering study, our group demonstrated that using molecular dynamics results to train machine learning models can efficiently predict structural ensembles of cyclic pentapeptides. Using this method, which was termed StrEAMM (Structural Ensembles Achieved by Molecular Dynamics and Machine Learning), linear regression models were able to predict the structural ensembles for an independent test set with R2 = 0.94 between the predicted populations for specific structures and the observed populations in molecular dynamics simulations for cyclic pentapeptides. An underlying assumption in these StrEAMM models is that cyclic peptide structural preferences are predominantly influenced by neighboring interactions, namely, interactions between (1,2) and (1,3) residues. Here we demonstrate that for larger cyclic peptides such as cyclic hexapeptides, linear regression models including only (1,2) and (1,3) interactions fail to produce satisfactory predictions (R2 = 0.47); further inclusion of (1,4) interactions leads to moderate improvements (R2 = 0.75). We show that when using convolutional neural networks and graph neural networks to incorporate complex nonlinear interaction patterns, we can achieve R2 = 0.97 and R2 = 0.91 for cyclic pentapeptides and hexapeptides, respectively.
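One way to see why (1,2) and (1,3) interactions can be sufficient for cyclic pentapeptides but not for cyclic hexapeptides is to count residue pairs by their separation around the ring. The sketch below does only that counting; it is an illustrative aid with hypothetical function names, not the StrEAMM model or the authors' code.

```python
from itertools import combinations

def ring_pairs_by_separation(n):
    """Group unordered residue pairs of an n-residue ring by ring separation.

    Separation 1 corresponds to (1,2) neighbors, 2 to (1,3), 3 to (1,4).
    """
    groups = {}
    for i, j in combinations(range(n), 2):
        d = j - i
        separation = min(d, n - d)  # shortest way around the ring
        groups.setdefault(separation, []).append((i, j))
    return groups

penta = ring_pairs_by_separation(5)
hexa = ring_pairs_by_separation(6)
for name, groups in (("pentapeptide", penta), ("hexapeptide", hexa)):
    counts = {f"(1,{sep + 1})": len(pairs) for sep, pairs in sorted(groups.items())}
    print(name, counts)
```

In a five-membered ring every residue pair is either a (1,2) or a (1,3) pair, so a model restricted to those two interaction types still sees all pairs. A six-membered ring adds three (1,4) pairs between opposite residues, which is consistent with the improvement the abstract reports when (1,4) terms are included.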
Affiliation(s)
- Tiffani Hui
- Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
| | - Marc L Descoteaux
- Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
| | - Jiayuan Miao
- Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
| | - Yu-Shan Lin
- Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
| |
|
18
|
Wu F, Courty N, Jin S, Li SZ. Improving molecular representation learning with metric learning-enhanced optimal transport. PATTERNS (NEW YORK, N.Y.) 2023; 4:100714. [PMID: 37123438 PMCID: PMC10140620 DOI: 10.1016/j.patter.2023.100714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Revised: 12/29/2022] [Accepted: 03/01/2023] [Indexed: 05/02/2023]
Abstract
Training data are usually limited or heterogeneous in many chemical and biological applications. Existing machine learning models for chemistry and materials science fail to consider generalizing beyond training domains. In this article, we develop a novel optimal transport-based algorithm termed MROT to enhance their generalization capability for molecular regression problems. MROT learns a continuous label of the data by measuring a new metric of domain distances and a posterior variance regularization over the transport plan to bridge the chemical domain gap. Among downstream tasks, we consider basic chemical regression tasks in unsupervised and semi-supervised settings, including chemical property prediction and materials adsorption selection. Extensive experiments show that MROT significantly outperforms state-of-the-art models, showing promising potential in accelerating the discovery of new substances with desired properties.
Affiliation(s)
- Fang Wu
- School of Engineering, Westlake University, Hangzhou 310024, China
- Institute of AI Industry Research, Tsinghua University, Beijing 100084, China
| | - Nicolas Courty
- French National Centre for Scientific Research, Southern Brittany University, Lorient, France
| | - Shuting Jin
- School of Informatics, Xiamen University, Xiamen 361005, China
- National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
| | - Stan Z. Li
- School of Engineering, Westlake University, Hangzhou 310024, China
| |
|
19
|
Sanderson T, Bileschi ML, Belanger D, Colwell LJ. ProteInfer, deep neural networks for protein functional inference. eLife 2023; 12:e80942. [PMID: 36847334 PMCID: PMC10063232 DOI: 10.7554/elife.80942] [Citation(s) in RCA: 54] [Impact Index Per Article: 27.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2022] [Accepted: 02/24/2023] [Indexed: 03/01/2023] Open
Abstract
Predicting the function of a protein from its amino acid sequence is a long-standing challenge in bioinformatics. Traditional approaches use sequence alignment to compare a query sequence either to thousands of models of protein families or to large databases of individual protein sequences. Here we introduce ProteInfer, which instead employs deep convolutional neural networks to directly predict a variety of protein functions - Enzyme Commission (EC) numbers and Gene Ontology (GO) terms - directly from an unaligned amino acid sequence. This approach provides precise predictions which complement alignment-based methods, and the computational efficiency of a single neural network permits novel and lightweight software interfaces, which we demonstrate with an in-browser graphical interface for protein function prediction in which all computation is performed on the user's personal computer with no data uploaded to remote servers. Moreover, these models place full-length amino acid sequences into a generalised functional space, facilitating downstream analysis and interpretation. To read the interactive version of this paper, please visit https://google-research.github.io/proteinfer/.
Affiliation(s)
| | | | | | - Lucy J Colwell
- Google AI, Boston, United States
- University of Cambridge, Cambridge, United Kingdom
| |
|
20
|
Wang F, Feng X, Kong R, Chang S. Generating new protein sequences by using dense network and attention mechanism. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:4178-4197. [PMID: 36899622 DOI: 10.3934/mbe.2023195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/18/2023]
Abstract
Protein engineering uses de novo protein design technology to change the protein gene sequence and thereby improve the physical and chemical properties of proteins. These newly generated proteins will better meet research needs in their properties and functions. The Dense-AutoGAN model is based on a GAN combined with an Attention mechanism to generate protein sequences. In this GAN architecture, the Attention mechanism and Encoder-decoder can improve the similarity of generated sequences and obtain variations in a smaller range on the original basis. Meanwhile, a new convolutional neural network is constructed using a dense network. The dense network transmits in multiple layers over the generator network of the GAN architecture, which expands the training space and improves the effectiveness of sequence generation. Finally, the complex protein sequences are generated on the mapping of protein functions. Through comparisons with other models, the generated sequences of Dense-AutoGAN verify the model performance. The newly generated proteins are highly accurate and effective in chemical and physical properties.
Affiliation(s)
- Feng Wang
- School of Computer Engineering, Suzhou Vocational University, Suzhou, China
- Information Engineering Department, Changzhou University Huaide College, Taizhou, China
| | - Xiaochen Feng
- Information Engineering Department, Changzhou University Huaide College, Taizhou, China
| | - Ren Kong
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, China
| | - Shan Chang
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, China
| |
|
21
|
Wu L, Yin C, Zhu J, Wu Z, He L, Xia Y, Xie S, Qin T, Liu TY. SPRoBERTa: protein embedding learning with local fragment modeling. Brief Bioinform 2022; 23:6711410. [PMID: 36136367 DOI: 10.1093/bib/bbac401] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2022] [Revised: 07/18/2022] [Accepted: 08/18/2022] [Indexed: 12/14/2022] Open
Abstract
Understanding protein function and structure in computational biology helps in understanding human beings. Because only a limited number of proteins are annotated structurally and functionally, the scientific community has embraced self-supervised pre-training on large amounts of unlabeled protein sequences for protein embedding learning. However, a protein is usually represented by individual amino acids with a limited vocabulary size (e.g. 20 amino acid types), without considering the strong local semantics existing in protein sequences. In this work, we propose a novel pre-training modeling approach, SPRoBERTa. We first present an unsupervised protein tokenizer to learn protein representations with local fragment patterns. Then, a novel framework for deep pre-training models is introduced to learn protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction tasks (e.g. secondary structure prediction), amino acid pair-level prediction tasks (e.g. contact prediction) and protein-level prediction tasks (remote homology prediction, protein function prediction). Experiments show that our approach achieves significant improvements in all tasks and outperforms the previous methods. We also provide detailed ablation studies and analysis for our protein tokenizer and training framework.
Affiliation(s)
- Lijun Wu
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Chengcan Yin
- National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China
| | - Jinhua Zhu
- CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China, No.96, JinZhai Road Baohe District, 230026, Hefei, Anhui Province, China
| | - Zhen Wu
- National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China
| | - Liang He
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Yingce Xia
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Shufang Xie
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Tao Qin
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Tie-Yan Liu
- Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
| |
|
22
|
An J, Weng X. Collectively encoding protein properties enriches protein language models. BMC Bioinformatics 2022; 23:467. [DOI: 10.1186/s12859-022-05031-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2022] [Accepted: 10/31/2022] [Indexed: 11/10/2022] Open
Abstract
Pre-trained natural language processing models on a large natural language corpus can naturally transfer learned knowledge to protein domains by fine-tuning specific in-domain tasks. However, few studies focused on enriching such protein language models by jointly learning protein properties from strongly-correlated protein tasks. Here we elaborately designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks for protein family, superfamily and fold. Considering the co-existing contextual relevance between human words and protein language, we employed BERT, pre-trained on a large natural language corpus, as our backbone to handle protein sequences. More importantly, the encoded knowledge obtained in the MTL stage can be well transferred to more fine-grained downstream tasks of TAPE. Experiments on structure- or evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.
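The joint objective behind a multi-task setup like the one described above can be sketched as a weighted sum of per-task cross-entropy losses computed from separate classification heads. The equal default weights, the head names and the toy logits below are assumptions for illustration; the abstract does not state how the three task losses are combined.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[label])

def multitask_loss(head_logits, labels, weights=None):
    """Weighted sum of cross-entropy losses, one per task head."""
    if weights is None:
        weights = {task: 1.0 for task in head_logits}  # assumed equal weighting
    return sum(
        weights[task] * cross_entropy(head_logits[task], labels[task])
        for task in head_logits
    )

# Toy logits from three hypothetical heads that share one sequence encoder.
head_logits = {
    "family": [2.0, 0.1, -1.0, 0.3],
    "superfamily": [0.2, 1.5, -0.4],
    "fold": [1.0, -0.5],
}
labels = {"family": 0, "superfamily": 1, "fold": 0}
print(f"joint loss: {multitask_loss(head_logits, labels):.4f}")
```

Because all heads backpropagate into the same encoder, a gradient step on this summed loss pushes the shared representation to be useful for family, superfamily and fold classification at once.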
|
23
|
Wu F, Jin S, Jiang Y, Jin X, Tang B, Niu Z, Liu X, Zhang Q, Zeng X, Li SZ. Pre-Training of Equivariant Graph Matching Networks with Conformation Flexibility for Drug Binding. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2022; 9:e2203796. [PMID: 36202759 PMCID: PMC9685463 DOI: 10.1002/advs.202203796] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/01/2022] [Revised: 09/07/2022] [Indexed: 05/16/2023]
Abstract
Recent biological findings indicate that the static "lock-and-key" theory is not generally applicable and that changes in atomic sites and binding pose can provide important information for understanding drug binding. However, computational cost limits the growth of protein trajectory-related studies, thus hindering the possibility of supervised learning. This work presents ProtMD, a spatial-temporal pre-training method based on modified equivariant graph matching networks. ProtMD has two specially designed self-supervised learning tasks, an atom-level prompt-based denoising generative task and a conformation-level snapshot ordering task, which capture the flexibility information inside molecular dynamics (MD) trajectories with very fine temporal resolutions. ProtMD gives the encoder network the capacity to capture the time-dependent geometric mobility of conformations along MD trajectories. Two downstream tasks are chosen to verify the effectiveness of ProtMD through linear detection and task-specific fine-tuning. A large improvement over current state-of-the-art methods is observed, with a decrease of 4.3% in root mean square error for the binding affinity problem and an average increase of 13.8% in the area under the receiver operating characteristic curve and the area under the precision-recall curve for the ligand efficacy problem. The results demonstrate a strong correlation between the magnitude of a conformation's motion in 3D space and the strength with which the ligand binds to its receptor.
Affiliation(s)
- Fang Wu
  - School of Engineering, Westlake University, Hangzhou 310024, China
  - MindRank AI Ltd., Hangzhou 310000, China
- Shuting Jin
  - MindRank AI Ltd., Hangzhou 310000, China
  - School of Informatics, Xiamen University, Xiamen 361005, China
- Xiangrong Liu
  - School of Informatics, Xiamen University, Xiamen 361005, China
- Qiang Zhang
  - ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou 311200, China
  - College of Computer Science and Technology, Zhejiang University, Hangzhou 310013, China
- Xiangxiang Zeng
  - School of Information Science and Engineering, Hunan University, Hunan 410082, China
- Stan Z. Li
  - School of Engineering, Westlake University, Hangzhou 310024, China
24
25
Bileschi ML, Belanger D, Bryant DH, Sanderson T, Carter B, Sculley D, Bateman A, DePristo MA, Colwell LJ. Using deep learning to annotate the protein universe. Nat Biotechnol 2022; 40:932-937. [PMID: 35190689 DOI: 10.1038/s41587-021-01179-w] [Citation(s) in RCA: 127] [Impact Index Per Article: 42.3] [Received: 05/06/2021] [Accepted: 12/02/2021] [Indexed: 12/30/2022]
Abstract
Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.
Affiliation(s)
- Theo Sanderson
  - Google Research, Cambridge, MA, USA
  - The Francis Crick Institute, London, UK
- Brandon Carter
  - MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
- Mark A DePristo
  - Google Research, Cambridge, MA, USA
  - BigHat Biosciences, San Mateo, CA, USA
- Lucy J Colwell
  - Google Research, Cambridge, MA, USA
  - Department of Chemistry, University of Cambridge, Cambridge, UK
26
Feng G, Yao H, Li C, Liu R, Huang R, Fan X, Ge R, Miao Q. ME-ACP: Multi-view neural networks with ensemble model for identification of anticancer peptides. Comput Biol Med 2022; 145:105459. [DOI: 10.1016/j.compbiomed.2022.105459] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/15/2022] [Revised: 03/22/2022] [Accepted: 03/24/2022] [Indexed: 12/26/2022]
27
Pang Y, Liu B. SelfAT-Fold: Protein Fold Recognition Based on Residue-Based and Motif-Based Self-Attention Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1861-1869. [PMID: 33090951 DOI: 10.1109/tcbb.2020.3031888] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/11/2023]
Abstract
Protein fold recognition is a fundamental and crucial step of tertiary structure determination, and several computational predictors have been proposed for it. Recently, predictive performance has been markedly improved by fold-specific features generated by deep learning techniques. However, these methods fail to measure the global associations among residues or motifs along the protein sequences. Furthermore, these deep learning techniques are often treated as black boxes without interpretability. Inspired by the similarities between protein sequences and natural language sentences, we applied the self-attention mechanism from the natural language processing (NLP) field to protein fold recognition. The motif-based self-attention network (MSAN) and the residue-based self-attention network (RSAN) were constructed based on a training set to capture the global associations among the structure motifs and residues along the protein sequences, respectively. The fold-specific attention features trained and generated from the training set were then combined with Support Vector Machines (SVMs) to predict the samples in the widely used LE benchmark dataset, which is fully independent of the training set. Experimental results showed that the two proposed SelfAT-Fold predictors outperformed 34 existing state-of-the-art computational predictors. The two SelfAT-Fold predictors were further tested on an independent dataset, SCOP_TEST, and achieved stable performance. Furthermore, the fold-specific attention features can be used to analyse the characteristics of protein folds. The trained models and data of SelfAT-Fold can be downloaded from http://bliulab.net/selfAT_fold/.
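As a rough illustration of the self-attention mechanism this abstract refers to, the sketch below computes scaled dot-product self-attention over a short list of per-residue feature vectors. It is not the MSAN or RSAN architecture: the identity query/key/value projections, the function names and the toy feature values are all assumptions made for the example, whereas a trained model would learn its projection matrices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features):
    """Scaled dot-product self-attention with identity projections (assumption).

    `features` holds one equal-length vector per residue (or motif).
    Returns the context vectors and the attention weight matrix.
    """
    dim = len(features[0])
    outputs, weights = [], []
    for query in features:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        row = softmax(scores)
        weights.append(row)
        # Context vector: attention-weighted average of all feature vectors.
        outputs.append([sum(w * value[j] for w, value in zip(row, features))
                        for j in range(dim)])
    return outputs, weights

# Hypothetical toy features for three residues (not taken from the paper).
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, attn = self_attention(toy)
```

Each row of `attn` shows how strongly one position attends to every other position, which is the kind of global association between residues or motifs that the abstract describes.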
28
Hakala K, Kaewphan S, Bjorne J, Mehryary F, Moen H, Tolvanen M, Salakoski T, Ginter F. Neural Network and Random Forest Models in Protein Function Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1772-1781. [PMID: 33306472 DOI: 10.1109/tcbb.2020.3044230] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/12/2023]
Abstract
Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system that automatically assigns Gene Ontology (GO) terms to a given input protein sequence. The system combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with an NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance, ranking among the top 10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models, together with the data pre-processing and feature generation tools, are publicly available as open source software at https://github.com/TurkuNLP/CAFA3.
29
Villegas-Morcillo A, Gomez AM, Sanchez V. An analysis of protein language model embeddings for fold prediction. Brief Bioinform 2022; 23:6571527. [PMID: 35443054 DOI: 10.1093/bib/bbac142] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 02/01/2022] [Revised: 03/21/2022] [Accepted: 03/28/2022] [Indexed: 11/13/2022] Open
Abstract
The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings, mainly using evolutionary information in the form of multiple sequence alignment (MSA) as input source. In contrast, protein language models (LM) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction using pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are trained in a supervised manner with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluated the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at amino acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. All this suggests that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.
Affiliation(s)
- Amelia Villegas-Morcillo
  - Department of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain
- Angel M Gomez
  - Department of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain
- Victoria Sanchez
  - Department of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain
30
Zhang J, Yan K, Chen Q, Liu B. PreRBP-TL: prediction of species-specific RNA-binding proteins based on transfer learning. Bioinformatics 2022; 38:2135-2143. [PMID: 35176130 DOI: 10.1093/bioinformatics/btac106] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 03/22/2021] [Revised: 11/18/2021] [Accepted: 02/15/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: RNA-binding proteins (RBPs) play crucial roles in post-transcriptional regulation. Accurate identification of RBPs helps to understand gene expression, regulation, etc. In recent years, some computational methods were proposed to identify RBPs. However, these methods fail to accurately identify RBPs from some specific species with limited data, such as bacteria. RESULTS: In this study, we introduce a computational method called PreRBP-TL for identifying species-specific RBPs based on transfer learning. The weights of the prediction model were initialized by pretraining with the large general RBP dataset and then fine-tuned with the small species-specific RBP dataset by using transfer learning. The experimental results show that PreRBP-TL achieves better performance for identifying species-specific RBPs from Human, Arabidopsis, Escherichia coli and Salmonella, outperforming eight state-of-the-art computational methods. It is anticipated that PreRBP-TL will become a useful method for identifying RBPs. AVAILABILITY AND IMPLEMENTATION: For the convenience of researchers who need to identify RBPs, a web server for PreRBP-TL was established, freely available at http://bliulab.net/PreRBP-TL. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jun Zhang
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Ke Yan
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Qingcai Chen
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Bin Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
31
Detlefsen NS, Hauberg S, Boomsma W. Learning meaningful representations of protein sequences. Nat Commun 2022; 13:1914. [PMID: 35395843 PMCID: PMC8993921 DOI: 10.1038/s41467-022-29443-w] [Citation(s) in RCA: 61] [Impact Index Per Article: 20.3] [Received: 11/25/2020] [Accepted: 03/15/2022] [Indexed: 01/27/2023] Open
Abstract
How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.
Affiliation(s)
- Søren Hauberg
  - Section for Cognitive Systems, Technical University of Denmark, Kgs. Lyngby, Denmark
- Wouter Boomsma
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
32
Dee W. LMPred: predicting antimicrobial peptides using pre-trained language models and deep learning. Bioinformatics Advances 2022; 2:vbac021. [PMID: 36699381 PMCID: PMC9710646 DOI: 10.1093/bioadv/vbac021] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 11/29/2021] [Revised: 03/01/2022] [Accepted: 03/29/2022] [Indexed: 01/28/2023]
Abstract
Motivation: Antimicrobial peptides (AMPs) are increasingly being used in the development of new therapeutic drugs in areas such as cancer therapy and hypertension. Additionally, they are seen as an alternative to antibiotics due to the increasing occurrence of bacterial resistance. Wet-laboratory experimental identification, however, is both time-consuming and costly, so in silico models are now commonly used to screen new AMP candidates. Results: This paper proposes a novel approach for creating model inputs: pre-trained language models produce contextualized embeddings representing the amino acids within each peptide sequence, and a convolutional neural network is then trained as the classifier. The results were validated on two datasets: one previously used in AMP prediction research, and a larger independent dataset created for this paper. Predictive accuracies of 93.33% and 88.26% were achieved, respectively, outperforming previous state-of-the-art classification models. Availability and implementation: All code is available at https://github.com/williamdee1/LMPred_AMP_Prediction. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- William Dee
  - Department of Bioinformatics, School of Biological and Behavioural Sciences, Queen Mary University of London, London E1 4NS, UK. To whom correspondence should be addressed.
33
Roethel A, Biliński P, Ishikawa T. BioS2Net: Holistic Structural and Sequential Analysis of Biomolecules Using a Deep Neural Network. Int J Mol Sci 2022; 23:2966. [PMID: 35328384 PMCID: PMC8954277 DOI: 10.3390/ijms23062966] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/15/2022] [Revised: 03/05/2022] [Accepted: 03/08/2022] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND: For decades, the rate of solving new biomolecular structures has been exceeding that at which their manual classification and feature characterisation can be carried out efficiently. Therefore, a new comprehensive and holistic tool for their examination is needed. METHODS: Here we propose the Biological Sequence and Structure Network (BioS2Net), which is a novel deep neural network architecture that extracts both sequential and structural information of biomolecules. Our architecture consists of four main parts: (i) a sequence convolutional extractor, (ii) a 3D structure extractor, (iii) a 3D structure-aware sequence temporal network, as well as (iv) a fusion and classification network. RESULTS: We have evaluated our approach using two protein fold classification datasets. BioS2Net achieved a 95.4% mean class accuracy on the eDD dataset and a 76% mean class accuracy on the F184 dataset. The accuracy of BioS2Net obtained on the eDD dataset was comparable to results achieved by previously published methods, confirming that the algorithm described in this article is a top-class solution for protein fold recognition. CONCLUSIONS: BioS2Net is a novel tool for the holistic examination of biomolecules of known structure and sequence. It is a reliable tool for protein analysis and their unified representation as feature vectors.
Affiliation(s)
- Albert Roethel
  - Department of Molecular Biology, Institute of Biochemistry, Faculty of Biology, University of Warsaw, 02-096 Warsaw, Poland
  - College of Inter-Faculty Individual Studies in Mathematics and Natural Sciences, University of Warsaw, 02-097 Warsaw, Poland
- Piotr Biliński
  - Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, 02-097 Warsaw, Poland
- Takao Ishikawa
  - Department of Molecular Biology, Institute of Biochemistry, Faculty of Biology, University of Warsaw, 02-096 Warsaw, Poland
34
A deep learning model to detect novel pore-forming proteins. Sci Rep 2022; 12:2013. [PMID: 35132124 PMCID: PMC8821639 DOI: 10.1038/s41598-022-05970-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2021] [Accepted: 01/12/2022] [Indexed: 11/09/2022] Open
Abstract
Many pore-forming proteins originating from pathogenic bacteria are toxic against agricultural pests. They are the key ingredients in several pesticidal products for agricultural use, including transgenic crops. There is an urgent need to identify novel pore-forming proteins to combat development of resistance in pests to existing products, and to develop products that are effective against a broader range of pests. Existing computational methodologies to search for these proteins rely on sequence homology-based approaches. These approaches are based on similarities between protein sequences, and thus are limited in their usefulness for discovering novel proteins. In this paper, we outline a novel deep learning model trained on pore-forming proteins from the public domain. We compare different ways of encoding protein information during training, and contrast it with traditional approaches. We show that our model is capable of identifying known pore formers with no sequence similarity to the proteins used to train the model, and therefore holds promise for identifying novel pore formers.
35
Jin X, Luo X, Liu B. PHR-search: a search framework for protein remote homology detection based on the predicted protein hierarchical relationships. Brief Bioinform 2022; 23:6520306. [PMID: 35134113 DOI: 10.1093/bib/bbab609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2021] [Revised: 12/14/2021] [Accepted: 12/30/2021] [Indexed: 11/13/2022] Open
Abstract
Protein remote homology detection is one of the most fundamental research tools for protein structure and function prediction. Most search methods for protein remote homology detection are evaluated based on the Structural Classification of Proteins-extended (SCOPe) benchmark, but the diverse hierarchical structure relationships between the query protein and candidate proteins are ignored by these methods. To further improve the predictive performance for protein remote homology detection, a search framework based on the predicted protein hierarchical relationships (PHR-search) is proposed. In the PHR-search framework, the superfamily-level prediction information is obtained by extracting the local and global features of the Hidden Markov Model (HMM) profile through a convolutional neural network, and it is converted to fold-level and class-level prediction information according to the hierarchical relationships of SCOPe. Based on these predicted protein hierarchical relationships, a filtering strategy and a re-ranking strategy are used to construct the two-level search of PHR-search. Experimental results show that the PHR-search framework achieves state-of-the-art performance by employing five basic search methods, including HHblits, JackHMMER, PSI-BLAST, DELTA-BLAST and PSI-BLASTexB. Furthermore, a web server for PHR-search has been established, which can be accessed at http://bliulab.net/PHR-search.
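The conversion from superfamily-level predictions to fold-level and class-level predictions can be pictured with a small sketch. It relies on the fact that SCOPe identifiers are hierarchical, so a superfamily written as "class.fold.superfamily" (for example "a.4.5") already names its fold ("a.4") and class ("a"). Summing probabilities over child superfamilies is an assumed aggregation rule for illustration only; the abstract does not state how PHR-search performs the conversion, and the function name and scores below are hypothetical.

```python
from collections import defaultdict

def roll_up(superfamily_probs):
    """Aggregate superfamily-level probabilities to fold and class level.

    Keys are SCOPe-style identifiers "class.fold.superfamily".
    Summation over children is an assumption made for this sketch.
    """
    fold_probs = defaultdict(float)
    class_probs = defaultdict(float)
    for sid, prob in superfamily_probs.items():
        cls, fold, _superfamily = sid.split(".")
        fold_probs[f"{cls}.{fold}"] += prob
        class_probs[cls] += prob
    return dict(fold_probs), dict(class_probs)

# Hypothetical superfamily-level scores for one query protein.
predicted = {"a.4.1": 0.5, "a.4.5": 0.25, "b.1.1": 0.25}
folds, classes = roll_up(predicted)
```

With scores at all three levels, a search pipeline could, for instance, filter candidates whose class disagrees with the prediction and re-rank the remainder by fold or superfamily agreement.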
Affiliation(s)
- Xiaopeng Jin
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- Xiaoling Luo
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
36
Rorabaugh AK, Caíno-Lores S, Johnston T, Taufer M. High frequency accuracy and loss data of random neural networks trained on image datasets. Data Brief 2022; 40:107780. [PMID: 35036484 PMCID: PMC8749157 DOI: 10.1016/j.dib.2021.107780] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/17/2021] [Revised: 12/15/2021] [Accepted: 12/29/2021] [Indexed: 11/15/2022] Open
Abstract
Neural Networks (NNs) are increasingly used across scientific domains to extract knowledge from experimental or computational data. An NN is composed of natural or artificial neurons that serve as simple processing units and are interconnected into a model architecture; it acquires knowledge from the environment through a learning process and stores this knowledge in its connections. The learning process is conducted by training. During NN training, the learning process can be tracked by periodically validating the NN and calculating its fitness. The resulting sequence of fitness values (i.e., validation accuracy or validation loss) is called the NN learning curve. The development of tools for NN design requires knowledge of diverse NNs and their complete learning curves. Generally, only final fully-trained fitness values for highly accurate NNs are made available to the community, hampering efforts to develop tools for NN design and leaving unaddressed aspects such as explaining the generation of an NN and reproducing its learning process. Our dataset fills this gap by fully recording the structure, metadata, and complete learning curves for a wide variety of random NNs throughout their training. Our dataset captures the lifespan of 6000 NNs throughout generation, training, and validation stages. It consists of a suite of 6000 tables, each table representing the lifespan of one NN. We generate each NN with randomized parameter values and train it for 40 epochs on one of three diverse image datasets (i.e., CIFAR-100, FashionMNIST, SVHN). We calculate and record each NN's fitness with high frequency (every half epoch) to capture the evolution of the training and validation process. As a result, for each NN, we record the generated parameter values describing the structure of that NN, the image dataset on which the NN trained, and all loss and accuracy values for the NN every half epoch.
We put our dataset to the service of researchers studying NN performance and its evolution throughout training and validation. Statistical methods can be applied to our dataset to analyze the shape of learning curves in diverse NNs, and the relationship between an NN's structure and its fitness. Additionally, the structural data and metadata that we record enable the reconstruction and reproducibility of the associated NN.
37
Cui F, Li S, Zhang Z, Sui M, Cao C, El-Latif Hesham A, Zou Q. DeepMC-iNABP: Deep learning for multiclass identification and classification of nucleic acid-binding proteins. Comput Struct Biotechnol J 2022; 20:2020-2028. [PMID: 35521556 PMCID: PMC9065708 DOI: 10.1016/j.csbj.2022.04.029] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 02/08/2022] [Revised: 04/06/2022] [Accepted: 04/20/2022] [Indexed: 11/29/2022] Open
Abstract
Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play vital roles in gene expression. Accurate identification of these proteins is crucial. However, there are two existing challenges: one is the problem of ignoring DNA- and RNA-binding proteins (DRBPs), and the other is a cross-predicting problem referring to DBP predictors predicting DBPs as RBPs, and vice versa. In this study, we proposed a computational predictor, called DeepMC-iNABP, with the goal of solving these difficulties by utilizing a multiclass classification strategy and deep learning approaches. DBPs, RBPs, DRBPs and non-NABPs as separate classes of data were used for training the DeepMC-iNABP model. The results on test data collected in this study and two independent test datasets showed that DeepMC-iNABP has a strong advantage in identifying the DRBPs and has the ability to alleviate the cross-prediction problem to a certain extent. The web-server of DeepMC-iNABP is freely available at http://www.deepmc-inabp.net/. The datasets used in this research can also be downloaded from the website.
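The four-class formulation described above can be made concrete with a short sketch. It maps two binding annotations to one of the four mutually exclusive classes named in the abstract (DBP, RBP, DRBP, non-NABP), so a protein that binds both DNA and RNA receives its own label instead of being forced into one binary predictor's positive set. The function name and boolean inputs are hypothetical, and this shows only the labeling scheme, not the DeepMC-iNABP model.

```python
def nabp_class(binds_dna: bool, binds_rna: bool) -> str:
    """Map two binding annotations to one of four exclusive classes."""
    if binds_dna and binds_rna:
        return "DRBP"      # binds both DNA and RNA
    if binds_dna:
        return "DBP"       # DNA-binding only
    if binds_rna:
        return "RBP"       # RNA-binding only
    return "non-NABP"      # binds neither

# Enumerate all four annotation combinations.
labels = [nabp_class(d, r) for d in (False, True) for r in (False, True)]
```

Because the classes are exclusive, a single multiclass classifier has to decide between DBP and RBP directly, which is where the cross-prediction problem mentioned in the abstract arises for separate binary predictors.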
Affiliation(s)
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Shuang Li
  - Beidahuang Industry Group General Hospital, Harbin 150001, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Miaomiao Sui
  - Graduate School Agricultural and Life Science, The University of Tokyo, Tokyo 1138657, Japan
- Chen Cao
  - School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China
- Abd El-Latif Hesham
  - Genetics Department, Faculty of Agriculture, Beni-Suef University, Beni-Suef 62511, Egypt
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
  - Corresponding author at: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China.
38
Madani M, Lin K, Tarakanova A. DSResSol: A Sequence-Based Solubility Predictor Created with Dilated Squeeze Excitation Residual Networks. Int J Mol Sci 2021; 22:13555. [PMID: 34948354 PMCID: PMC8704505 DOI: 10.3390/ijms222413555] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/29/2021] [Revised: 12/13/2021] [Accepted: 12/14/2021] [Indexed: 11/16/2022] Open
Abstract
Protein solubility is an important thermodynamic parameter that is critical for the characterization of a protein's function, and a key determinant for the production yield of a protein in both the research setting and within industrial (e.g., pharmaceutical) applications. Experimental approaches to predict protein solubility are costly, time-consuming, and frequently offer only low success rates. To reduce cost and expedite the development of therapeutic and industrially relevant proteins, a highly accurate computational tool for predicting protein solubility from protein sequence is sought. While a number of in silico prediction tools exist, they suffer from relatively low prediction accuracy, bias toward the soluble proteins, and limited applicability for various classes of proteins. In this study, we developed a novel deep learning sequence-based solubility predictor, DSResSol, that takes advantage of the integration of squeeze excitation residual networks with dilated convolutional neural networks and outperforms all existing protein solubility prediction models. This model captures the frequently occurring amino acid k-mers and their local and global interactions and highlights the importance of identifying long-range interaction information between amino acid k-mers to achieve improved accuracy, using only protein sequence as input. DSResSol outperforms all available sequence-based solubility predictors by at least 5% in terms of accuracy when evaluated by two different independent test sets. Compared to existing predictors, DSResSol not only reduces prediction bias for insoluble proteins but also predicts soluble proteins within the test sets with an accuracy that is at least 13% higher than existing models. We derive the key amino acids, dipeptides, and tripeptides contributing to protein solubility, identifying glutamic acid and serine as critical amino acids for protein solubility prediction. 
Overall, DSResSol can be used for the fast, reliable, and inexpensive prediction of a protein's solubility to guide experimental design.
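For readers unfamiliar with the term, an amino acid k-mer is a contiguous window of k residues; dipeptides and tripeptides are the k = 2 and k = 3 cases. The sketch below counts k-mers with a sliding window. It only illustrates what a k-mer is: DSResSol captures k-mers with convolutional layers rather than explicit counting, and the function name and example sequence are hypothetical.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every contiguous window of k residues in a protein sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Hypothetical short sequence rich in serine (S) and glutamic acid (E).
seq = "MSESES"
dipeptides = kmer_counts(seq, 2)
tripeptides = kmer_counts(seq, 3)
```

A sequence of length n yields n - k + 1 windows, so the six-residue example produces five dipeptides and four tripeptides.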
Affiliation(s)
- Mohammad Madani
  - Department of Mechanical Engineering, University of Connecticut, Storrs, CT 06269, USA
  - Department of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269, USA
- Kaixiang Lin
  - Department of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269, USA
- Anna Tarakanova
  - Department of Mechanical Engineering, University of Connecticut, Storrs, CT 06269, USA
  - Department of Biomedical Engineering, University of Connecticut, Storrs, CT 06269, USA
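The DSResSol abstract above refers to "frequently occurring amino acid k-mers". As a rough illustration of what a k-mer is, the sketch below counts overlapping k-mers in a sequence and returns their relative frequencies. It is not the DSResSol model, which learns such patterns through convolutional layers; the function name `kmer_frequencies` and the toy sequence are hypothetical.

```python
from collections import Counter


def kmer_frequencies(sequence, k=3):
    """Return relative frequencies of overlapping k-mers in a protein sequence."""
    sequence = sequence.upper()
    # Slide a window of width k over the sequence, one residue at a time.
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k: no k-mers to count.
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical toy sequence, not taken from any DSResSol dataset.
dipeptides = kmer_frequencies("MSESESKK", k=2)
for kmer, freq in sorted(dipeptides.items(), key=lambda item: -item[1]):
    print(kmer, round(freq, 3))
```

Counting of this kind is only a baseline view of sequence composition; it ignores the long-range interactions between k-mers that the abstract highlights.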
|
39
|
Villegas-Morcillo A, Gomez AM, Morales-Cordovilla JA, Sanchez V. Protein Fold Recognition From Sequences Using Convolutional and Recurrent Neural Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2848-2854. [PMID: 32750896 DOI: 10.1109/tcbb.2020.3012732] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/11/2023]
Abstract
The identification of a protein fold type from its amino acid sequence provides important insights about the protein 3D structure. In this paper, we propose a deep learning architecture that can process protein residue-level features to address the protein fold recognition task. Our neural network model combines 1D-convolutional layers with gated recurrent unit (GRU) layers. The GRU cells, as recurrent layers, cope with the processing issues associated with the highly variable protein sequence lengths and so extract a fold-related embedding of fixed size for each protein domain. These embeddings are then used to perform the pairwise fold recognition task, which is based on transferring the fold type of the most similar template structure. We compare our model with several state-of-the-art template-based and deep learning-based methods. The evaluation results over the well-known LINDAHL and SCOP_TEST sets, along with a proposed LINDAHL test set updated to SCOP 1.75, show that our embeddings perform significantly better than these methods, especially at the fold level. Supplementary material can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2020.3012732; source code and trained models are available at http://sigmat.ugr.es/~amelia/CNN-GRU-RF+/.
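The pairwise fold recognition step described in this abstract, transferring the fold type of the most similar template, can be sketched as a nearest-neighbour lookup over fixed-size embeddings. The sketch below assumes cosine similarity as the similarity score and uses made-up three-dimensional embeddings and fold labels; the function names are hypothetical and this is not the authors' released code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def transfer_fold(query_embedding, templates):
    """Return (fold_label, score) of the template most similar to the query.

    templates: list of (fold_label, embedding) pairs with known fold types.
    """
    best_label, best_score = None, -math.inf
    for label, embedding in templates:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical embeddings and labels, for illustration only.
templates = [
    ("fold_A", [1.0, 0.0, 0.0]),
    ("fold_B", [0.0, 1.0, 0.0]),
]
label, score = transfer_fold([0.9, 0.1, 0.0], templates)
print(label, round(score, 3))
```

In practice the embeddings would come from the trained network, and the similarity score could be replaced or combined with other pairwise features.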
|
40
|
Sandaruwan PD, Wannige CT. An improved deep learning model for hierarchical classification of protein families. PLoS One 2021; 16:e0258625. [PMID: 34669708 PMCID: PMC8528337 DOI: 10.1371/journal.pone.0258625] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/12/2020] [Accepted: 10/01/2021] [Indexed: 12/28/2022] Open
Abstract
Although genes carry information, proteins play the main role in providing all the functionalities of a living organism. Massive numbers of different proteins are involved in every function that occurs in a cell. These amino acid sequences can be hierarchically classified into a set of families and subfamilies depending on their evolutionary relatedness and similarities in their structure or function. Protein characterization to identify protein structure and function is done accurately using laboratory experiments. With the rapidly increasing number of novel protein sequences, these experiments have become difficult to carry out since they are expensive, time-consuming, and laborious. Therefore, many computational classification methods have been introduced to classify proteins and predict their functional properties. As the performance of computational techniques has progressed, deep learning has come to play a key role in many areas. Novel deep learning models such as DeepFam and ProtCNN have recently been presented to classify proteins into their families. However, these deep learning models have been used to carry out the non-hierarchical classification of proteins. In this research, we propose a deep learning neural network model named DeepHiFam with high accuracy to classify proteins hierarchically into different levels simultaneously. The model achieved an accuracy of 98.38% for protein family classification and more than 80% accuracy for the classification of protein subfamilies and sub-subfamilies. Further, DeepHiFam performed well in the non-hierarchical classification of protein families and achieved an accuracy of 98.62% and 96.14% for the popular Pfam dataset and COG dataset respectively.
|
41
|
Villegas-Morcillo A, Sanchez V, Gomez AM. FoldHSphere: deep hyperspherical embeddings for protein fold recognition. BMC Bioinformatics 2021; 22:490. [PMID: 34641786 PMCID: PMC8507389 DOI: 10.1186/s12859-021-04419-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/12/2021] [Accepted: 09/29/2021] [Indexed: 12/01/2022] Open
Abstract
Background: Current state-of-the-art deep learning approaches for protein fold recognition learn protein embeddings that improve prediction performance at the fold level. However, there still exists a performance gap between the fold level and the (relatively easier) family level, suggesting that it might be possible to learn an embedding space that better represents the protein folds. Results: In this paper, we propose the FoldHSphere method to learn a better fold embedding space through a two-stage training procedure. We first obtain prototype vectors for each fold class that are maximally separated in hyperspherical space. We then train a neural network by minimizing the angular large margin cosine loss to learn protein embeddings clustered around the corresponding hyperspherical fold prototypes. Our network architectures, ResCNN-GRU and ResCNN-BGRU, process the input protein sequences by applying several residual-convolutional blocks followed by a gated recurrent unit-based recurrent layer. Evaluation results on the LINDAHL dataset indicate that the use of our hyperspherical embeddings effectively bridges the performance gap at the family and fold levels. Furthermore, our FoldHSpherePro ensemble method yields an accuracy of 81.3% at the fold level, outperforming all the state-of-the-art methods. Conclusions: Our methodology is efficient in learning discriminative and fold-representative embeddings for the protein domains. The proposed hyperspherical embeddings are effective at identifying the protein fold class by pairwise comparison, even when amino acid sequence similarities are low. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04419-7.
Affiliation(s)
- Amelia Villegas-Morcillo
  - Department of Signal Theory, Telematics and Communications, University of Granada, Periodista Daniel Saucedo Aranda, 18071, Granada, Spain
- Victoria Sanchez
  - Department of Signal Theory, Telematics and Communications, University of Granada, Periodista Daniel Saucedo Aranda, 18071, Granada, Spain
- Angel M Gomez
  - Department of Signal Theory, Telematics and Communications, University of Granada, Periodista Daniel Saucedo Aranda, 18071, Granada, Spain
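The FoldHSphere abstract above mentions training with a large margin cosine loss against fold prototypes. A minimal sketch of that loss for a single example, in the form commonly published for large margin cosine loss (subtract a margin from the target-class cosine, scale, then apply softmax cross-entropy), is given below. The `scale` and `margin` values and the toy cosines are illustrative assumptions, not the settings reported by the authors.

```python
import math


def large_margin_cosine_loss(cosines, target, scale=30.0, margin=0.35):
    """Large margin cosine loss for one example.

    cosines: cosine similarity between the embedding and each class prototype.
    target:  index of the true class.
    scale, margin: illustrative hyperparameters (assumed, not from the paper).
    """
    # Penalize only the target class by subtracting the margin from its cosine.
    logits = [
        scale * (c - margin) if j == target else scale * c
        for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp, then cross-entropy for the target class.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


# Hypothetical cosines to three fold prototypes; class 0 is the true fold.
toy_cosines = [0.8, 0.1, -0.2]
with_margin = large_margin_cosine_loss(toy_cosines, target=0)
without_margin = large_margin_cosine_loss(toy_cosines, target=0, margin=0.0)
print(with_margin > without_margin)
```

The margin makes the loss larger for the same embedding, so the network has to pull embeddings closer to their own prototype than plain softmax would require.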
|
42
|
Carrillo-Cabada H, Benson J, Razavi AM, Mulligan B, Cuendet MA, Weinstein H, Taufer M, Estrada T. A Graphic Encoding Method for Quantitative Classification of Protein Structure and Representation of Conformational Changes. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1336-1349. [PMID: 31603792 PMCID: PMC9119144 DOI: 10.1109/tcbb.2019.2945291] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/10/2023]
Abstract
In order to successfully predict a protein's function throughout its trajectory, in addition to uncovering changes in its conformational state, it is necessary to employ techniques that maintain its 3D information while performing at scale. We extend a protein representation that encodes secondary and tertiary structure into fixed-size color images, and a neural network architecture (called GEM-net) that leverages our encoded representation. We show the applicability of our method in two ways: (1) performing protein function prediction, reaching accuracy between 78 and 83 percent, and (2) visualizing and detecting conformational changes in protein trajectories during molecular dynamics simulations.
|
43
|
Zhang J, Chen Q, Liu B. DeepDRBP-2L: A New Genome Annotation Predictor for Identifying DNA-Binding Proteins and RNA-Binding Proteins Using Convolutional Neural Network and Long Short-Term Memory. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1451-1463. [PMID: 31722485 DOI: 10.1109/tcbb.2019.2952338] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Indexed: 06/10/2023]
Abstract
DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) are two kinds of crucial proteins, which are associated with various cellular activities and some important diseases. Accurate identification of DBPs and RBPs facilitates both theoretical research and real-world application. Existing sequence-based DBP predictors can accurately identify DBPs but incorrectly predict many RBPs as DBPs, and vice versa, resulting in low prediction precision. Moreover, some proteins (DRBPs) interacting with both DNA and RNA play important roles in gene expression and cannot be identified by existing computational methods. In this study, a two-level predictor named DeepDRBP-2L was proposed by combining a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). It is the first computational method that is able to identify DBPs, RBPs and DRBPs. Rigorous cross-validations and independent tests showed that DeepDRBP-2L is able to overcome the shortcomings of the existing methods and can go one step further to identify DRBPs. Application of DeepDRBP-2L to the tomato genome further demonstrated its performance. The webserver of DeepDRBP-2L is freely available at http://bliulab.net/DeepDRBP-2L.
|
44
|
Sanner MF, Dieguez L, Forli S, Lis E. Improving Docking Power for Short Peptides Using Random Forest. J Chem Inf Model 2021; 61:3074-3090. [PMID: 34124893 PMCID: PMC8543977 DOI: 10.1021/acs.jcim.1c00573] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/19/2022]
Abstract
In recent years, therapeutic peptides have gained a lot of interest, as demonstrated by the 60 peptides approved as drugs in major markets and 150+ peptides currently in clinical trials. However, while small molecule docking is routinely used in rational drug design efforts, docking peptides has proven challenging, partly because docking scoring functions, developed and calibrated for small molecules, perform poorly for these molecules. Here, we present random forest classifiers trained to discriminate correctly docked peptides. We show that, for a testing set of 47 protein-peptide complexes, structurally dissimilar from the training set and previously used to benchmark AutoDock Vina's ability to dock short peptides, these random forest classifiers improve docking power from ∼25% for AutoDock scoring functions to an average of ∼70%. These results pave the way for peptide-docking success rates comparable to those of small molecule docking. To develop these classifiers, we compiled the ProptPep37_2021 data set, a curated, high-quality set of 322 crystallographic protein-peptide complexes annotated with structural similarity information. The data set also provides a collection of high-quality putative poses with a range of deviations from the crystallographic pose, providing correct and incorrect poses (i.e., decoys) of the peptide for each entry. The ProptPep37_2021 data set as well as the classifiers presented here are freely available.
Affiliation(s)
- Michel F. Sanner
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 93037, USA
- Leonard Dieguez
  - Koliber Biosciences Inc., 12265 World Trade Drive, Suite G, San Diego, CA 92128, USA
- Stefano Forli
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 93037, USA
- Ewa Lis
  - Koliber Biosciences Inc., 12265 World Trade Drive, Suite G, San Diego, CA 92128, USA
|
45
|
Li G, Zrimec J, Ji B, Geng J, Larsbrink J, Zelezniak A, Nielsen J, Engqvist MK. Performance of Regression Models as a Function of Experiment Noise. Bioinform Biol Insights 2021; 15:11779322211020315. [PMID: 34262264 PMCID: PMC8243133 DOI: 10.1177/11779322211020315] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/15/2021] [Accepted: 04/29/2021] [Indexed: 11/21/2022] Open
Abstract
Background: A challenge in developing machine learning regression models is that it is difficult to know whether maximal performance has been reached on the test dataset, or whether further model improvement is possible. In biology, this problem is particularly pronounced as sample labels (response variables) are typically obtained through experiments and therefore have experiment noise associated with them. Such label noise puts a fundamental limit to the metrics of performance attainable by regression models on the test dataset. Results: We address this challenge by deriving an expected upper bound for the coefficient of determination (R²) for regression models when tested on the holdout dataset. This upper bound depends only on the noise associated with the response variable in a dataset as well as its variance. The upper bound estimate was validated via Monte Carlo simulations and then used as a tool to bootstrap performance of regression models trained on biological datasets, including protein sequence data, transcriptomic data, and genomic data. Conclusions: The new method for estimating upper bounds for model performance on test data should aid researchers in developing ML regression models that reach their maximum potential. Although we study biological datasets in this work, the new upper bound estimates will hold true for regression models from any research field or application area where response variables have associated noise.
Affiliation(s)
- Gang Li
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Jan Zrimec
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Boyang Ji
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kgs. Lyngby, Denmark
- Jun Geng
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Johan Larsbrink
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Aleksej Zelezniak
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Science for Life Laboratory, Stockholm, Sweden
- Jens Nielsen
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kgs. Lyngby, Denmark
  - BioInnovation Institute, Copenhagen N, Denmark
- Martin Km Engqvist
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
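The idea in the abstract above, that label noise caps the R² any regression model can reach on held-out data, can be illustrated with a small Monte Carlo simulation. The sketch below is not the authors' derivation: it assumes additive, independent Gaussian noise and scores a hypothetical "perfect" model that predicts the noise-free signal exactly. Under those assumptions the expected R² is about 1 − σ²_noise / σ²_observed.

```python
import random


def r_squared(y_observed, y_predicted):
    """Coefficient of determination of predictions against observed labels."""
    mean_y = sum(y_observed) / len(y_observed)
    ss_res = sum((yo - yp) ** 2 for yo, yp in zip(y_observed, y_predicted))
    ss_tot = sum((yo - mean_y) ** 2 for yo in y_observed)
    return 1.0 - ss_res / ss_tot


def simulate_perfect_model(n=20000, signal_sd=1.0, noise_sd=0.5, seed=0):
    """Return (achieved R^2, expected ceiling) for a model that knows the true signal."""
    rng = random.Random(seed)
    signal = [rng.gauss(0.0, signal_sd) for _ in range(n)]
    # Observed labels are the true signal plus independent measurement noise.
    observed = [s + rng.gauss(0.0, noise_sd) for s in signal]
    achieved = r_squared(observed, signal)
    # Variance of the observed labels is signal variance plus noise variance.
    expected = 1.0 - noise_sd ** 2 / (signal_sd ** 2 + noise_sd ** 2)
    return achieved, expected


achieved, expected = simulate_perfect_model()
print(round(achieved, 3), round(expected, 3))
```

Even a model with no modelling error stays below R² = 1 here, which is why a noise-aware ceiling is a more useful target than 1.0 when judging models on experimental data.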
|
46
|
Jin X, Liao Q, Liu B. S2L-PSIBLAST: a supervised two-layer search framework based on PSI-BLAST for protein remote homology detection. Bioinformatics 2021; 37:4321-4327. [PMID: 34170287 DOI: 10.1093/bioinformatics/btab472] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/21/2020] [Revised: 05/29/2021] [Accepted: 06/24/2021] [Indexed: 01/26/2023] Open
Abstract
MOTIVATION: Protein remote homology detection is a challenging task in the study of protein evolutionary relationships. PSI-BLAST is an important and fundamental search method for detecting homologous proteins. Although many improved versions of PSI-BLAST have been proposed, their performance is limited by the search processes of PSI-BLAST. RESULTS: To further improve the performance of PSI-BLAST for protein remote homology detection, a supervised two-layer search framework based on PSI-BLAST (S2L-PSIBLAST) is proposed. S2L-PSIBLAST consists of a two-level search: the first-level search provides high-quality search results by using the SMI-BLAST framework and a double-link strategy to filter out non-homologous protein sequences; the second-level search detects more homologous proteins by profile-link similarity, and more accurate ranking lists for the detected protein sequences are obtained by a learning-to-rank strategy. Experimental results on the updated version of the Structural Classification of Proteins-extended benchmark dataset show that S2L-PSIBLAST not only clearly improves the performance of PSI-BLAST, but also achieves better performance than two improved versions of PSI-BLAST: DELTA-BLAST and PSI-BLASTexB. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaopeng Jin
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Qing Liao
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Bin Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
|
47
|
Jiang H, Fan X. The Two-Step Clustering Approach for Metastable States Learning. Int J Mol Sci 2021; 22:6576. [PMID: 34205252 PMCID: PMC8233889 DOI: 10.3390/ijms22126576] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/31/2021] [Revised: 06/14/2021] [Accepted: 06/14/2021] [Indexed: 01/20/2023] Open
Abstract
Understanding the energy landscape and the conformational dynamics is crucial for studying many biological or chemical processes, such as protein-protein interaction and RNA folding. Molecular Dynamics (MD) simulations have been a major source of dynamic structure. Although many methods have been proposed for learning metastable states from MD data, some key problems are still in need of further investigation. Here, we give a brief review of recent progress in this field, with an emphasis on some popular methods belonging to a two-step clustering framework, and hope to draw more researchers to contribute to this area.
Affiliation(s)
- Hangjin Jiang
  - Center for Data Science, Zhejiang University, Hangzhou 310058, China
- Xiaodan Fan
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong, China
|
48
|
Sequence-Based Prediction of Transmembrane Protein Crystallization Propensity. Interdiscip Sci 2021; 13:693-702. [PMID: 34143353 DOI: 10.1007/s12539-021-00448-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2020] [Revised: 05/31/2021] [Accepted: 06/04/2021] [Indexed: 10/21/2022]
Abstract
Transmembrane proteins play a vital role in cell life activities. There are several techniques to determine transmembrane protein structures, and X-ray crystallography is the primary methodology. However, due to the special properties of transmembrane proteins, it is still hard to determine their structures by X-ray crystallography. To reduce experimental cost and improve experimental efficiency, it is of great significance to develop computational methods for predicting the crystallization propensity of transmembrane proteins. In this work, we proposed a sequence-based machine learning method, namely Prediction of TransMembrane protein Crystallization propensity (PTMC), to predict the propensity of transmembrane protein crystallization. First, we obtained several general sequence features and the specific encoded features of relative solvent accessibility and hydrophobicity. Second, feature selection was employed to filter out redundant and irrelevant features, and the optimal feature subset is composed of hydrophobicity, amino acid composition and relative solvent accessibility. Finally, we chose extreme gradient boosting after comparing it with several other machine learning methods. Comparative results on the independent test set indicate that PTMC outperforms state-of-the-art sequence-based methods in terms of sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC) and area under the receiver operating characteristic curve (AUC). In comparison with two competitors, Bcrystal and TMCrys, PTMC achieves an improvement of 0.132 and 0.179 for sensitivity, 0.014 and 0.127 for specificity, 0.037 and 0.192 for accuracy, 0.128 and 0.362 for MCC, and 0.027 and 0.125 for AUC, respectively.
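The threshold-based metrics named in this abstract (sensitivity, specificity, accuracy and MCC) all follow from the four cells of a binary confusion matrix. The sketch below computes them using their standard definitions; the counts are hypothetical and are not results from the PTMC paper, and the function assumes both classes are present in the test set. AUC is omitted because it needs ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    Assumes at least one positive and one negative example in the test set.
    """
    sensitivity = tp / (tp + fn)   # fraction of true positives recovered
    specificity = tn / (tn + fp)   # fraction of true negatives recovered
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
    print(name, round(value, 3))
```

Reporting MCC alongside accuracy is useful on imbalanced test sets, because MCC uses all four cells of the confusion matrix.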
|
49
|
New evaluation methods of read mapping by 17 aligners on simulated and empirical NGS data: an updated comparison of DNA- and RNA-Seq data from Illumina and Ion Torrent technologies. Neural Comput Appl 2021; 33:15669-15692. [PMID: 34155424 PMCID: PMC8208613 DOI: 10.1007/s00521-021-06188-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/27/2020] [Accepted: 06/02/2021] [Indexed: 12/13/2022]
Abstract
During the last 15 years, improved omics sequencing technologies have expanded the scale and resolution of various biological applications, generating high-throughput datasets that require carefully chosen software tools to be processed. Therefore, following the development of sequencing, bioinformatics researchers have been challenged to implement alignment algorithms for next-generation sequencing reads. However, the selection of aligners based on genome characteristics is still poorly studied, so our benchmarking study extended the state of the art by comparing 17 different aligners. The chosen tools were assessed on empirical human DNA- and RNA-Seq data, as well as on simulated datasets in human and mouse, evaluating a set of parameters not previously considered in this kind of benchmark. As expected, we found that each tool was the best in specific conditions. For Ion Torrent single-end RNA-Seq samples, the most suitable aligners were CLC and BWA-MEM, which reached the best results in terms of efficiency, accuracy, duplication rate, saturation profile and running time. For Illumina paired-end osteomyelitis transcriptomics data, instead, the best-performing algorithm, together with the already cited CLC, was Novoalign, which excelled in accuracy and saturation analyses. Segemehl and DNASTAR performed the best on both DNA-Seq datasets, with Segemehl particularly suitable for exome data. In conclusion, our study could guide users in the selection of a suitable aligner based on genome and transcriptome characteristics. However, several other aspects that emerged from our work should be considered in the evolution of the alignment research area, such as the involvement of artificial intelligence to support cloud computing and mapping to multiple genomes.
|
50
|
Osadchy M, Kolodny R. How Deep Learning Tools Can Help Protein Engineers Find Good Sequences. J Phys Chem B 2021; 125:6440-6450. [PMID: 34105961 DOI: 10.1021/acs.jpcb.1c02449] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/16/2022]
Abstract
The deep learning revolution introduced a new and efficacious way to address computational challenges in a wide range of fields, relying on large data sets and powerful computational resources. In protein engineering, we consider the challenge of computationally predicting properties of a protein and designing sequences with these properties. Indeed, accurate and fast deep network oracles for different properties of proteins have been developed. These learn to predict a property from an amino acid sequence by training on large sets of proteins that have this property. In particular, deep networks can learn from the set of all known protein sequences to identify ones that are protein-like. A fundamental challenge when engineering sequences that are both protein-like and satisfy a desired property is that these are rare instances within the vast space of all possible ones. When searching for these very rare instances, one would like to use good sampling procedures. Sampling approaches that are decoupled from the prediction of the property or in which the predictor uses only post-sampling to identify good instances are less efficient. The alternative is to use sampling methods that are geared to generate sequences satisfying and/or optimizing the predictor's desired properties. Deep learning has a class of architectures, denoted as generative models, which offer the capability of sampling from the learned distribution of a predicted property. Here, we review the use of deep learning tools to find good sequences for protein engineering, including developing oracles/predictors of a property of the proteins and methods that sample from a distribution of protein-like sequences to optimize the desired property.
Affiliation(s)
- Margarita Osadchy
  - Department of Computer Science, Jacobs Building, University of Haifa, 199 Aba Houshi Road, Mount Carmel, Haifa, Israel 3498838
- Rachel Kolodny
  - Department of Computer Science, Jacobs Building, University of Haifa, 199 Aba Houshi Road, Mount Carmel, Haifa, Israel 3498838
|