1
Mao Y, Xu W, Shun Y, Chai L, Xue L, Yang Y, Li M. A multimodal model for protein function prediction. Sci Rep 2025; 15:10465. [PMID: 40140535] [PMCID: PMC11947276] [DOI: 10.1038/s41598-025-94612-y]
Abstract
Protein function, which is determined by sequence, structure, and other characteristics, plays a crucial role in an organism's performance. Existing protein function prediction methods mainly rely on sequence data and often ignore structural properties that are crucial for accurate prediction. Protein structure provides richer spatial and functional insights, which can significantly improve prediction accuracy. In this work, we propose a multi-modal protein function prediction model (MMPFP) that integrates protein sequence and structure information through the use of GCN, CNN, and Transformer models. We validate the model using the PDBest dataset, demonstrating that MMPFP outperforms traditional single-modal models in the molecular function (MF), biological process (BP), and cellular component (CC) prediction tasks. Specifically, MMPFP achieved AUPR scores of 0.693, 0.355, and 0.478; [Formula: see text] scores of 0.752, 0.629, and 0.691; and [Formula: see text] scores of 0.336, 0.488, and 0.459, showing a 3-5% improvement over single-modal models. Additionally, ablation studies confirm the effectiveness of the Transformer module within the GCN branch, further validating MMPFP's superior performance over existing methods. This multi-modal approach offers a more accurate and comprehensive framework for protein function prediction, addressing key limitations of current models.
Affiliation(s)
- Yu Mao, WenHui Xu, Yue Shun, LongXin Chai, Lei Xue, Yong Yang, Mei Li
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China

2
Kumar V, Deepak A, Ranjan A, Prakash A. CrossPredGO: A Novel Light-Weight Cross-Modal Multi-Attention Framework for Protein Function Prediction. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:1709-1720. [PMID: 38843056] [DOI: 10.1109/tcbb.2024.3410696]
Abstract
Proteins are represented in various ways, each contributing differently to protein-related tasks. Here, information from each representation (protein sequence, 3D structure, and interaction data) is combined for efficient protein function prediction. Recently, uni-modal approaches have produced promising results with state-of-the-art attention mechanisms that learn the relative importance of features, whereas multi-modal approaches have done so by simply concatenating features obtained from the different representations, which increases the overall number of trainable parameters. In this paper, we propose a novel, light-weight cross-modal multi-attention (CrMoMulAtt) mechanism that captures the relative contribution of each modality with a lower number of trainable parameters. The proposed mechanism shows a higher contribution from PPI data and a lower contribution from structure data. The results obtained with the proposed CrossPredGO mechanism demonstrate improvements in the range of +(3.29 to 7.20)% with at most 31% fewer trainable parameters compared with DeepGO and MultiPredGO.
3
Kumar V, Deepak A, Ranjan A, Prakash A. Bi-SeqCNN: A Novel Light-Weight Bi-Directional CNN Architecture for Protein Function Prediction. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:1922-1933. [PMID: 38990747] [DOI: 10.1109/tcbb.2024.3426491]
Abstract
Deep learning approaches, such as convolutional neural networks (CNNs) and deep recurrent neural networks (RNNs), have been the backbone of protein function prediction, with promising state-of-the-art (SOTA) results. RNNs offer a strong sequential processing mechanism through their in-built ability to (i) focus on past information, (ii) capture both short- and long-range dependencies, and (iii) process sequences bi-directionally. CNNs, by contrast, are confined to short-term information from both the past and the future, although they offer parallelism. Therefore, a novel bi-directional CNN that strictly complies with the sequential processing mechanism of RNNs is introduced and used to develop Bi-SeqCNN, a sub-sequence-based protein function prediction framework. Further, an ensemble variant of Bi-SeqCNN is employed to improve the prediction results. To our knowledge, this is the first time bi-directional CNNs have been employed for general temporal data analysis, not just protein sequences. The proposed architecture produces improvements of up to +5.5% over contemporary SOTA methods on three benchmark protein sequence datasets. Moreover, it is substantially lighter, attaining these results with 0.50-0.70 times as many parameters as the SOTA methods.
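The core idea of a bi-directional CNN that mimics an RNN's two passes (one view restricted to the past, one to the future) can be sketched with plain causal convolutions. This is an illustrative reconstruction under stated assumptions, not the authors' Bi-SeqCNN code; the kernels and inputs are toy values:

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D convolution where output[t] sees only x[t-k+1..t] (the past)."""
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])  # left-pad so no future leaks in
    return np.array([xp[t:t + k] @ w for t in range(len(x))])

def bi_causal_conv1d(x, w_fwd, w_bwd):
    """Bi-directional causal conv: one pass over the past and one over the
    reversed sequence (i.e., the future), like a bi-directional RNN."""
    fwd = causal_conv1d(x, w_fwd)
    bwd = causal_conv1d(x[::-1], w_bwd)[::-1]  # reverse, convolve, reverse back
    return np.stack([fwd, bwd], axis=1)        # per-position [past-view, future-view]

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 0.0])        # toy kernel that picks the neighboring element
out = bi_causal_conv1d(x, w, w)
# out[:, 0] depends only on earlier positions, out[:, 1] only on later ones
```

Stacking such layers (and concatenating the two directional views) gives a CNN that, unlike a vanilla convolution, never mixes past and future within a single pass.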
4
Taha K. Employing Machine Learning Techniques to Detect Protein Function: A Survey, Experimental, and Empirical Evaluations. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:1965-1986. [PMID: 39008392] [DOI: 10.1109/tcbb.2024.3427381]
Abstract
This review article delves deeply into the various machine learning (ML) methods and algorithms employed in discerning protein functions. Each method discussed is assessed for its efficacy, limitations, potential improvements, and future prospects. We present an innovative hierarchical classification system that arranges algorithms into intricate categories and unique techniques. This taxonomy is based on a tri-level hierarchy, starting with the methodology category and narrowing down to specific techniques. Such a framework allows for a structured and comprehensive classification of algorithms, assisting researchers in understanding the interrelationships among diverse algorithms and techniques. The study incorporates both empirical and experimental evaluations to differentiate between the techniques. The empirical evaluation ranks the techniques based on four criteria. The experimental assessments rank: (1) individual techniques under the same methodology sub-category, (2) different sub-categories within the same category, and (3) the broad categories themselves. Integrating the innovative methodological classification, empirical findings, and experimental assessments, the article offers a well-rounded understanding of ML strategies in protein function identification. The paper also explores techniques for multi-task and multi-label detection of protein functions, in addition to focusing on single-task methods. Moreover, the paper sheds light on the future avenues of ML in protein function determination.
5
Sharma L, Deepak A, Ranjan A, Krishnasamy G. A CNN-CBAM-BIGRU model for protein function prediction. Stat Appl Genet Mol Biol 2024; 23:sagmb-2024-0004. [PMID: 38943434] [DOI: 10.1515/sagmb-2024-0004]
Abstract
Understanding a protein's function based solely on its amino acid sequence is a crucial but intricate task in bioinformatics that has traditionally proven difficult. Recent years, however, have witnessed the rise of deep learning as a powerful tool, achieving significant success in protein function prediction. Its strength lies in the ability to automatically learn informative features from protein sequences, which can then be used to predict the protein's function. This study builds upon these advancements by proposing a novel model, CNN-CBAM+BiGRU, which incorporates a Convolutional Block Attention Module (CBAM) alongside BiGRUs. CBAM acts as a spotlight, guiding the CNN to focus on the most informative parts of the protein data for more accurate feature extraction. BiGRUs, a type of recurrent neural network (RNN), excel at capturing long-range dependencies within the protein sequence, which are essential for accurate function prediction. The proposed model thus integrates the strengths of both CNN-CBAM and BiGRU. The study's findings, validated through experimentation, showcase the effectiveness of this combined approach. On the human dataset, the proposed method outperforms the CNN-BIGRU+ATT model by +1.0% for cellular components, +1.1% for molecular functions, and +0.5% for biological processes; on the yeast dataset, by +2.4%, +1.2%, and +0.6%, respectively.
Affiliation(s)
- Lavkush Sharma, Akshay Deepak
- Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Ashish Ranjan
- Department of Computer Science and Engineering, C.V. Raman Global University, Bhubaneswar, Odisha, India

6
Ranjan A, Fahad MS, Fernandez-Baca D, Tripathi S, Deepak A. MCWS-Transformers: Towards an Efficient Modeling of Protein Sequences via Multi Context-Window Based Scaled Self-Attention. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:1188-1199. [PMID: 35536815] [DOI: 10.1109/tcbb.2022.3173789]
Abstract
This paper advances the self-attention mechanism of the standard transformer network for the modeling of protein sequences. We introduce a novel context-window based scaled self-attention mechanism for processing protein sequences, based on the notions of (i) local context and (ii) larger contextual patterns, both of which are essential to building a good representation for protein sequences. The proposed mechanism is further used to build the multi context-window based scaled (MCWS) transformer network for protein function prediction at the protein sub-sequence level. Overall, the proposed MCWS transformer network produced improved predictive performance, outperforming existing state-of-the-art approaches by substantial margins. With respect to the standard transformer network, the proposed network improved the F1-score by +2.30% and +2.08% on the biological process (BP) and molecular function (MF) datasets, respectively. The corresponding improvements over the state-of-the-art ProtVecGen-Plus+ProtVecGen-Ensemble approach are +3.38% (BP) and +2.86% (MF). Equally important, robust performance was obtained across protein sequences of different lengths.
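Context-window based self-attention restricts each position to attend only within a local window before the usual scaled softmax. The minimal numpy sketch below illustrates that restriction; the single fixed window `w` is a simplification of the multi-window MCWS design, not the paper's implementation:

```python
import numpy as np

def window_mask(n, w):
    """Boolean mask: position i may attend to position j only if |i - j| <= w."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def windowed_self_attention(X, w):
    """Scaled dot-product self-attention restricted to a local context window.
    X is an (n, d) matrix of position features (queries = keys = values here)."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)                          # scaled dot products
    scores = np.where(window_mask(n, w), scores, -np.inf)  # mask out-of-window pairs
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)          # softmax over the window
    return weights @ X

X = np.eye(4)                         # 4 positions with one-hot toy features
out = windowed_self_attention(X, 1)   # each position mixes only its neighbors
```

Running several such windows of different sizes in parallel and merging their outputs captures both local context and larger contextual patterns.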
7
Ranjan A, Tiwari A, Deepak A. A Sub-Sequence Based Approach to Protein Function Prediction via Multi-Attention Based Multi-Aspect Network. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:94-105. [PMID: 34826296] [DOI: 10.1109/tcbb.2021.3130923]
Abstract
Inferring protein function(s) via protein sub-sequence classification is often obstructed by a lack of knowledge about the function(s) of sub-sequences within the protein sequence. In this regard, we develop a novel "multi-aspect" paradigm that performs sub-sequence classification efficiently by utilizing information from the parent sequence. The aspects are: (i) multi-label: independent labelling of sub-sequences with more than one function of the parent sequence, and (ii) label-relevance: scoring the parent functions to highlight how relevant a given function is to the sub-sequence. The multi-aspect paradigm is used to propose the "Multi-Attention Based Multi-Aspect Network" for classifying protein sub-sequences, where multi-attention is a novel approach to processing sub-sequences at word level. Building on this, the proposed Global-ProtEnc method is a sub-sequence-based approach to encoding protein sequences for protein function prediction, which is further developed into an ensemble method, Global-ProtEnc-Plus. Evaluations of both Global-ProtEnc and Global-ProtEnc-Plus on the benchmark CAFA3 dataset delivered outstanding performance. Compared to the state-of-the-art DeepGOPlus, Global-ProtEnc-Plus improves Fmax by +6.50% for the biological process and +1.90% for the cellular component.
8
Sharma L, Deepak A, Ranjan A, Krishnasamy G. A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction. Stat Appl Genet Mol Biol 2023; 22:sagmb-2022-0057. [PMID: 37658681] [DOI: 10.1515/sagmb-2022-0057]
Abstract
Proteins are the building blocks of all living things, and protein function must be ascertained if the molecular mechanism of life is to be understood. While CNNs are good at capturing short-term relationships, GRUs and LSTMs can capture long-term dependencies; a hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years in processing protein sequences. In this paper, we propose a hybrid CNN+BiGRU-Attention based model with protein language model embeddings that effectively combines the output of a CNN with the output of a BiGRU-Attention network for predicting protein functions. We evaluated the performance of the proposed hybrid model on human and yeast datasets. It improves the Fmax value over the state-of-the-art model SDN2GO by 1.9% for the cellular component, 3.8% for the molecular function, and 0.6% for the biological process prediction task on the human dataset, and by 2.4%, 5.2%, and 1.2%, respectively, on the yeast dataset.
Affiliation(s)
- Lavkush Sharma, Akshay Deepak
- Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Ashish Ranjan
- Department of Computer Science and Engineering, ITER, Siksha 'O' Anusandhan University (Deemed to be University), Bhubaneswar, Odisha, India

9
Ranjan A, Fernandez-Baca D, Tripathi S, Deepak A. An Ensemble Tf-Idf Based Approach to Protein Function Prediction via Sequence Segmentation. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:2685-2696. [PMID: 34185646] [DOI: 10.1109/tcbb.2021.3093060]
Abstract
This paper explores the use of variants of tf-idf-based descriptors, namely length-normalized-tf-idf and log-normalized-tf-idf, combined with a segmentation technique, for efficient modeling of variable-length protein sequences. The proposed solution, ProtVecGen-Ensemble, is an ensemble of three models trained on differently segmented datasets constructed from an input dataset of complete protein sequences. Evaluations on biological process (BP) and molecular function (MF) datasets demonstrate that the proposed feature set is not only superior to its contemporaries but also produces more consistent results across varying sequence lengths. Improvements of +6.07% (BP) and +7.56% (MF) over the state-of-the-art tf-idf-based MLDA feature set were obtained. The best results were achieved when ProtVecGen-Ensemble was combined with ProtVecGen-Plus (the state-of-the-art method for protein function prediction), resulting in improvements of +8.90% (BP) and +11.28% (MF) over MLDA and +1.49% (BP) and +2.07% (MF) over ProtVecGen-Plus+MLDA. To capture performance consistency with respect to sequence lengths, we define a variance-based metric, with lower values indicating better performance. On this metric, the proposed ProtVecGen-Ensemble+ProtVecGen-Plus framework yields reductions of 56.85% (BP) and 56.08% (MF) over MLDA and 10.37% (BP) and 26.48% (MF) over ProtVecGen-Plus+MLDA.
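Length-normalized tf-idf over protein sequences can be illustrated by treating overlapping k-mers as the "words" of each sequence: term frequency is divided by the sequence's term count so that short and long sequences are comparable. The sketch below is one plausible reading of the descriptor; the choice of `k=3` and the helper names are illustrative assumptions, not the paper's exact pipeline:

```python
import math
from collections import Counter

def kmer_terms(seq, k=3):
    """Overlapping k-mers of a sequence, treated as its terms."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def length_normalized_tfidf(seqs, k=3):
    """Length-normalized tf-idf vectors (as sparse dicts) for each sequence:
    tf(t, s) = count(t in s) / |terms(s)|, idf(t) = log(N / df(t))."""
    docs = [Counter(kmer_terms(s, k)) for s in seqs]
    n = len(docs)
    df = Counter(t for d in docs for t in d)   # document frequency per k-mer
    vecs = []
    for d in docs:
        total = sum(d.values())
        vecs.append({t: (c / total) * math.log(n / df[t]) for t, c in d.items()})
    return vecs

seqs = ["MKVLAA", "MKVLMK", "AAAAAA"]
vecs = length_normalized_tfidf(seqs)   # one sparse tf-idf vector per sequence
```

The log-normalized variant would replace the raw `c / total` term frequency with a dampened `log(1 + c)` form; segmentation then applies the same descriptor to fixed-length pieces of each sequence.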
10
Long short term memory based functional characterization model for unknown protein sequences using ensemble of shallow and deep features. Neural Comput Appl 2022. [DOI: 10.1007/s00521-021-06674-4]
11
Gao W, Mahajan SP, Sulam J, Gray JJ. Deep Learning in Protein Structural Modeling and Design. Patterns (N Y) 2020; 1:100142. [PMID: 33336200] [PMCID: PMC7733882] [DOI: 10.1016/j.patter.2020.100142]
Abstract
Deep learning is catalyzing a scientific revolution fueled by big data, accessible toolkits, and powerful computational resources, impacting many fields, including protein structural modeling. Protein structural modeling, such as predicting structure from amino acid sequence and evolutionary information, designing proteins toward desirable functionality, or predicting properties or behavior of a protein, is critical to understanding and engineering biological systems at the molecular level. In this review, we summarize the recent advances in applying deep learning techniques to tackle problems in protein structural modeling and design. We dissect the emerging approaches using deep learning techniques for protein structural modeling and discuss advances and challenges that must be addressed. We argue for the central importance of structure, following the "sequence → structure → function" paradigm. This review is directed at helping computational biologists gain familiarity with the deep learning methods applied in protein modeling, and computer scientists gain perspective on the biologically meaningful problems that may benefit from deep learning techniques.
Affiliation(s)
- Wenhao Gao, Sai Pooja Mahajan, Jeffrey J. Gray
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Jeremias Sulam
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA

12
Öztürk H, Özgür A, Schwaller P, Laino T, Ozkirimli E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov Today 2020; 25:689-705. [DOI: 10.1016/j.drudis.2020.01.020]