1
Chen S, Zheng P, Zheng L, Yao Q, Meng Z, Lin L, Chen X, Liu R. BERT-DomainAFP: Antifreeze protein recognition and classification model based on BERT and structural domain annotation. iScience 2025; 28:112077. [PMID: 40241758 PMCID: PMC12002629 DOI: 10.1016/j.isci.2025.112077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2024] [Revised: 01/03/2025] [Accepted: 02/17/2025] [Indexed: 04/18/2025] Open
Abstract
Antifreeze proteins (AFPs) are crucial for organisms to adapt to low temperatures, with applications in medicine, food storage, aquaculture, and agriculture. Accurate AFP identification is challenging due to structural and sequence diversity. To improve prediction and classification, we propose BERT-DomainAFP, a deep learning model trained on the AntiFreezeDomains dataset created with a novel annotation strategy. The model uses pre-trained ProteinBERT and incorporates oversampling and undersampling techniques to handle unbalanced data, ensuring high predictive ability. BERT-DomainAFP achieves 98.48% accuracy, the highest among existing models, and can classify different AFP types based on structural domain features. This model outperforms current tools, offering a promising solution for AFP recognition and classification in research and applications.
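The abstract mentions oversampling and undersampling for unbalanced data but does not say which variants were used. The following is a minimal sketch of the simplest form, random resampling to a fixed class size, and is not the authors' implementation. The function name `rebalance`, the `target_per_class` parameter, and the toy sequences are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def rebalance(samples, labels, target_per_class, seed=0):
    """Randomly undersample large classes and oversample small ones.

    Every class ends up with exactly `target_per_class` examples.
    (Hypothetical helper; the paper does not specify its resampling scheme.)
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    out_samples, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target_per_class:
            # Undersample: draw without replacement.
            chosen = rng.sample(members, target_per_class)
        else:
            # Oversample: keep all originals, then add duplicates drawn with replacement.
            extra = rng.choices(members, k=target_per_class - len(members))
            chosen = members + extra
        out_samples.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_samples, out_labels

# Toy data: 2 "AFP" sequences against 6 "non-AFP" sequences (made-up strings).
seqs = ["AAT", "GTA", "KLM", "PQR", "STV", "WYA", "CDE", "FGH"]
labs = ["AFP", "AFP"] + ["non-AFP"] * 6
balanced_seqs, balanced_labs = rebalance(seqs, labs, target_per_class=4)
```

Resampling of this kind is normally applied to the training split only, so that duplicated minority examples do not leak into the evaluation data.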
Affiliation(s)
- Shengzhen Chen
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Ping Zheng
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Lele Zheng
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Qinglong Yao
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Ziyu Meng
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Longshan Lin
  - Laboratory of Marine Biodiversity Research, Third Institute of Oceanography, Ministry of Natural Resources, Xiamen 361005, China
- Xinhua Chen
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Ruoyu Liu
  - State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
2
Vu TTD, Kim J, Jung J. An experimental analysis of graph representation learning for Gene Ontology based protein function prediction. PeerJ 2024; 12:e18509. [PMID: 39553733 PMCID: PMC11569786 DOI: 10.7717/peerj.18509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2024] [Accepted: 10/21/2024] [Indexed: 11/19/2024] Open
Abstract
Understanding protein function is crucial for deciphering biological systems and facilitating various biomedical applications. Computational methods for predicting Gene Ontology functions of proteins emerged in the 2000s to bridge the gap between the number of annotated proteins and the rapidly growing number of newly discovered amino acid sequences. Recently, there has been a surge in studies applying graph representation learning techniques to biological networks to enhance protein function prediction tools. In this review, we provide fundamental concepts in graph embedding algorithms. We describe graph representation learning methods for protein function prediction based on four principal data categories, namely PPI network, protein structure, Gene Ontology graph, and integrated graph. The commonly used approaches for each category are summarized and diagrammed, with the specific results of each method explained in detail. Finally, existing limitations and potential solutions are discussed, and directions for future research within the protein research community are suggested.
Affiliation(s)
- Thi Thuy Duong Vu
  - Faculty of Fundamental Sciences, University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Jeongho Kim
  - Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
- Jaehee Jung
  - Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
3
Taha K. Employing Machine Learning Techniques to Detect Protein Function: A Survey, Experimental, and Empirical Evaluations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1965-1986. [PMID: 39008392 DOI: 10.1109/tcbb.2024.3427381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
This review article delves deeply into the various machine learning (ML) methods and algorithms employed in discerning protein functions. Each method discussed is assessed for its efficacy, limitations, potential improvements, and future prospects. We present an innovative hierarchical classification system that arranges algorithms into intricate categories and unique techniques. This taxonomy is based on a tri-level hierarchy, starting with the methodology category and narrowing down to specific techniques. Such a framework allows for a structured and comprehensive classification of algorithms, assisting researchers in understanding the interrelationships among diverse algorithms and techniques. The study incorporates both empirical and experimental evaluations to differentiate between the techniques. The empirical evaluation ranks the techniques based on four criteria. The experimental assessments rank: (1) individual techniques under the same methodology sub-category, (2) different sub-categories within the same category, and (3) the broad categories themselves. Integrating the innovative methodological classification, empirical findings, and experimental assessments, the article offers a well-rounded understanding of ML strategies in protein function identification. The paper also explores techniques for multi-task and multi-label detection of protein functions, in addition to focusing on single-task methods. Moreover, the paper sheds light on the future avenues of ML in protein function determination.
4
Meng L, Wang X. TAWFN: a deep learning framework for protein function prediction. Bioinformatics 2024; 40:btae571. [PMID: 39312678 PMCID: PMC11639667 DOI: 10.1093/bioinformatics/btae571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Revised: 08/27/2024] [Accepted: 09/19/2024] [Indexed: 09/25/2024] Open
Abstract
MOTIVATION: Proteins play pivotal roles in biological systems, and precise prediction of their functions is indispensable for practical applications. Despite the surge in protein sequence data facilitated by high-throughput techniques, unraveling the exact functionalities of proteins still demands considerable time and resources. Currently, numerous methods rely on protein sequences for prediction, while methods targeting protein structures are scarce, often employing convolutional neural networks (CNN) or graph convolutional networks (GCNs) individually.
RESULTS: To address these challenges, our approach starts from protein structures and proposes a method that combines CNN and GCN into a unified framework called the two-model adaptive weight fusion network (TAWFN) for protein function prediction. First, amino acid contact maps and sequences are extracted from the protein structure. Then, the sequence is used to generate one-hot encoded features and deep semantic features. These features, along with the constructed graph, are fed into the adaptive graph convolutional networks (AGCN) module and the multi-layer convolutional neural network (MCNN) module as needed, resulting in preliminary classification outcomes. Finally, the preliminary classification results are inputted into the adaptive weight computation network, where adaptive weights are calculated to fuse the initial predictions from both networks, yielding the final prediction result. To evaluate the effectiveness of our method, experiments were conducted on the PDBset and AFset datasets. For molecular function, biological process, and cellular component tasks, TAWFN achieved area under the precision-recall curve (AUPR) values of 0.718, 0.385, and 0.488 respectively, with corresponding Fmax scores of 0.762, 0.628, and 0.693, and Smin scores of 0.326, 0.483, and 0.454. The experimental results demonstrate that TAWFN exhibits promising performance, outperforming existing methods.
AVAILABILITY AND IMPLEMENTATION: The TAWFN source code can be found at: https://github.com/ss0830/TAWFN.
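Fmax, reported above, is commonly defined in the protein-centric way used by the CAFA challenges: at each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all annotated proteins, and Fmax is the best resulting F-measure across thresholds. The sketch below follows that common definition; it is not taken from the TAWFN code base, it omits the ancestor propagation through the GO hierarchy that real evaluations usually apply, and the function name `fmax` and the toy GO identifiers are assumptions.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax.

    y_true  : list of sets, the true terms of each protein
    y_score : list of dicts mapping term -> predicted score in [0, 1]
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_score):
            if not truth:
                continue  # proteins without annotations are not evaluated
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:
                # Precision only counts proteins with at least one prediction.
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(truth))
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best

# Toy example with two proteins and made-up term identifiers.
truth = [{"GO:1", "GO:2"}, {"GO:3"}]
scores = [
    {"GO:1": 0.9, "GO:2": 0.4, "GO:3": 0.2},
    {"GO:3": 0.8, "GO:1": 0.6},
]
score = fmax(truth, scores)
```

Because the threshold is chosen after seeing the labels, Fmax is an optimistic summary, which is one reason it is usually reported together with AUPR and Smin, as in the abstract.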
Affiliation(s)
- Lu Meng
  - College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110000, China
- Xiaoran Wang
  - College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110000, China
5
Zhao Y, Yang Z, Wang L, Zhang Y, Lin H, Wang J. Predicting Protein Functions Based on Heterogeneous Graph Attention Technique. IEEE J Biomed Health Inform 2024; 28:2408-2415. [PMID: 38319781 DOI: 10.1109/jbhi.2024.3357834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/08/2024]
Abstract
In bioinformatics, protein function prediction stands as a fundamental area of research and plays a crucial role in addressing various biological challenges, such as the identification of potential targets for drug discovery and the elucidation of disease mechanisms. However, known functional annotation databases usually provide positive experimental annotations that proteins carry out a given function, and rarely record negative experimental annotations that proteins do not carry out a given function. Therefore, existing computational methods based on deep learning models focus on these positive annotations for prediction and ignore these scarce but informative negative annotations, leading to an underestimation of precision. To address this issue, we introduce a deep learning method that utilizes a heterogeneous graph attention technique. The method first constructs a heterogeneous graph that covers the protein-protein interaction network, ontology structure, and positive and negative annotation information. Then, it learns embedding representations of proteins and ontology terms by using the heterogeneous graph attention technique. Finally, it leverages these learned representations to reconstruct the positive protein-term associations and score unobserved functional annotations. It can enhance the predictive performance by incorporating these known limited negative annotations into the constructed heterogeneous graph. Experimental results on three species (i.e., Human, Mouse, and Arabidopsis) demonstrate that our method can achieve better performance in predicting new protein annotations than state-of-the-art methods.
6
Zhang C, Gao Q, Li M, Yu T. Implementing link prediction in protein networks via feature fusion models based on graph neural networks. Comput Biol Chem 2024; 108:107980. [PMID: 38000328 DOI: 10.1016/j.compbiolchem.2023.107980] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Revised: 10/07/2023] [Accepted: 11/02/2023] [Indexed: 11/26/2023]
Abstract
MOTIVATION: Protein-protein interactions serve as the cornerstone for various biochemical processes within biological organisms. Existing research methodologies predominantly employ link prediction techniques to analyze these interaction networks. However, traditional approaches often fall short in delivering satisfactory predictive performance when applied to multi-species datasets. Current computational methods largely focus on analyzing the network topology, resulting in a somewhat monolithic feature set. The integration of diverse features in the model could potentially yield superior performance and broader applicability. To this end, we propose an autoencoder model built on graph neural networks, designed to enhance both predictive performance and generalizability by leveraging the integration of gene ontology.
RESULTS: In this research, we developed AGraphSAGE, a model specifically designed for analyzing protein-protein interaction network data. By seamlessly integrating gene ontology into the graph structure, we employed a dual-channel graph sampling and aggregation network that capitalizes on topological information to process high-dimensional features. Feature fusion is achieved through the implementation of graph attention mechanisms, and we adopted a link prediction framework as the experimental training model. Performance was evaluated on real-world datasets using key metrics, such as Area Under the Curve (AUC). A hyperparameter search space was established, and a Bayesian optimization strategy was applied to iteratively fine-tune the model, assessing the impact of various parameters on predictive efficacy. The experimental results validate that our proposed model is capable of effectively predicting protein-protein interactions across diverse biological species.
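In link prediction, AUC can be read as the probability that a randomly chosen true interaction receives a higher score than a randomly chosen non-interacting pair, with ties counted as half. The sketch below computes it directly from that definition; it illustrates the metric only and is not part of AGraphSAGE. The function name `link_auc` and the example scores are assumptions.

```python
def link_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    pos_scores : scores the model gives to held-out true edges
    neg_scores : scores the model gives to sampled non-edges
    Ties contribute 0.5. Quadratic in the number of scores, which is fine
    for a sketch but too slow for large evaluation sets.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up scores for three true protein pairs and two sampled non-pairs.
auc_value = link_auc([0.9, 0.7, 0.4], [0.8, 0.3])
```

For realistic edge counts, a rank-based implementation that sorts the scores once gives the same value in O(n log n) time.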
Affiliation(s)
- Chi Zhang
  - College of Computer and Control Engineering, Qiqihar University, Qiqihar 161006, China
- Qian Gao
  - College of Computer and Control Engineering, Qiqihar University, Qiqihar 161006, China
- Ming Li
  - College of Computer and Control Engineering, Qiqihar University, Qiqihar 161006, China.
- Tianfei Yu
  - College of Life Science and Agriculture Forestry, Qiqihar University, Qiqihar 161006, China.
7
Wang W, Shuai Y, Yang Q, Zhang F, Zeng M, Li M. A comprehensive computational benchmark for evaluating deep learning-based protein function prediction approaches. Brief Bioinform 2024; 25:bbae050. [PMID: 38388682 PMCID: PMC10883809 DOI: 10.1093/bib/bbae050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 01/17/2024] [Accepted: 01/26/2024] [Indexed: 02/24/2024] Open
Abstract
Proteins play an important role in life activities and are the basic units for performing functions. Accurately annotating functions to proteins is crucial for understanding the intricate mechanisms of life and developing effective treatments for complex diseases. Traditional biological experiments struggle to keep pace with the growing number of known proteins. With the development of high-throughput sequencing technology, a wide variety of biological data provides the possibility to accurately predict protein functions by computational methods. Consequently, many computational methods have been proposed. Due to the diversity of application scenarios, it is necessary to conduct a comprehensive evaluation of these computational methods to determine the suitability of each algorithm for specific cases. In this study, we present a comprehensive benchmark, BeProf, to process data and evaluate representative computational methods. We first collect the latest datasets and analyze the data characteristics. Then, we investigate and summarize 17 state-of-the-art computational methods. Finally, we propose a novel comprehensive evaluation metric, design eight application scenarios and evaluate the performance of existing methods on these scenarios. Based on the evaluation, we provide practical recommendations for different scenarios, enabling users to select the most suitable method for their specific needs. All of these servers can be obtained from https://csuligroup.com/BEPROF and https://github.com/CSUBioGroup/BEPROF.
Affiliation(s)
- Wenkang Wang
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Yunyan Shuai
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Qiurong Yang
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Fuhao Zhang
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Min Zeng
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
8
Wu H, Liu J, Jiang T, Zou Q, Qi S, Cui Z, Tiwari P, Ding Y. AttentionMGT-DTA: A multi-modal drug-target affinity prediction using graph transformer and attention mechanism. Neural Netw 2024; 169:623-636. [PMID: 37976593 DOI: 10.1016/j.neunet.2023.11.018] [Citation(s) in RCA: 36] [Impact Index Per Article: 36.0] [Received: 06/26/2023] [Revised: 09/29/2023] [Accepted: 11/07/2023] [Indexed: 11/19/2023]
Abstract
The accurate prediction of drug-target affinity (DTA) is a crucial step in drug discovery and design. Traditional experiments are very expensive and time-consuming. Recently, deep learning methods have achieved notable performance improvements in DTA prediction. However, one challenge for deep learning-based models is appropriate and accurate representations of drugs and targets, especially the lack of effective exploration of target representations. Another challenge is how to comprehensively capture the interaction information between different instances, which is also important for predicting DTA. In this study, we propose AttentionMGT-DTA, a multi-modal attention-based model for DTA prediction. AttentionMGT-DTA represents drugs and targets by a molecular graph and binding pocket graph, respectively. Two attention mechanisms are adopted to integrate and interact information between different protein modalities and drug-target pairs. The experimental results showed that our proposed model outperformed state-of-the-art baselines on two benchmark datasets. In addition, AttentionMGT-DTA also had high interpretability by modeling the interaction strength between drug atoms and protein residues. Our code is available at https://github.com/JK-Liu7/AttentionMGT-DTA.
Affiliation(s)
- Hongjie Wu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China.
- Junkai Liu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China; Yangtze Delta Region Institute(Quzhou), University of Electronic Science and Technology of China, Quzhou, 324003, China.
- Tengsheng Jiang
  - Gusu School, Nanjing Medical University, Suzhou, 215009, China.
- Quan Zou
  - Yangtze Delta Region Institute(Quzhou), University of Electronic Science and Technology of China, Quzhou, 324003, China.
- Shujie Qi
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China.
- Zhiming Cui
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China.
- Prayag Tiwari
  - School of Information Technology, Halmstad University, Sweden.
- Yijie Ding
  - Yangtze Delta Region Institute(Quzhou), University of Electronic Science and Technology of China, Quzhou, 324003, China.
9
Chen J, Gu Z, Lai L, Pei J. In silico protein function prediction: the rise of machine learning-based approaches. MEDICAL REVIEW (2021) 2023; 3:487-510. [PMID: 38282798 PMCID: PMC10808870 DOI: 10.1515/mr-2023-0038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 10/11/2023] [Indexed: 01/30/2024]
Abstract
Proteins function as integral actors in essential life processes, rendering the realm of protein research a fundamental domain that possesses the potential to propel advancements in pharmaceuticals and disease investigation. Within the context of protein research, an imperious demand arises to uncover protein functionalities and untangle intricate mechanistic underpinnings. Due to the exorbitant costs and limited throughput inherent in experimental investigations, computational models offer a promising alternative to accelerate protein function annotation. In recent years, protein pre-training models have exhibited noteworthy advancement across multiple prediction tasks. This advancement highlights a notable prospect for effectively tackling the intricate downstream task associated with protein function prediction. In this review, we elucidate the historical evolution and research paradigms of computational methods for predicting protein function. Subsequently, we summarize the progress in protein and molecule representation as well as feature extraction techniques. Furthermore, we assess the performance of machine learning-based algorithms across various objectives in protein function prediction, thereby offering a comprehensive perspective on the progress within this field.
Affiliation(s)
- Jiaxiao Chen
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Zhonghui Gu
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Luhua Lai
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
  - Research Unit of Drug Design Method, Chinese Academy of Medical Sciences (2021RU014), Beijing, China
- Jianfeng Pei
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - Research Unit of Drug Design Method, Chinese Academy of Medical Sciences (2021RU014), Beijing, China
10
Yan W, Tan L, Meng-Shan L, Sheng S, Jun W, Fu-an W. SaPt-CNN-LSTM-AR-EA: a hybrid ensemble learning framework for time series-based multivariate DNA sequence prediction. PeerJ 2023; 11:e16192. [PMID: 37810796 PMCID: PMC10559882 DOI: 10.7717/peerj.16192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 09/06/2023] [Indexed: 10/10/2023] Open
Abstract
Biological sequence data mining is a hot spot in bioinformatics. A biological sequence can be regarded as a set of characters. Time series are similar to biological sequences in terms of both representation and mechanism. Therefore, in this article, biological sequences are represented as time series to obtain biological time sequences (BTS). A hybrid ensemble learning framework (SaPt-CNN-LSTM-AR-EA) for BTS is proposed. Single-sequence and multi-sequence models are constructed, respectively, with a self-adaption pre-training one-dimensional convolutional recurrent neural network and an autoregressive fractional integrated moving average model fused with an evolutionary algorithm. In DNA sequence experiments with six viruses, SaPt-CNN-LSTM-AR-EA achieved good overall prediction performance, with prediction accuracy and correlation reaching 1.7073 and 0.9186, respectively. SaPt-CNN-LSTM-AR-EA was compared with five other benchmark models to verify its effectiveness and stability, and it increased the average accuracy by about 30%. The framework proposed in this article is significant in biology, biomedicine, and computer science, and can be widely applied in sequence splicing, computational biology, bioinformation, and other fields.
Affiliation(s)
- Wu Yan
  - School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
  - School of Mathematics and Computer Science, Gannan Normal University, Ganzhou, Jiangxi, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
- Li Tan
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, China
- Li Meng-Shan
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou, China
- Sheng Sheng
  - School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
- Wang Jun
  - School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
- Wu Fu-an
  - School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
  - Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
11
Wang Z, Deng Z, Zhang W, Lou Q, Choi KS, Wei Z, Wang L, Wu J. MMSMAPlus: a multi-view multi-scale multi-attention embedding model for protein function prediction. Brief Bioinform 2023:7187109. [PMID: 37258453 DOI: 10.1093/bib/bbad201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2022] [Revised: 04/16/2023] [Accepted: 05/08/2023] [Indexed: 06/02/2023] Open
Abstract
Protein is the most important component in organisms and plays an indispensable role in life activities. In recent years, a large number of intelligent methods have been proposed to predict protein function. These methods obtain different types of protein information, including sequence, structure and interaction network. Among them, protein sequences have gained significant attention where methods are investigated to extract the information from different views of features. However, how to fully exploit the views for effective protein sequence analysis remains a challenge. In this regard, we propose a multi-view, multi-scale and multi-attention deep neural model (MMSMA) for protein function prediction. First, MMSMA extracts multi-view features from protein sequences, including one-hot encoding features, evolutionary information features, deep semantic features and overlapping property features based on physiochemistry. Second, a specific multi-scale multi-attention deep network model (MSMA) is built for each view to realize the deep feature learning and preliminary classification. In MSMA, both multi-scale local patterns and long-range dependence from protein sequences can be captured. Third, a multi-view adaptive decision mechanism is developed to make a comprehensive decision based on the classification results of all the views. To further improve the prediction performance, an extended version of MMSMA, MMSMAPlus, is proposed to integrate homology-based protein prediction under the framework of multi-view deep neural model. Experimental results show that the MMSMAPlus has promising performance and is significantly superior to the state-of-the-art methods. The source code can be found at https://github.com/wzy-2020/MMSMAPlus.
Affiliation(s)
- Zhongyu Wang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Zhaohong Deng
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Wei Zhang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Qiongdan Lou
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Zhisheng Wei
  - National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China
- Lei Wang
  - National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China
- Jing Wu
  - National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China
12
Li M, Shi W, Zhang F, Zeng M, Li Y. A Deep Learning Framework for Predicting Protein Functions With Co-Occurrence of GO Terms. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:833-842. [PMID: 35476573 DOI: 10.1109/tcbb.2022.3170719] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
The understanding of protein functions is critical to many biological problems such as the development of new drugs and new crops. To reduce the huge gap between the increase of protein sequences and annotations of protein functions, many methods have been proposed to deal with this problem. These methods use Gene Ontology (GO) to classify the functions of proteins and consider one GO term as a class label. However, they ignore the co-occurrence of GO terms that is helpful for protein function prediction. We propose a new deep learning model, named DeepPFP-CO, which uses Graph Convolutional Network (GCN) to explore and capture the co-occurrence of GO terms to improve the protein function prediction performance. In this way, we can further deduce the protein functions by fusing the predicted propensity of the center function and its co-occurrence functions. We use Fmax and AUPR to evaluate the performance of DeepPFP-CO and compare DeepPFP-CO with state-of-the-art methods such as DeepGOPlus and DeepGOA. The computational results show that DeepPFP-CO outperforms DeepGOPlus and other methods. Moreover, we further analyze our model at the protein level. The results have demonstrated that DeepPFP-CO improves the performance of protein function prediction. DeepPFP-CO is available at https://csuligroup.com/DeepPFP/.
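The abstract does not say how GO-term co-occurrence is quantified before it is passed to the GCN. One common choice in label-graph models is the conditional frequency P(b | a), the fraction of proteins annotated with term a that also carry term b. The sketch below computes that quantity from a protein-to-terms mapping. It is an illustrative assumption rather than the DeepPFP-CO procedure, and the function name `cooccurrence` and the placeholder identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(annotations):
    """Return {(a, b): P(b | a)} for every pair of terms seen together.

    annotations : dict mapping protein id -> set of GO terms
    The result is asymmetric: P(b | a) and P(a | b) generally differ.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotations.values():
        term_counts.update(terms)
        for a, b in combinations(sorted(terms), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return {(a, b): n / term_counts[a] for (a, b), n in pair_counts.items()}

# Made-up annotations for three proteins.
toy = {
    "P1": {"GO:A", "GO:B"},
    "P2": {"GO:A", "GO:B", "GO:C"},
    "P3": {"GO:A"},
}
probs = cooccurrence(toy)
```

The resulting values could serve as weighted, directed edges between GO-term nodes; pairs that never co-occur are simply absent from the dictionary.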
|
13
|
Zheng Y, Young ND, Song J, Chang BC, Gasser RB. An informatic workflow for the enhanced annotation of excretory/secretory proteins of Haemonchus contortus. Comput Struct Biotechnol J 2023; 21:2696-2704. [PMID: 37143762 PMCID: PMC10151223 DOI: 10.1016/j.csbj.2023.03.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2023] [Revised: 03/16/2023] [Accepted: 03/16/2023] [Indexed: 03/19/2023] Open
Abstract
Major advances in genomic and associated technologies have demanded reliable bioinformatic tools and workflows for the annotation of genes and their products via comparative analyses using well-curated reference data sets, accessible in public repositories. However, the accurate in silico annotation of molecules (proteins) encoded in organisms (e.g., multicellular parasites) which are evolutionarily distant from those for which these extensive reference data sets are available, including invertebrate model organisms (e.g., Caenorhabditis elegans - free-living nematode, and Drosophila melanogaster - the vinegar fly) and vertebrate species (e.g., Homo sapiens and Mus musculus), remains a major challenge. Here, we constructed an informatic workflow for the enhanced annotation of biologically-important, excretory/secretory (ES) proteins ("secretome") encoded in the genome of a parasitic roundworm, called Haemonchus contortus (commonly known as the barber's pole worm). We critically evaluated the performance of five distinct methods, refined some of them, and then combined the use of all five methods to comprehensively annotate ES proteins, according to gene ontology, biological pathways and/or metabolic (enzymatic) processes. Then, using optimised parameter settings, we applied this workflow to comprehensively annotate 2591 of all 3353 proteins (77.3%) in the secretome of H. contortus. This result is a substantial improvement (10-25%) over previous annotations using individual, "off-the-shelf" algorithms and default settings, indicating the ready applicability of the present, refined workflow to gene/protein sequence data sets from a wide range of organisms in the Tree-of-Life.
|
14
|
Kabir A, Shehu A. GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction. Biomolecules 2022; 12:1709. [PMID: 36421723 PMCID: PMC9687818 DOI: 10.3390/biom12111709] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/24/2022] [Revised: 11/14/2022] [Accepted: 11/15/2022] [Indexed: 09/19/2023] Open
Abstract
Protein Language Models (PLMs) have been shown to learn sequence representations useful for various prediction tasks, including subcellular localization, evolutionary relationships, family membership, and more. They have yet to be demonstrated useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open. This paper makes two key contributions. First, it debuts a novel method that leverages the transformer architecture in two ways. A sequence transformer encodes protein sequences in a task-agnostic feature space. A graph transformer learns a representation of GO terms while respecting their hierarchical relationships. The learned sequence and GO term representations are combined and utilized for multi-label classification, with the labels corresponding to GO terms. The method is shown to be superior to recent representative GO prediction methods. The second major contribution of this paper is a deep investigation of different ways of constructing training and testing datasets. The paper shows that existing approaches under- or over-estimate the generalization power of a model. A novel approach is proposed to address these issues, resulting in a new benchmark dataset to rigorously evaluate and compare methods and advance the state of the art.
Affiliation(s)
- Anowarul Kabir
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
- Amarda Shehu
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
  - Center for Advancing Human-Machine Partnerships, George Mason University, Fairfax, VA 22030, USA
  - Department of Bioengineering, George Mason University, Fairfax, VA 22030, USA
  - School of Systems Biology, George Mason University, Fairfax, VA 22030, USA
|
15
|
Li Y, Zeng M, Wu Y, Li Y, Li M. Accurate Prediction of Human Essential Proteins Using Ensemble Deep Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3263-3271. [PMID: 34699365 DOI: 10.1109/tcbb.2021.3122294] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/13/2023]
Abstract
Essential proteins are considered the foundation of life as they are indispensable for the survival of living organisms. Computational methods for essential protein discovery provide a fast way to identify essential proteins. But most of them heavily rely on various biological information, especially protein-protein interaction networks, which limits their practical applications. With the rapid development of high-throughput sequencing technology, sequencing data has become the most accessible biological data. However, using only protein sequence information to predict essential proteins has limited accuracy. In this paper, we propose EP-EDL, an ensemble deep learning model using only protein sequence information to predict human essential proteins. EP-EDL integrates multiple classifiers to alleviate the class imbalance problem and to improve prediction accuracy and robustness. In each base classifier, we employ multi-scale text convolutional neural networks to extract useful features from protein sequence feature matrices with evolutionary information. Our computational results show that EP-EDL outperforms the state-of-the-art sequence-based methods. Furthermore, EP-EDL provides a more practical and flexible way for biologists to accurately predict essential proteins. The source code and datasets can be downloaded from https://github.com/CSUBioGroup/EP-EDL.
|
16
|
Zhang Y, Hu Y, Li H, Liu X. Drug-protein interaction prediction via variational autoencoders and attention mechanisms. Front Genet 2022; 13:1032779. [PMID: 36313473 PMCID: PMC9614151 DOI: 10.3389/fgene.2022.1032779] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/31/2022] [Accepted: 09/30/2022] [Indexed: 09/29/2023] Open
Abstract
During the process of drug discovery, exploring drug-protein interactions (DPIs) is a key step. With the rapid growth of biological data, computer-aided methods are much faster than biological experiments. Deep learning methods have become popular and are mainly used to extract the characteristics of drugs and proteins for subsequent DPI prediction. Since conventional machine learning methods cannot fully extract effective features for DPI prediction, we propose a deep learning framework that uses variational autoencoders and attention mechanisms; it utilizes convolutional neural networks (CNNs) to obtain local features and attention mechanisms to obtain important information about drugs and proteins, which is very important for predicting DPIs. Compared with some machine learning methods on the C. elegans and human datasets, our approach performs better. On the BindingDB dataset, its accuracy (ACC) and area under the curve (AUC) reach 0.862 and 0.913, respectively. To verify the robustness of the model, multiclass classification tasks are performed on the Davis and KIBA datasets, and the ACC values reach 0.850 and 0.841, respectively, further demonstrating the effectiveness of the model.
Affiliation(s)
- Yue Zhang
  - School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, China
|
17
|
Sengupta K, Saha S, Halder AK, Chatterjee P, Nasipuri M, Basu S, Plewczynski D. PFP-GO: Integrating protein sequence, domain and protein-protein interaction information for protein function prediction using ranked GO terms. Front Genet 2022; 13:969915. [PMID: 36246645 PMCID: PMC9556876 DOI: 10.3389/fgene.2022.969915] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/15/2022] [Accepted: 08/31/2022] [Indexed: 11/13/2022] Open
Abstract
Protein function prediction is gradually emerging as an essential field in biological and computational studies. Although computational approaches have become well established, it has been observed that combining information gathered from multiple sources has more influence than information derived from a single source. Considering this fact, a methodology, PFP-GO, is proposed in which heterogeneous sources such as Protein Sequence, Protein Domain, and Protein-Protein Interaction Network are processed separately to rank each individual functional GO term. Based on this ranking, GO terms are propagated to the target proteins. While Protein Sequence contributes sequence-based information, Protein Domain and Protein-Protein Interaction Networks embed structural/functional and topological information, respectively, during the GO ranking phase. The performance of PFP-GO is evaluated using Precision, Recall, and F-Score, and it was found to perform reasonably better than existing state-of-the-art methods. PFP-GO has achieved an overall Precision, Recall, and F-Score of 0.67, 0.58, and 0.62, respectively. Furthermore, we check some of the top-ranked GO terms predicted by PFP-GO through multilayer network propagation that affect the 3D structure of the genome. The complete source code of PFP-GO is freely available at https://sites.google.com/view/pfp-go/.
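The three reported figures can be checked against each other. The sketch below assumes that "F-Score" here means the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly. The helper `f_score` and its `beta` parameter are illustrative names, not part of PFP-GO.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported in the abstract above.
reported_precision = 0.67
reported_recall = 0.58
derived_f1 = f_score(reported_precision, reported_recall)
rounded_f1 = round(derived_f1, 2)
```

Under that assumption the derived value is about 0.6218, which rounds to the reported 0.62.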
Affiliation(s)
- Kaustav Sengupta
  - Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- Sovan Saha
  - Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, West Bengal, India
- Anup Kumar Halder
  - Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- Piyali Chatterjee
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
- Mita Nasipuri
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- Subhadip Basu
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
  - *Correspondence: Subhadip Basu, Dariusz Plewczynski
- Dariusz Plewczynski
  - Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
  - *Correspondence: Subhadip Basu, Dariusz Plewczynski
|
18
|
Gao J, Lyu T, Xiong F, Wang J, Ke W, Li Z. Predicting the Survival of Cancer Patients With Multimodal Graph Neural Network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:699-709. [PMID: 34033545 DOI: 10.1109/tcbb.2021.3083566] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/12/2023]
Abstract
In recent years, survival prediction for cancer patients has become an important problem in global health and has attracted considerable attention in the medical informatics community. It can be framed as a classification task that is both meaningful and challenging. Nevertheless, research in this field is still limited. In this work, we design a novel Multimodal Graph Neural Network (MGNN) framework for predicting cancer survival, which explores the features of real-world multimodal data such as gene expression, copy number alteration and clinical data in a unified framework. Specifically, we first construct bipartite graphs between patients and multimodal data to explore their inherent relations. Subsequently, the embedding of each patient on the different bipartite graphs is obtained with a graph neural network. Finally, a multimodal fusion neural layer is proposed to fuse the medical features from the different data modalities. Comprehensive experiments on real-world datasets demonstrate the superiority of our model, with significant improvements over state-of-the-art methods. Furthermore, the proposed MGNN is validated to be more robust on four other cancer datasets.
|
19
|
Zhang F, Zhao B, Shi W, Li M, Kurgan L. DeepDISOBind: accurate prediction of RNA-, DNA- and protein-binding intrinsically disordered residues with deep multi-task learning. Brief Bioinform 2021; 23:6461158. [PMID: 34905768 DOI: 10.1093/bib/bbab521] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 08/02/2021] [Revised: 10/30/2021] [Accepted: 11/14/2021] [Indexed: 12/14/2022] Open
Abstract
Proteins with intrinsically disordered regions (IDRs) are common among eukaryotes. Many IDRs interact with nucleic acids and proteins. Annotation of these interactions is supported by computational predictors, but to date, only one tool that predicts interactions with nucleic acids was released, and recent assessments demonstrate that current predictors offer modest levels of accuracy. We have developed DeepDISOBind, an innovative deep multi-task architecture that accurately predicts deoxyribonucleic acid (DNA)-, ribonucleic acid (RNA)- and protein-binding IDRs from protein sequences. DeepDISOBind relies on an information-rich sequence profile that is processed by an innovative multi-task deep neural network, where subsequent layers are gradually specialized to predict interactions with specific partner types. The common input layer links to a layer that differentiates protein- and nucleic acid-binding, which further links to layers that discriminate between DNA and RNA interactions. Empirical tests show that this multi-task design provides statistically significant gains in predictive quality across the three partner types when compared to a single-task design and a representative selection of the existing methods that cover both disorder- and structure-trained tools. Analysis of the predictions on the human proteome reveals that DeepDISOBind predictions can be encoded into protein-level propensities that accurately predict DNA- and RNA-binding proteins and protein hubs. DeepDISOBind is available at https://www.csuligroup.com/DeepDISOBind/.
Affiliation(s)
- Fuhao Zhang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Bi Zhao
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
- Wenbo Shi
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Min Li
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
|
20
|
Meng X, Li W, Peng X, Li Y, Li M. Protein interaction networks: centrality, modularity, dynamics, and applications. Frontiers of Computer Science 2021; 15:156902. [DOI: 10.1007/s11704-020-8179-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/07/2018] [Accepted: 08/12/2020] [Indexed: 01/03/2025]
|
21
|
Vu TTD, Jung J. Protein function prediction with gene ontology: from traditional to deep learning models. PeerJ 2021; 9:e12019. [PMID: 34513334 PMCID: PMC8395570 DOI: 10.7717/peerj.12019] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/13/2021] [Accepted: 07/29/2021] [Indexed: 11/25/2022] Open
Abstract
Protein function prediction is a crucial part of genome annotation. Prediction methods have recently witnessed rapid development, owing to the emergence of high-throughput sequencing technologies. Among the available databases for identifying protein function terms, Gene Ontology (GO) is an important resource that describes the functional properties of proteins. Researchers are employing various approaches to efficiently predict GO terms. Meanwhile, deep learning, a fast-evolving data-driven discipline, exhibits impressive potential for assigning GO terms to amino acid sequences. Herein, we reviewed the currently available computational GO annotation methods for proteins, ranging from conventional to deep learning approaches. Further, we selected some suitable predictors from among the reviewed tools and conducted a mini comparison of their performance using a worldwide challenge dataset. Finally, we discussed the remaining major challenges in the field, and emphasized the future directions for protein function prediction with GO.
Affiliation(s)
- Thi Thuy Duong Vu
  - Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
- Jaehee Jung
  - Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
|
22
|
NPF: network propagation for protein function prediction. BMC Bioinformatics 2020; 21:355. [PMID: 32787776 PMCID: PMC7430911 DOI: 10.1186/s12859-020-03663-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 02/29/2020] [Accepted: 07/14/2020] [Indexed: 11/29/2022] Open
Abstract
Background: The accurate annotation of protein functions is of great significance in elucidating the phenomena of life, treating disease and developing new medicines. Various methods have been developed to facilitate the prediction of these functions by combining protein interaction networks (PINs) with multi-omics data. However, it is still challenging to make full use of multiple biological data sources to improve the performance of function annotation. Results: We presented NPF (Network Propagation for Functions prediction), an integrative protein function prediction framework assisted by network propagation and functional module detection, for discovering interacting partners with similar functions to target proteins. NPF leverages knowledge of the protein interaction network architecture and multi-omics data, such as domain annotation and protein complex information, to augment protein-protein functional similarity in a propagation manner. We have verified the great potential of NPF for accurately inferring protein functions. According to the comprehensive evaluation of NPF, it delivered better performance than other competing methods in terms of leave-one-out cross-validation and ten-fold cross-validation. Conclusions: We demonstrated that network propagation, together with multi-omics data, can discover more partners with similar functions and is not constrained by the “small-world” feature of protein interaction networks. We conclude that the performance of function prediction depends greatly on whether we can extract and exploit proper functional similarity information from protein correlations.
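To make the idea of network propagation concrete, the sketch below runs a generic random walk with restart on a small undirected graph. It is not the NPF algorithm, whose details the abstract does not give; the function `propagate`, the `restart` value, and the four-protein toy network are all assumptions chosen for illustration.

```python
def propagate(adjacency, seeds, restart=0.5, tol=1e-9, max_iter=1000):
    """Random walk with restart on an undirected graph.

    adjacency -- dict mapping each node to a list of its neighbours (symmetric)
    seeds     -- set of nodes whose function is already known
    restart   -- probability of jumping back to the seed set at each step
    """
    nodes = sorted(adjacency)
    degree = {n: len(adjacency[n]) for n in nodes}
    # Initial distribution: uniform over the seed nodes.
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(max_iter):
        updated = {}
        for n in nodes:
            # Each neighbour m passes on an equal share of its current score.
            inflow = sum(score[m] / degree[m] for m in adjacency[n] if degree[m])
            updated[n] = (1 - restart) * inflow + restart * start[n]
        change = sum(abs(updated[n] - score[n]) for n in nodes)
        score = updated
        if change < tol:
            break
    return score


# Hypothetical toy network: a chain of four proteins, with P1 annotated.
toy_network = {
    "P1": ["P2"],
    "P2": ["P1", "P3"],
    "P3": ["P2", "P4"],
    "P4": ["P3"],
}
toy_scores = propagate(toy_network, seeds={"P1"})
```

In this toy chain the stationary scores fall off with distance from the annotated protein, which is the intuition behind transferring functions to network neighbours.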
|