1
|
Song Y, Yuan Q, Chen S, Zeng Y, Zhao H, Yang Y. Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures. Nat Commun 2024; 15:8180. [PMID: 39294165 PMCID: PMC11411130 DOI: 10.1038/s41467-024-52533-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Accepted: 09/11/2024] [Indexed: 09/20/2024] Open
Abstract
Enzymes are crucial in numerous biological processes, with the Enzyme Commission (EC) number being a commonly used method for defining enzyme function. However, current EC number prediction technologies have not fully recognized the importance of enzyme active sites and structural characteristics. Here, we propose GraphEC, a geometric graph learning-based EC number predictor using the ESMFold-predicted structures and a pre-trained protein language model. Specifically, we first construct a model to predict the enzyme active sites, which is utilized to predict the EC number. The prediction is further improved through a label diffusion algorithm by incorporating homology information. In parallel, the optimum pH of enzymes is predicted to reflect the enzyme-catalyzed reactions. Experiments demonstrate the superior performance of our model in predicting active sites, EC numbers, and optimum pH compared to other state-of-the-art methods. Additional analysis reveals that GraphEC is capable of extracting functional information from protein structures, emphasizing the effectiveness of geometric graph learning. This technology can be used to identify unannotated enzyme functions, as well as to predict their active sites and optimum pH, with the potential to advance research in synthetic biology, genomics, and other fields.
Affiliation(s)
- Yidong Song
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China
- Qianmu Yuan
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China
  - High Performance Computing Department, National Supercomputing Center in Shenzhen, Shenzhen, Guangdong, China
- Sheng Chen
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China
- Yuansong Zeng
  - School of Big Data & Software Engineering, Chongqing University, Chongqing, China
- Huiying Zhao
  - Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, Guangdong, China
- Yuedong Yang
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China
  - Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou, China
|
2
|
Zou H. iRNA5hmC-HOC: High-order correlation information for identifying RNA 5-hydroxymethylcytosine modification. J Bioinform Comput Biol 2022; 20:2250017. [DOI: 10.1142/s0219720022500172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
|
3
|
Zou H, Yang F, Yin Z. iTTCA-MFF: identifying tumor T cell antigens based on multiple feature fusion. Immunogenetics 2022; 74:447-454. [PMID: 35246701 DOI: 10.1007/s00251-022-01258-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/04/2022] [Accepted: 02/26/2022] [Indexed: 11/05/2022]
Abstract
Cancer is a terrible disease, and recent studies have reported that tumor T cell antigens (TTCAs) may play a promising role in cancer treatment. Since experimental methods are still expensive and time-consuming, it is highly desirable to develop automatic computational methods to identify tumor T cell antigens from the huge number of natural and synthetic peptides. Hence, in this study, a novel computational model called iTTCA-MFF was proposed to identify TTCAs. To describe the sequence effectively, the physicochemical (PC) properties of amino acids and the residue pairwise energy content matrix (RECM) were first employed to encode peptide sequences. Then, two different approaches, covariance and Pearson's correlation coefficient (PCC), were used to collect discriminative information from the PC and RECM matrices. Next, an effective feature selection approach called the least absolute shrinkage and selection operator (LASSO) was adopted to select the optimal features. These selected features were fed into a support vector machine (SVM) for identifying TTCAs. We performed experiments on two different datasets, and the experimental results indicated that the proposed method is promising and may play a complementary role to the existing methods for identifying TTCAs. The datasets and codes are available at https://figshare.com/articles/online_resource/iTTCA-MFF/17636120 .
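The covariance and PCC step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the property table `TOY_PROPERTIES`, its values, and the function names are hypothetical placeholders, and the LASSO selection and SVM stages are left out. The sketch only shows how pairwise covariance and Pearson correlation between per-residue property columns give a fixed-length vector regardless of peptide length.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy scale: two made-up property values per residue.
# These are NOT real physicochemical or RECM values.
TOY_PROPERTIES = {
    "A": (1.8, 0.3),
    "G": (-0.4, 0.1),
    "L": (3.8, 0.9),
    "K": (-3.9, 0.5),
}


def encode(peptide):
    """Turn a peptide into one numeric column per property."""
    rows = [TOY_PROPERTIES[residue] for residue in peptide]
    return [list(column) for column in zip(*rows)]


def covariance(x, y):
    """Population covariance of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either column is constant."""
    std_x, std_y = sqrt(covariance(x, x)), sqrt(covariance(y, y))
    if std_x == 0 or std_y == 0:
        return 0.0
    return covariance(x, y) / (std_x * std_y)


def pairwise_features(peptide):
    """Covariance and PCC for every pair of property columns."""
    columns = encode(peptide)
    features = []
    for i, j in combinations(range(len(columns)), 2):
        features.append(covariance(columns[i], columns[j]))
        features.append(pearson(columns[i], columns[j]))
    return features
```

With two property columns, `pairwise_features("AGLK")` and `pairwise_features("AGLKLA")` both return two values, which is what allows peptides of different lengths to share one feature space before selection and classification.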
Affiliation(s)
- Hongliang Zou
  - School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, 330003, China
- Fan Yang
  - School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, 330003, China
- Zhijian Yin
  - School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, 330003, China
|
4
|
Khan KA, Memon SA, Naveed H. A hierarchical deep learning based approach for multi-functional enzyme classification. Protein Sci 2021; 30:1935-1945. [PMID: 34118089 DOI: 10.1002/pro.4146] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/08/2021] [Revised: 06/10/2021] [Accepted: 06/11/2021] [Indexed: 11/09/2022]
Abstract
Enzymes are critical proteins in every organism. They speed up essential chemical reactions, help fight diseases, and are widely used in the pharmaceutical and manufacturing industries. Wet lab experiments to determine an enzyme's function are time-consuming and expensive. Therefore, computational approaches to address this problem are becoming necessary. Usually, an enzyme is extremely specific in performing its function. However, there exist enzymes that can perform multiple functions. A multi-functional enzyme has vast potential as it reduces the need to discover or use different enzymes for different functions. We propose an approach to predict a multi-functional enzyme's function up to the most specific fourth level of the hierarchy of the Enzyme Commission (EC) number. Previous studies can only predict the function of the enzyme up to level 1. Using a dataset of 2,583 multi-functional enzymes, we achieved a hierarchical subset accuracy of 71.4% and a Macro F1 Score of 96.1% at the fourth level. The robustness of the network was further tested on a multi-functional isoforms dataset. Our method is broadly applicable and may be used to discover better enzymes. The web-server can be freely accessed at http://hecnet.cbrlab.org/.
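The four-level EC hierarchy mentioned in this abstract can be made concrete with a short sketch. The helper names `ec_prefixes` and `deepest_matching_level` are hypothetical and are not taken from the cited work; the sketch assumes only that an EC number consists of four dot-separated fields.

```python
def ec_prefixes(ec_number):
    """Return the EC number truncated at each of its four levels."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    return [".".join(parts[:level]) for level in range(1, 5)]


def deepest_matching_level(true_ec, predicted_ec):
    """Count how many leading levels two EC numbers share (0 to 4)."""
    depth = 0
    for true_prefix, predicted_prefix in zip(ec_prefixes(true_ec), ec_prefixes(predicted_ec)):
        if true_prefix != predicted_prefix:
            break
        depth += 1
    return depth
```

For example, `ec_prefixes("3.4.21.4")` yields `["3", "3.4", "3.4.21", "3.4.21.4"]`, so a predictor limited to level 1 would only recover the first element, while a fourth-level predictor must match the whole string.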
Affiliation(s)
- Kinaan Aamir Khan
  - Computational Biology Research Lab, National University of Computer and Emerging Sciences, Islamabad, Pakistan
- Safyan Aman Memon
  - Computational Biology Research Lab, National University of Computer and Emerging Sciences, Islamabad, Pakistan
- Hammad Naveed
  - Computational Biology Research Lab, National University of Computer and Emerging Sciences, Islamabad, Pakistan
|
5
|
Feehan R, Montezano D, Slusky JSG. Machine learning for enzyme engineering, selection and design. Protein Eng Des Sel 2021; 34:gzab019. [PMID: 34296736 PMCID: PMC8299298 DOI: 10.1093/protein/gzab019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/14/2020] [Revised: 06/18/2021] [Accepted: 06/23/2021] [Indexed: 11/15/2022] Open
Abstract
Machine learning is a useful computational tool for large and complex tasks such as those in the field of enzyme engineering, selection and design. In this review, we examine enzyme-related applications of machine learning. We start by comparing tools that can identify the function of an enzyme and the site responsible for that function. Then we detail methods for optimizing important experimental properties, such as the enzyme environment and enzyme reactants. We describe recent advances in enzyme systems design and enzyme design itself. Throughout we compare and contrast the data and algorithms used for these tasks to illustrate how the algorithms and data can be best used by future designers.
Affiliation(s)
- Ryan Feehan
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
- Daniel Montezano
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
- Joanna S G Slusky
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
  - Department of Molecular Biosciences, The University of Kansas, 1200 Sunnyside Ave., Lawrence, KS 66045-7600, USA
|
6
|
Su C, Gong JS, Qin J, Li H, Li H, Xu ZH, Shi JS. The tale of a versatile enzyme: Molecular insights into keratinase for its industrial dissemination. Biotechnol Adv 2020; 45:107655. [PMID: 33186607 DOI: 10.1016/j.biotechadv.2020.107655] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 05/30/2020] [Revised: 10/30/2020] [Accepted: 11/02/2020] [Indexed: 01/02/2023]
Abstract
Keratinases are unique among proteolytic enzymes for their ability to degrade recalcitrant insoluble proteins, and they are of critical importance in keratin waste management. Over the past few decades, researchers have focused on discovering keratinase producers, as well as producing and characterizing keratinases. The application potential of keratinases has been investigated in the feed, fertilizer, leathering, detergent, cosmetic, and medical industries. However, the commercial availability of keratinases is still limited due to poor productivity and properties, such as thermostability, storage stability and resistance to organic reagents. Advances in molecular biotechnology have provided powerful tools for enhancing the production and functional properties of keratinase. This critical review systematically summarizes the application potential of keratinase, and in particular certain newly discovered catalytic capabilities. Furthermore, we provide comprehensive insight into mechanistic and molecular aspects of keratinases including analysis of gene sequences and protein structures. In addition, development and current advances in protein engineering of keratinases are summarized and discussed, revealing that the engineering of protein domains such as signal peptides and pro-peptides has become an important strategy to increase production of keratinases. Finally, prospects for further development are also proposed, indicating that advanced protein engineering technologies will lead to improved and additional commercial keratinases for various industrial applications.
Affiliation(s)
- Chang Su
  - Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, School of Pharmaceutical Sciences, Jiangnan University, Wuxi 214122, PR China
- Jin-Song Gong
  - Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, School of Pharmaceutical Sciences, Jiangnan University, Wuxi 214122, PR China
- Jiufu Qin
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
- Heng Li
  - Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, School of Pharmaceutical Sciences, Jiangnan University, Wuxi 214122, PR China
- Hui Li
  - Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, School of Pharmaceutical Sciences, Jiangnan University, Wuxi 214122, PR China
- Zheng-Hong Xu
  - National Engineering Laboratory for Cereal Fermentation Technology, School of Biotechnology, Jiangnan University, Wuxi 214122, PR China; Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, Wuxi 214122, PR China
- Jin-Song Shi
  - Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, School of Pharmaceutical Sciences, Jiangnan University, Wuxi 214122, PR China
|
7
|
Zhan XK, You ZH, Li LP, Li Y, Wang Z, Pan J. Using Random Forest Model Combined With Gabor Feature to Predict Protein-Protein Interaction From Protein Sequence. Evol Bioinform Online 2020; 16:1176934320934498. [PMID: 32655275 PMCID: PMC7328357 DOI: 10.1177/1176934320934498] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/05/2020] [Accepted: 05/20/2020] [Indexed: 12/12/2022] Open
Abstract
Protein-protein interactions (PPIs) play a crucial role in the life cycles of living cells. Thus, it is important to understand the underlying mechanisms of PPIs. Although many high-throughput technologies have generated large amounts of PPI data in different organisms, the experiments for detecting PPIs are still costly and time-consuming. Therefore, novel computational methods are urgently needed for predicting PPIs, and their development is drawing more and more attention. In this study, we proposed a novel computational method based on the texture features of protein sequences for predicting PPIs. Specifically, the Gabor feature is used to extract texture features and protein evolutionary information from the Position-Specific Scoring Matrix, which is generated by the Position-Specific Iterated Basic Local Alignment Search Tool. Then, random forest–based classifiers are used to infer the protein interactions. When applied to PPI data sets of yeast, human, and Helicobacter pylori, the method obtained good results, with average accuracies of 92.10%, 97.03%, and 86.45%, respectively. To better evaluate the proposed method, we compared the Gabor feature, Discrete Cosine Transform, and Local Phase Quantization. Our results show that the proposed method is both feasible and stable and that the Gabor feature descriptor is reliable in extracting protein sequence information. Furthermore, additional experiments have been conducted to predict PPIs on 4 other species data sets. The promising results indicate that our proposed method is both powerful and robust.
Affiliation(s)
- Xin-Ke Zhan
  - School of Information Engineering, Xijing University, Xi'an, China
- Zhu-Hong You
  - School of Information Engineering, Xijing University, Xi'an, China
- Li-Ping Li
  - School of Information Engineering, Xijing University, Xi'an, China
- Yang Li
  - School of Information Engineering, Xijing University, Xi'an, China
- Zheng Wang
  - School of Information Engineering, Xijing University, Xi'an, China
- Jie Pan
  - School of Information Engineering, Xijing University, Xi'an, China
|
8
|
Abstract
During the last three decades or so, many efforts have been made to study the protein cleavage sites of disease-causing enzymes, such as HIV (Human Immunodeficiency Virus) protease and SARS (Severe Acute Respiratory Syndrome) coronavirus main proteinase. It has become increasingly clear via this mini-review that the motivation driving the aforementioned studies is quite wise, and that the results acquired through these studies are very rewarding, particularly for developing peptide drugs.
Affiliation(s)
- Kuo-Chen Chou
  - Gordon Life Science Institute, Boston, MA 02478, United States
|
9
|
|
10
|
Semwal R, Aier I, Tyagi P, Varadwaj PK. DeEPn: a deep neural network based tool for enzyme functional annotation. J Biomol Struct Dyn 2020; 39:2733-2743. [PMID: 32274968 DOI: 10.1080/07391102.2020.1754292] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 10/24/2022]
Abstract
With the advancement of high-throughput techniques, the discovery rate of enzyme sequences has increased significantly in the recent past. All of these raw sequences need to be precisely mapped to their respective functional attributes, which helps in deciphering their biological role. Various prediction models have recently been proposed to predict the enzyme functional class; however, all of these models were able to quantify at most six functional enzyme classes (EC1 to EC6) out of the seven existing functional classes, making these approaches inappropriate for handling enzymes corresponding to the seventh functional class (EC7). In this study, a Deep Neural Network-based approach, DeEPn, has been proposed, which can quantify enzymes corresponding to all seven functional classes with high precision and accuracy. The proposed model was compared with two recently developed tools, ECPred and SVM-Prot. The results demonstrated that DeEPn outperformed ECPred and SVM-Prot in terms of predictive quality. The DeEPn tool has been hosted as a web-based tool at https://bioserver.iiita.ac.in/DeEPn/. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Rahul Semwal
  - Department of Information Technology (Bioinformatics), Indian Institute of Information Technology Allahabad, Allahabad, Uttar Pradesh, India
- Imlimaong Aier
  - Department of Bioinformatics and Applied Science, Indian Institute of Information Technology, Allahabad, Allahabad, Uttar Pradesh, India
- Pankaj Tyagi
  - Department of Information Technology (Bioinformatics), Indian Institute of Information Technology Allahabad, Allahabad, Uttar Pradesh, India
- Pritish Kumar Varadwaj
  - Department of Bioinformatics and Applied Science, Indian Institute of Information Technology, Allahabad, Allahabad, Uttar Pradesh, India
|
11
|
Some illuminating remarks on molecular genetics and genomics as well as drug development. Mol Genet Genomics 2020; 295:261-274. [PMID: 31894399 DOI: 10.1007/s00438-019-01634-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 02/07/2023]
Abstract
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector while still retaining considerable sequence-order information or its special pattern. To deal with this challenging problem, the ideas of "pseudo amino acid components" and "pseudo K-tuple nucleotide composition" have been proposed. These ideas and their approaches have further stimulated the birth of the "distorted key theory" and the "wenxiang diagram", substantially strengthened the power in treating multi-label systems, and led to the establishment of the famous "5-steps rule". All these logical developments are quite natural and are very useful not only for theoretical scientists but also for experimental scientists conducting genetics/genomics analysis and drug development. Also presented in this review paper are their future perspectives; i.e., their impacts will become even more significant and profound.
|
12
|
Shao YT, Liu XX, Lu Z, Chou KC. pLoc_Deep-mHum: Predict Subcellular Localization of Human Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.127042] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/20/2022]
|
13
|
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 13.0] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most of the functions critical to the cell’s survival are performed by these proteins located in its different organelles, usually called “subcellular locations”. Information on the subcellular localization of a protein can provide useful clues about its function. To reveal the intricate pathways at the cellular level, knowledge of the subcellular localization of proteins in a cell is a prerequisite. Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing and selecting the right targets for drug development. Unfortunately, it is both time-consuming and costly to determine the subcellular locations of proteins purely based on experiments. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop computational methods for rapidly and effectively identifying the subcellular locations of uncharacterized proteins based on their sequence information alone. Considerable progress has been achieved in this regard. This review is focused on those methods that have the capacity to deal with multi-label proteins that may simultaneously exist in two or more subcellular location sites. Protein molecules with this kind of characteristic are vitally important for finding multi-target drugs, a current hot trend in drug development. This review also focuses on those methods that have user-friendly web-servers established so that the majority of experimental scientists can use them to get the desired results without the need to go through the detailed mathematics involved.
Affiliation(s)
- Kuo-Chen Chou
  - Gordon Life Science Institute, Boston, MA 02478, United States
|
14
|
|
15
|
Chou KC. Proposing Pseudo Amino Acid Components is an Important Milestone for Proteome and Genome Analyses. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09910-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 12/28/2022]
|
16
|
|
17
|
Zou Z, Tian S, Gao X, Li Y. mlDEEPre: Multi-Functional Enzyme Function Prediction With Hierarchical Multi-Label Deep Learning. Front Genet 2019; 9:714. [PMID: 30723495 PMCID: PMC6349967 DOI: 10.3389/fgene.2018.00714] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Received: 11/13/2018] [Accepted: 12/20/2018] [Indexed: 12/26/2022] Open
Abstract
As a great challenge in bioinformatics, enzyme function prediction is a significant step toward designing novel enzymes and diagnosing enzyme-related diseases. Existing studies mainly focus on the mono-functional enzyme function prediction. However, the number of multi-functional enzymes is growing rapidly, which requires novel computational methods to be developed. In this paper, following our previous work, DEEPre, which uses deep learning to annotate mono-functional enzyme's function, we propose a novel method, mlDEEPre, which is designed specifically for predicting the functionalities of multi-functional enzymes. By adopting a novel loss function, associated with the relationship between different labels, and a self-adapted label assigning threshold, mlDEEPre can accurately and efficiently perform multi-functional enzyme prediction. Extensive experiments also show that mlDEEPre can outperform the other methods in predicting whether an enzyme is a mono-functional or a multi-functional enzyme (mono-functional vs. multi-functional), as well as the main class prediction across different criteria. Furthermore, due to the flexibility of mlDEEPre and DEEPre, mlDEEPre can be incorporated into DEEPre seamlessly, which enables the updated DEEPre to handle both mono-functional and multi-functional predictions without human intervention.
Affiliation(s)
- Zhenzhen Zou
  - Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Shuye Tian
  - Department of Biology, Southern University of Science and Technology (SUSTC), Shenzhen, China
- Xin Gao
  - Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Yu Li
  - Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
|
18
|
Li Y, Wang S, Umarov R, Xie B, Fan M, Li L, Gao X. DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics 2018; 34:760-769. [PMID: 29069344 PMCID: PMC6030869 DOI: 10.1093/bioinformatics/btx680] [Citation(s) in RCA: 146] [Impact Index Per Article: 20.9] [Received: 07/13/2017] [Accepted: 10/20/2017] [Indexed: 11/15/2022] Open
Abstract
Motivation: Annotation of enzyme function has a broad range of applications, such as metagenomics, industrial biotechnology, and diagnosis of enzyme deficiency-caused diseases. However, the time and resources required make it prohibitively expensive to experimentally determine the function of every enzyme. Therefore, computational enzyme function prediction has become increasingly important. In this paper, we develop such an approach, determining the enzyme function by predicting the Enzyme Commission number. Results: We propose an end-to-end feature selection and classification model training approach, as well as an automatic and robust feature dimensionality uniformization method, DEEPre, in the field of enzyme function prediction. Instead of extracting manually crafted features from enzyme sequences, our model takes the raw sequence encoding as inputs, extracting convolutional and sequential features from the raw encoding based on the classification result to directly improve the prediction performance. The thorough cross-fold validation experiments conducted on two large-scale datasets show that DEEPre improves the prediction performance over the previous state-of-the-art methods. In addition, our server outperforms five other servers in determining the main class of enzymes on a separate low-homology dataset. Two case studies demonstrate DEEPre’s ability to capture the functional difference of enzyme isoforms. Availability and implementation: The server can be accessed freely at http://www.cbrc.kaust.edu.sa/DEEPre. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yu Li
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Sheng Wang
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Ramzan Umarov
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Bingqing Xie
  - Computer Science Department, Illinois Institute of Technology, Chicago, IL 60616, USA
- Ming Fan
  - Institute of Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou 310018, China
- Lihua Li
  - Institute of Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou 310018, China
- Xin Gao
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
|
19
|
Amidi S, Amidi A, Vlachakis D, Paragios N, Zacharaki EI. Automatic single- and multi-label enzymatic function prediction by machine learning. PeerJ 2017; 5:e3095. [PMID: 28367366 PMCID: PMC5374972 DOI: 10.7717/peerj.3095] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 11/30/2016] [Accepted: 02/15/2017] [Indexed: 11/25/2022] Open
Abstract
The number of protein structures in the PDB database has increased more than 15-fold since 1999. The creation of computational models predicting enzymatic function is of major importance since such models provide the means to better understand the behavior of newly discovered enzymes when catalyzing chemical reactions. Until now, single-label classification has been widely performed for predicting enzymatic function, limiting the application to enzymes performing unique reactions and introducing errors when multi-functional enzymes are examined. Indeed, some enzymes may perform different reactions and can hence be directly associated with multiple enzymatic functions. In the present work, we propose a multi-label enzymatic function classification scheme that combines structural and amino acid sequence information. We investigate two fusion approaches (at the feature level and the decision level) and assess the methodology for general enzymatic function prediction indicated by the first digit of the enzyme commission (EC) code (six main classes) on 40,034 enzymes from the PDB database. The proposed single-label and multi-label models correctly predict the actual functional activities in 97.8% and 95.5% (based on Hamming loss) of the cases, respectively. In addition, the multi-label model predicts all possible enzymatic reactions in 85.4% of the multi-labeled enzymes when the number of reactions is unknown. Code and datasets are available at https://figshare.com/s/a63e0bafa9b71fc7cbd7.
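The two kinds of multi-label scores quoted in this abstract (a Hamming-loss-based rate and the fraction of enzymes whose full label set is recovered) can be illustrated with a short sketch. It uses the standard textbook definitions and made-up toy labels; the function names are hypothetical and the exact evaluation protocol of the cited paper is not reproduced here.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total_slots = sum(len(row) for row in y_true)
    wrong_slots = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong_slots / total_slots


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)


# Toy data: three enzymes, six main EC classes, one binary slot per class.
toy_true = [
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],  # a multi-functional enzyme
    [0, 0, 0, 1, 0, 0],
]
toy_pred = [
    [1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],  # one of the two functions is missed
    [0, 0, 0, 1, 0, 0],
]
```

On this toy data one slot out of eighteen is wrong, so `1 - hamming_loss(toy_true, toy_pred)` is high even though only two of the three enzymes have their complete label set recovered, which is why the two figures in an evaluation can differ noticeably.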
Affiliation(s)
- Shervine Amidi
  - Department of Applied Mathematics, Center for Visual Computing, Ecole Centrale de Paris (CentraleSupélec), Châtenay-Malabry, France
- Afshine Amidi
  - Department of Applied Mathematics, Center for Visual Computing, Ecole Centrale de Paris (CentraleSupélec), Châtenay-Malabry, France
- Dimitrios Vlachakis
  - MDAKM Group, Department of Computer Engineering and Informatics, University of Patras, Patras, Greece
- Nikos Paragios
  - Department of Applied Mathematics, Center for Visual Computing, Ecole Centrale de Paris (CentraleSupélec), Châtenay-Malabry, France
  - Equipe GALEN, INRIA Saclay, Orsay, France
- Evangelia I. Zacharaki
  - Department of Applied Mathematics, Center for Visual Computing, Ecole Centrale de Paris (CentraleSupélec), Châtenay-Malabry, France
  - Equipe GALEN, INRIA Saclay, Orsay, France
|
20
|
Wan S, Mak MW, Kung SY. Ensemble Linear Neighborhood Propagation for Predicting Subchloroplast Localization of Multi-Location Proteins. J Proteome Res 2016; 15:4755-4762. [DOI: 10.1021/acs.jproteome.6b00686] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Indexed: 11/28/2022]
Affiliation(s)
- Shibiao Wan
  - Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
- Man-Wai Mak
  - Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
- Sun-Yuan Kung
  - Department of Electrical Engineering, Princeton University, New Jersey 08540, United States
|