1
|
Zakharov OS, Rudik AV, Filimonov DA, Lagunin AA. Prediction of Protein Secondary Structures Based on Substructural Descriptors of Molecular Fragments. Int J Mol Sci 2024; 25:12525. [PMID: 39684237 DOI: 10.3390/ijms252312525] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2024] [Revised: 11/15/2024] [Accepted: 11/19/2024] [Indexed: 12/18/2024] Open
Abstract
The accurate prediction of secondary structures of proteins (SSPs) is a critical challenge in molecular biology and structural bioinformatics. Despite recent advancements, this task remains complex and demands further exploration. This study presents a novel approach to SSP prediction using atom-centric substructural multilevel neighborhoods of atoms (MNA) descriptors for protein molecular fragments. A dataset comprising over 335,000 SSPs, annotated by the Dictionary of Secondary Structure in Proteins (DSSP) software from 37,000 proteins, was constructed from Protein Data Bank (PDB) records with a resolution of 2 Å or better. Protein fragments were converted into structural formulae using the RDKit Python package and stored in SD files using the MOL V3000 format. Classification sequence-structure-property relationships (SSPR) models were developed with varying levels of MNA descriptors and a Bayesian algorithm implemented in MultiPASS software. The average prediction accuracy (AUC) for eight SSP types, calculated via leave-one-out cross-validation, was 0.902. For independent test sets (ASTRAL and CB513 datasets), the best SSPR models achieved AUC, Q3, and Q8 values of 0.860, 77.32%, 70.92% and 0.889, 78.78%, 74.74%, respectively. Based on the created models, a freely available web application MNA-PSS-Pred was developed.
Collapse
Affiliation(s)
- Oleg S Zakharov
- Department of Bioinformatics, Pirogov Russian National Research Medical University, 117997 Moscow, Russia
| | - Anastasia V Rudik
- Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia
| | - Dmitry A Filimonov
- Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia
| | - Alexey A Lagunin
- Department of Bioinformatics, Pirogov Russian National Research Medical University, 117997 Moscow, Russia
- Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia
| |
Collapse
|
2
|
Lefin N, Herrera-Belén L, Farias JG, Beltrán JF. Review and perspective on bioinformatics tools using machine learning and deep learning for predicting antiviral peptides. Mol Divers 2024; 28:2365-2374. [PMID: 37626205 DOI: 10.1007/s11030-023-10718-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2023] [Accepted: 08/15/2023] [Indexed: 08/27/2023]
Abstract
Viruses constitute a constant threat to global health and have caused millions of human and animal deaths throughout human history. Despite advances in the discovery of antiviral compounds that help fight these pathogens, finding a solution to this problem continues to be a task that consumes time and financial resources. Currently, artificial intelligence (AI) has revolutionized many areas of the biological sciences, making it possible to decipher patterns in amino acid sequences that encode different functions and activities. Within the field of AI, machine learning, and deep learning algorithms have been used to discover antimicrobial peptides. Due to their effectiveness and specificity, antimicrobial peptides (AMPs) hold excellent promise for treating various infections caused by pathogens. Antiviral peptides (AVPs) are a specific type of AMPs that have activity against certain viruses. Unlike the research focused on the development of tools and methods for the prediction of antimicrobial peptides, those related to the prediction of AVPs are still scarce. Given the significance of AVPs as potential pharmaceutical options for human and animal health and the ongoing AI revolution, we have reviewed and summarized the current machine learning and deep learning-based tools and methods available for predicting these types of peptides.
Collapse
Affiliation(s)
- Nicolás Lefin
- Department of Chemical Engineering, Faculty of Engineering and Science, University of La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
| | - Lisandra Herrera-Belén
- Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomás, Temuco, Chile
| | - Jorge G Farias
- Department of Chemical Engineering, Faculty of Engineering and Science, University of La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
| | - Jorge F Beltrán
- Department of Chemical Engineering, Faculty of Engineering and Science, University of La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile.
| |
Collapse
|
3
|
Chen Y, Chen G, Chen CYC. MFTrans: A multi-feature transformer network for protein secondary structure prediction. Int J Biol Macromol 2024; 267:131311. [PMID: 38599417 DOI: 10.1016/j.ijbiomac.2024.131311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2023] [Revised: 03/21/2024] [Accepted: 03/30/2024] [Indexed: 04/12/2024]
Abstract
In the rapidly evolving field of computational biology, accurate prediction of protein secondary structures is crucial for understanding protein functions, facilitating drug discovery, and advancing disease diagnostics. In this paper, we propose MFTrans, a deep learning-based multi-feature fusion network aimed at enhancing the precision and efficiency of Protein Secondary Structure Prediction (PSSP). This model employs a Multiple Sequence Alignment (MSA) Transformer in combination with a multi-view deep learning architecture to effectively capture both global and local features of protein sequences. MFTrans integrates diverse features generated by protein sequences, including MSA, sequence information, evolutionary information, and hidden state information, using a multi-feature fusion strategy. The MSA Transformer is utilized to interleave row and column attention across the input MSA, while a Transformer encoder and decoder are introduced to enhance the extracted high-level features. A hybrid network architecture, combining a convolutional neural network with a bidirectional Gated Recurrent Unit (BiGRU) network, is used to further extract high-level features after feature fusion. In independent tests, our experimental results show that MFTrans has superior generalization ability, outperforming other state-of-the-art PSSP models by 3 % on average on public benchmarks including CASP12, CASP13, CASP14, TEST2016, TEST2018, and CB513. Case studies further highlight its advanced performance in predicting mutation sites. MFTrans contributes significantly to the protein science field, opening new avenues for drug discovery, disease diagnosis, and protein.
Collapse
Affiliation(s)
- Yifu Chen
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong 518107, China
| | - Guanxing Chen
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong 518107, China
| | - Calvin Yu-Chian Chen
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong 518107, China; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, Guangdong 518055, China; School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, Guangdong 518055, China; Department of Medical Research, China Medical University Hospital, Taichung 40447, Taiwan; Department of Bioinformatics and Medical Engineering, Asia University, Taichung 41354, Taiwan.
| |
Collapse
|
4
|
Atif HB, Alvi H, Naveed H. Masked Language Modeling for Resource Constrained Biological Natural Language Processing. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2023; 2023:1-5. [PMID: 38083556 DOI: 10.1109/embc40787.2023.10340499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/18/2023]
Abstract
Recent advances in Natural Language Processing (NLP) have produced state of the art results on several sequence to sequence (seq2seq) tasks. Enhancements in embedders and their training methodologies have shown significant improvement on downstream tasks. Word vector models like Word2Vec, FastText & Glove were widely used over one-hot encoded vectors for years until the advent of deep contextualized embedders. Protein sequences consist of 20 naturally occurring amino acids that can be treated as the language of nature. These amino acids in combinations with each other makeup the biological functions. The choice of vector representation and architecture design for a biological task is highly dependent upon the nature of the task. We utilize unlabelled protein sequences to train a Convolution and Gated Recurrent Network (CGRN) embedder using Masked Language Modeling (MLM) technique that shows significant performance boost under resource constraint setting on two downstream tasks i.e., F1-score(Q8) of 73.1% on Secondary Structure Prediction (SSP) & F1-score of 84% on Intrinsically Disordered Region Prediction (IDRP). We also compare different architectures on downstream tasks to show the impact of the nature of biological task on the performance of the model.
Collapse
|
5
|
Yuan L, Ma Y, Liu Y. Ensemble deep learning models for protein secondary structure prediction using bidirectional temporal convolution and bidirectional long short-term memory. Front Bioeng Biotechnol 2023; 11:1051268. [PMID: 36860882 PMCID: PMC9968878 DOI: 10.3389/fbioe.2023.1051268] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2022] [Accepted: 02/03/2023] [Indexed: 02/16/2023] Open
Abstract
Protein secondary structure prediction (PSSP) is a challenging task in computational biology. However, existing models with deep architectures are not sufficient and comprehensive for deep long-range feature extraction of long sequences. This paper proposes a novel deep learning model to improve Protein secondary structure prediction. In the model, our proposed bidirectional temporal convolutional network (BTCN) can extract the bidirectional deep local dependencies in protein sequences segmented by the sliding window technique, the bidirectional long short-term memory (BLSTM) network can extract the global interactions between residues, and our proposed multi-scale bidirectional temporal convolutional network (MSBTCN) can further capture the bidirectional multi-scale long-range features of residues while preserving the hidden layer information more comprehensively. In particular, we also propose that fusing the features of 3-state and 8-state Protein secondary structure prediction can further improve the prediction accuracy. Moreover, we also propose and compare multiple novel deep models by combining bidirectional long short-term memory with temporal convolutional network (TCN), reverse temporal convolutional network (RTCN), multi-scale temporal convolutional network (multi-scale bidirectional temporal convolutional network), bidirectional temporal convolutional network and multi-scale bidirectional temporal convolutional network, respectively. Furthermore, we demonstrate that the reverse prediction of secondary structure outperforms the forward prediction, suggesting that amino acids at later positions have a greater impact on secondary structure recognition. Experimental results on benchmark datasets including CASP10, CASP11, CASP12, CASP13, CASP14, and CB513 show that our methods achieve better prediction performance compared to five state-of-the-art methods.
Collapse
Affiliation(s)
- Lu Yuan
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
| | - Yuming Ma
- *Correspondence: Yuming Ma, ; Yihui Liu,
| | - Yihui Liu
- *Correspondence: Yuming Ma, ; Yihui Liu,
| |
Collapse
|
6
|
Yuan L, Ma Y, Liu Y. Protein secondary structure prediction based on Wasserstein generative adversarial networks and temporal convolutional networks with convolutional block attention modules. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:2203-2218. [PMID: 36899529 DOI: 10.3934/mbe.2023102] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/18/2023]
Abstract
As an important task in bioinformatics, protein secondary structure prediction (PSSP) is not only beneficial to protein function research and tertiary structure prediction, but also to promote the design and development of new drugs. However, current PSSP methods cannot sufficiently extract effective features. In this study, we propose a novel deep learning model WGACSTCN, which combines Wasserstein generative adversarial network with gradient penalty (WGAN-GP), convolutional block attention module (CBAM) and temporal convolutional network (TCN) for 3-state and 8-state PSSP. In the proposed model, the mutual game of generator and discriminator in WGAN-GP module can effectively extract protein features, and our CBAM-TCN local extraction module can capture key deep local interactions in protein sequences segmented by sliding window technique, and the CBAM-TCN long-range extraction module can further capture the key deep long-range interactions in sequences. We evaluate the performance of the proposed model on seven benchmark datasets. Experimental results show that our model exhibits better prediction performance compared to the four state-of-the-art models. The proposed model has strong feature extraction ability, which can extract important information more comprehensively.
Collapse
Affiliation(s)
- Lu Yuan
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
| | - Yuming Ma
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
| | - Yihui Liu
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
| |
Collapse
|
7
|
Mufassirin MMM, Newton MAH, Sattar A. Artificial intelligence for template-free protein structure prediction: a comprehensive review. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10350-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
|
8
|
Yuan L, Hu X, Ma Y, Liu Y. DLBLS_SS: protein secondary structure prediction using deep learning and broad learning system. RSC Adv 2022; 12:33479-33487. [PMID: 36505696 PMCID: PMC9682407 DOI: 10.1039/d2ra06433b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Accepted: 11/16/2022] [Indexed: 11/24/2022] Open
Abstract
Protein secondary structure prediction (PSSP) is not only beneficial to the study of protein structure and function but also to the development of drugs. As a challenging task in computational biology, experimental methods for PSSP are time-consuming and expensive. In this paper, we propose a novel PSSP model DLBLS_SS based on deep learning and broad learning system (BLS) to predict 3-state and 8-state secondary structure. We first use a bidirectional long short-term memory (BLSTM) network to extract global features in residue sequences. Then, our proposed SEBTCN based on temporal convolutional networks (TCN) and channel attention can capture bidirectional key long-range dependencies in sequences. We also use BLS to rapidly optimize fused features while further capturing local interactions between residues. We conduct extensive experiments on public test sets including CASP10, CASP11, CASP12, CASP13, CASP14 and CB513 to evaluate the performance of the model. Experimental results show that our model exhibits better 3-state and 8-state PSSP performance compared to five state-of-the-art models.
Collapse
Affiliation(s)
- Lu Yuan
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| | - Xiaopei Hu
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| | - Yuming Ma
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| | - Yihui Liu
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| |
Collapse
|
9
|
Ismi DP, Pulungan R, Afiahayati. Deep learning for protein secondary structure prediction: Pre and post-AlphaFold. Comput Struct Biotechnol J 2022; 20:6271-6286. [PMID: 36420164 PMCID: PMC9678802 DOI: 10.1016/j.csbj.2022.11.012] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2022] [Revised: 11/05/2022] [Accepted: 11/05/2022] [Indexed: 11/13/2022] Open
Abstract
This paper aims to provide a comprehensive review of the trends and challenges of deep neural networks for protein secondary structure prediction (PSSP). In recent years, deep neural networks have become the primary method for protein secondary structure prediction. Previous studies showed that deep neural networks had uplifted the accuracy of three-state secondary structure prediction to more than 80%. Favored deep learning methods, such as convolutional neural networks, recurrent neural networks, inception networks, and graph neural networks, have been implemented in protein secondary structure prediction. Methods adapted from natural language processing (NLP) and computer vision are also employed, including attention mechanism, ResNet, and U-shape networks. In the post-AlphaFold era, PSSP studies focus on different objectives, such as enhancing the quality of evolutionary information and exploiting protein language models as the PSSP input. The recent trend to utilize pre-trained language models as input features for secondary structure prediction provides a new direction for PSSP studies. Moreover, the state-of-the-art accuracy achieved by previous PSSP models is still below its theoretical limit. There are still rooms for improvement to be made in the field.
Collapse
Affiliation(s)
- Dewi Pramudi Ismi
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
- Department of Infomatics, Faculty of Industrial Technology, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
| | - Reza Pulungan
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
| | - Afiahayati
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
| |
Collapse
|
10
|
Jin X, Guo L, Jiang Q, Wu N, Yao S. Prediction of protein secondary structure based on an improved channel attention and multiscale convolution module. Front Bioeng Biotechnol 2022; 10:901018. [PMID: 35935483 PMCID: PMC9355137 DOI: 10.3389/fbioe.2022.901018] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2022] [Accepted: 06/28/2022] [Indexed: 11/13/2022] Open
Abstract
Prediction of the protein secondary structure is a key issue in protein science. Protein secondary structure prediction (PSSP) aims to construct a function that can map the amino acid sequence into the secondary structure so that the protein secondary structure can be obtained according to the amino acid sequence. Driven by deep learning, the prediction accuracy of the protein secondary structure has been greatly improved in recent years. To explore a new technique of PSSP, this study introduces the concept of an adversarial game into the prediction of the secondary structure, and a conditional generative adversarial network (GAN)-based prediction model is proposed. We introduce a new multiscale convolution module and an improved channel attention (ICA) module into the generator to generate the secondary structure, and then a discriminator is designed to conflict with the generator to learn the complicated features of proteins. Then, we propose a PSSP method based on the proposed multiscale convolution module and ICA module. The experimental results indicate that the conditional GAN-based protein secondary structure prediction (CGAN-PSSP) model is workable and worthy of further study because of the strong feature-learning ability of adversarial learning.
Collapse
Affiliation(s)
- Xin Jin
- Engineering Research Center of Cyberspace, Yunnan University, Kunming, Yunnan, China
- School of Software, Yunnan University, Kunming, Yunnan, China
| | - Lin Guo
- Engineering Research Center of Cyberspace, Yunnan University, Kunming, Yunnan, China
- School of Software, Yunnan University, Kunming, Yunnan, China
| | - Qian Jiang
- Engineering Research Center of Cyberspace, Yunnan University, Kunming, Yunnan, China
- School of Software, Yunnan University, Kunming, Yunnan, China
| | - Nan Wu
- Engineering Research Center of Cyberspace, Yunnan University, Kunming, Yunnan, China
- School of Software, Yunnan University, Kunming, Yunnan, China
| | - Shaowen Yao
- Engineering Research Center of Cyberspace, Yunnan University, Kunming, Yunnan, China
- School of Software, Yunnan University, Kunming, Yunnan, China
| |
Collapse
|
11
|
Yang W, Liu Y, Xiao C. Deep metric learning for accurate protein secondary structure prediction. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
12
|
Contiguously hydrophobic sequences are functionally significant throughout the human exome. Proc Natl Acad Sci U S A 2022; 119:e2116267119. [PMID: 35294280 PMCID: PMC8944643 DOI: 10.1073/pnas.2116267119] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open
Abstract
SignificanceProteins rely on the hydrophobic effect to maintain structure and interactions with the environment. Surprisingly, natural selection on amino acid hydrophobicity has not been detected using modern genetic data. Analyses that treat each amino acid separately do not reveal significant results, which we confirm here. However, because the hydrophobic effect becomes more powerful as more hydrophobic molecules are introduced, we tested whether unbroken stretches of hydrophobic amino acids are under selection. Using genetic variant data from across the human genome, we find evidence that selection increases with the length of the unbroken hydrophobic sequence. These results could lead to improvements in a wide range of genomic tools as well as insights into protein-aggregation disease etiology and protein evolutionary history.
Collapse
|
13
|
Enireddy V, Karthikeyan C, Babu DV. OneHotEncoding and LSTM-based deep learning models for protein secondary structure prediction. Soft comput 2022. [DOI: 10.1007/s00500-022-06783-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
14
|
Protein secondary structure prediction using a lightweight convolutional network and label distribution aware margin loss. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107771] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/14/2023]
|
15
|
de Oliveira GB, Pedrini H, Dias Z. Ensemble of Template-Free and Template-Based Classifiers for Protein Secondary Structure Prediction. Int J Mol Sci 2021; 22:11449. [PMID: 34768880 PMCID: PMC8583764 DOI: 10.3390/ijms222111449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2021] [Revised: 10/18/2021] [Accepted: 10/20/2021] [Indexed: 11/16/2022] Open
Abstract
Protein secondary structures are important in many biological processes and applications. Due to advances in sequencing methods, there are many proteins sequenced, but fewer proteins with secondary structures defined by laboratory methods. With the development of computer technology, computational methods have (started to) become the most important methodologies for predicting secondary structures. We evaluated two different approaches to this problem-driven by the recent results obtained by computational methods in this task-(i) template-free classifiers, based on machine learning techniques; and (ii) template-based classifiers, based on searching tools. Both approaches are formed by different sub-classifiers-six for template-free and two for template-based, each with a specific view of the protein. Our results show that these ensembles improve the results of each approach individually.
Collapse
|
16
|
Caudai C, Galizia A, Geraci F, Le Pera L, Morea V, Salerno E, Via A, Colombo T. AI applications in functional genomics. Comput Struct Biotechnol J 2021; 19:5762-5790. [PMID: 34765093 PMCID: PMC8566780 DOI: 10.1016/j.csbj.2021.10.009] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2021] [Revised: 10/05/2021] [Accepted: 10/05/2021] [Indexed: 12/13/2022] Open
Abstract
We review the current applications of artificial intelligence (AI) in functional genomics. The recent explosion of AI follows the remarkable achievements made possible by "deep learning", along with a burst of "big data" that can meet its hunger. Biology is about to overthrow astronomy as the paradigmatic representative of big data producer. This has been made possible by huge advancements in the field of high throughput technologies, applied to determine how the individual components of a biological system work together to accomplish different processes. The disciplines contributing to this bulk of data are collectively known as functional genomics. They consist in studies of: i) the information contained in the DNA (genomics); ii) the modifications that DNA can reversibly undergo (epigenomics); iii) the RNA transcripts originated by a genome (transcriptomics); iv) the ensemble of chemical modifications decorating different types of RNA transcripts (epitranscriptomics); v) the products of protein-coding transcripts (proteomics); and vi) the small molecules produced from cell metabolism (metabolomics) present in an organism or system at a given time, in physiological or pathological conditions. After reviewing main applications of AI in functional genomics, we discuss important accompanying issues, including ethical, legal and economic issues and the importance of explainability.
Collapse
Affiliation(s)
- Claudia Caudai
- CNR, Institute of Information Science and Technologies “A. Faedo” (ISTI), Pisa, Italy
| | - Antonella Galizia
- CNR, Institute of Applied Mathematics and Information Technologies (IMATI), Genoa, Italy
| | - Filippo Geraci
- CNR, Institute for Informatics and Telematics (IIT), Pisa, Italy
| | - Loredana Le Pera
- CNR, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies (IBIOM), Bari, Italy
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Veronica Morea
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Emanuele Salerno
- CNR, Institute of Information Science and Technologies “A. Faedo” (ISTI), Pisa, Italy
| | - Allegra Via
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Teresa Colombo
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| |
Collapse
|
17
|
Akbar S, Pardasani KR, Panda NR. PSO Based Neuro-fuzzy Model for Secondary Structure Prediction of Protein. Neural Process Lett 2021. [DOI: 10.1007/s11063-021-10615-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
|
18
|
Recent Advances in the Prediction of Protein Structural Classes: Feature Descriptors and Machine Learning Algorithms. CRYSTALS 2021. [DOI: 10.3390/cryst11040324] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
Abstract
In the postgenomic age, rapid growth in the number of sequence-known proteins has been accompanied by much slower growth in the number of structure-known proteins (as a result of experimental limitations), and a widening gap between the two is evident. Because protein function is linked to protein structure, successful prediction of protein structure is of significant importance in protein function identification. Foreknowledge of protein structural class can help improve protein structure prediction with significant medical and pharmaceutical implications. Thus, a fast, suitable, reliable, and reasonable computational method for protein structural class prediction has become pivotal in bioinformatics. Here, we review recent efforts in protein structural class prediction from protein sequence, with particular attention paid to new feature descriptors, which extract information from protein sequence, and the use of machine learning algorithms in both feature selection and the construction of new classification models. These new feature descriptors include amino acid composition, sequence order, physicochemical properties, multiprofile Bayes, and secondary structure-based features. Machine learning methods, such as artificial neural networks (ANNs), support vector machine (SVM), K-nearest neighbor (KNN), random forest, deep learning, and examples of their application are discussed in detail. We also present our view on possible future directions, challenges, and opportunities for the applications of machine learning algorithms for prediction of protein structural classes.
Collapse
|
19
|
Guo L, Jiang Q, Jin X, Liu L, Zhou W, Yao S, Wu M, Wang Y. A Deep Convolutional Neural Network to Improve the Prediction of Protein Secondary Structure. Curr Bioinform 2020. [DOI: 10.2174/1574893615666200120103050] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
Protein secondary structure prediction (PSSP) is a fundamental task in
bioinformatics that is helpful for understanding the three-dimensional structure and biological
function of proteins. Many neural network-based prediction methods have been developed for
protein secondary structures. Deep learning and multiple features are two obvious means to improve
prediction accuracy.
Objective:
To promote the development of PSSP, a deep convolutional neural network-based
method is proposed to predict both the eight-state and three-state of protein secondary structure.
Methods:
In this model, sequence and evolutionary information of proteins are combined as multiple
input features after preprocessing. A deep convolutional neural network with no pooling layer and
connection layer is then constructed to predict the secondary structure of proteins. L2 regularization,
batch normalization, and dropout techniques are employed to avoid over-fitting and obtain better
prediction performance, and an improved cross-entropy is used as the loss function.
Results:
Our proposed model can obtain Q3 prediction results of 86.2%, 84.5%, 87.8%, and 84.7%,
respectively, on CullPDB, CB513, CASP10 and CASP11 datasets, with corresponding Q8
prediction results of 74.1%, 70.5%, 74.9%, and 71.3%.
Conclusion:
We have proposed the DCNN-SS deep convolutional-network-based PSSP method,
and experimental results show that DCNN-SS performs competitively with other methods.
Collapse
Affiliation(s)
- Lin Guo
- School of Software, Yunnan University, Kunming, China; 2School of Information, Yunnan Normal University, Kunming, China
| | - Qian Jiang
- School of Software, Yunnan University, Kunming, China; 2School of Information, Yunnan Normal University, Kunming, China
| | - Xin Jin
- School of Software, Yunnan University, Kunming, China; 2School of Information, Yunnan Normal University, Kunming, China
| | - Lin Liu
- School of Software, Yunnan University, Kunming, China; 2School of Information, Yunnan Normal University, Kunming, China
| | - Wei Zhou
- School of Software, Yunnan University, Kunming, China; 2School of Information, Yunnan Normal University, Kunming, China
| | - Shaowen Yao
- School of Software, Yunnan University, Kunming, China; 2School of Information, Yunnan Normal University, Kunming, China
| | - Min Wu
- School of Software, Yunnan University, Kunming, China; 2School of Information, Yunnan Normal University, Kunming, China
| | - Yun Wang
- School of Software, Yunnan University, Kunming, China; 2School of Information, Yunnan Normal University, Kunming, China
| |
Collapse
|
20
|
An enhanced protein secondary structure prediction using deep learning framework on hybrid profile based features. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2019.105926] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
|
21
|
Abstract
BACKGROUND Signal peptides play an important role in protein sorting, which is the mechanism whereby proteins are transported to their destination. Recognition of signal peptides is an important first step in determining the active locations and functions of proteins. Many computational methods have been proposed to facilitate signal peptide recognition. In recent years, the development of deep learning methods has seen significant advances in many research fields. However, most existing models for signal peptide recognition use one-hidden-layer neural networks or hidden Markov models, which are relatively simple in comparison with the deep neural networks that are used in other fields. RESULTS This study proposes a convolutional neural network without fully connected layers, which is an important network improvement in computer vision. The proposed network is more complex in comparison with current signal peptide predictors. The experimental results show that the proposed network outperforms current signal peptide predictors on eukaryotic data. This study also demonstrates how model reduction and data augmentation helps the proposed network to predict bacterial data. CONCLUSIONS The study makes three contributions to this subject: (a) an accurate signal peptide recognizer is developed, (b) the potential to leverage advanced networks from other fields is demonstrated and (c) important modifications are proposed while adopting complex networks on signal peptide recognition.
Collapse
Affiliation(s)
- Jhe-Ming Wu
- Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan
| | - Yu-Chen Liu
- Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan
| | - Darby Tien-Hao Chang
- Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan.
| |
Collapse
|
22
|
Investigation of machine learning techniques on proteomics: A comprehensive survey. PROGRESS IN BIOPHYSICS AND MOLECULAR BIOLOGY 2019; 149:54-69. [PMID: 31568792 DOI: 10.1016/j.pbiomolbio.2019.09.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/05/2019] [Revised: 09/16/2019] [Accepted: 09/23/2019] [Indexed: 11/21/2022]
Abstract
Proteomics is the extensive investigation of proteins which has empowered the recognizable proof of consistently expanding quantities of protein. Proteins are necessary part of living life form, with numerous capacities. The proteome is the complete arrangement of proteins that are created or altered by a life form or framework of the organism. Proteome fluctuates with time and unambiguous prerequisites, or stresses, that a cell or organism experiences. Proteomics is an interdisciplinary area that has derived from the hereditary data of different genome ventures. Much proteomics information is gathered with the assistance of high throughput techniques, for example, mass spectrometry and microarray. It would regularly take weeks or months to analyze the information and perform examinations by hand. Therefore, scholars and scientific experts are teaming up with computer science researchers and mathematicians to make projects and pipeline to computationally examine the protein information. Utilizing bioinformatics procedures, scientists are prepared to do quicker investigation and protein information storing. The goal of this paper is to brief about the review of machine learning procedures and its application in the field of proteomics.
Collapse
|
23
|
Tsuchiya Y, Taneishi K, Yonezawa Y. Autoencoder-Based Detection of Dynamic Allostery Triggered by Ligand Binding Based on Molecular Dynamics. J Chem Inf Model 2019; 59:4043-4051. [PMID: 31386362 DOI: 10.1021/acs.jcim.9b00426] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
Dynamic allostery on proteins, triggered by regulator binding or chemical modifications, transmits information from the binding site to distant regions, dramatically altering protein function. It is accompanied by subtle changes in side-chain conformations of the protein, indicating that the changes in dynamics, and not rigid or large conformational changes, are essential to understand regulation of protein function. Although a lot of experimental and theoretical studies have been dedicated to investigate this issue, the regulation mechanism of protein function is still being debated. Here, we propose an autoencoder-based method that can detect dynamic allostery. The method is based on the comparison of time fluctuations of protein structures, in the form of distance matrices, obtained from molecular dynamics simulations in ligand-bound and -unbound forms. Our method detected that the changes in dynamics by ligand binding in the PDZ2 domain led to the reorganization of correlative fluctuation motions among residue pairs, which revealed a different view of the correlated motions from the PCA and DCCM. In addition, other correlative motions were also found as a result of the dynamic perturbation from the ligand binding, which may lead to dynamic allostery. This autoencoder-based method would be usefully applied to the signal transduction and mutagenesis systems involved in protein functions and severe diseases.
Collapse
Affiliation(s)
- Yuko Tsuchiya
- Artificial Intelligence Research Center , National Institute of Advanced Industrial Science and Technology , 2-4-7 Aomi , Koto-ku , Tokyo 135-0064 , Japan
| | - Kei Taneishi
- Cluster for Science, Technology and Innovation Hub , RIKEN , 6-7-3 Minatojima-minamimachi , Chuo-ku, Kobe , Hyogo 650-0047 , Japan
| | - Yasushige Yonezawa
- High Pressure Protein Research Center, Institute of Advanced Technology , Kindai University , 930 Nishimitani , Kinokawa , Wakayama 649-6493 , Japan
| |
Collapse
|
24
|
rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments. PLoS One 2019; 14:e0220182. [PMID: 31415569 PMCID: PMC6695225 DOI: 10.1371/journal.pone.0220182] [Citation(s) in RCA: 53] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2019] [Accepted: 07/10/2019] [Indexed: 12/01/2022] Open
Abstract
In the last decades, huge efforts have been made in the bioinformatics community to develop machine learning-based methods for the prediction of structural features of proteins in the hope of answering fundamental questions about the way proteins function and their involvement in several illnesses. The recent advent of Deep Learning has renewed the interest in neural networks, with dozens of methods being developed taking advantage of these new architectures. However, most methods are still heavily based pre-processing of the input data, as well as extraction and integration of multiple hand-picked, and manually designed features. Multiple Sequence Alignments (MSA) are the most common source of information in de novo prediction methods. Deep Networks that automatically refine the MSA and extract useful features from it would be immensely powerful. In this work, we propose a new paradigm for the prediction of protein structural features called rawMSA. The core idea behind rawMSA is borrowed from the field of natural language processing to map amino acid sequences into an adaptively learned continuous space. This allows the whole MSA to be input into a Deep Network, thus rendering pre-calculated features such as sequence profiles and other features calculated from MSA obsolete. We showcased the rawMSA methodology on three different prediction problems: secondary structure, relative solvent accessibility and inter-residue contact maps. We have rigorously trained and benchmarked rawMSA on a large set of proteins and have determined that it outperforms classical methods based on position-specific scoring matrices (PSSM) when predicting secondary structure and solvent accessibility, while performing on par with methods using more pre-calculated features in the inter-residue contact map prediction category in CASP12 and CASP13. Clearly demonstrating that rawMSA represents a promising development that can pave the way for improved methods using rawMSA instead of sequence profiles to represent evolutionary information in the coming years. Availability: datasets, dataset generation code, evaluation code and models are available at: https://bitbucket.org/clami66/rawmsa.
Collapse
|
25
|
Wardah W, Khan M, Sharma A, Rashid MA. Protein secondary structure prediction using neural networks and deep learning: A review. Comput Biol Chem 2019; 81:1-8. [DOI: 10.1016/j.compbiolchem.2019.107093] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2018] [Revised: 12/28/2018] [Accepted: 07/10/2019] [Indexed: 02/02/2023]
|
26
|
Wang X, Yu B, Ma A, Chen C, Liu B, Ma Q. Protein-protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique. Bioinformatics 2019; 35:2395-2402. [PMID: 30520961 PMCID: PMC6612859 DOI: 10.1093/bioinformatics/bty995] [Citation(s) in RCA: 93] [Impact Index Per Article: 15.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2018] [Revised: 11/19/2018] [Accepted: 12/03/2018] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The prediction of protein-protein interaction (PPI) sites is a key to mutation design, catalytic reaction and the reconstruction of PPI networks. It is a challenging task considering the significant abundant sequences and the imbalance issue in samples. RESULTS A new ensemble learning-based method, Ensemble Learning of synthetic minority oversampling technique (SMOTE) for Unbalancing samples and RF algorithm (EL-SMURF), was proposed for PPI sites prediction in this study. The sequence profile feature and the residue evolution rates were combined for feature extraction of neighboring residues using a sliding window, and the SMOTE was applied to oversample interface residues in the feature space for the imbalance problem. The Multi-dimensional Scaling feature selection method was implemented to reduce feature redundancy and subset selection. Finally, the Random Forest classifiers were applied to build the ensemble learning model, and the optimal feature vectors were inserted into EL-SMURF to predict PPI sites. The performance validation of EL-SMURF on two independent validation datasets showed 77.1% and 77.7% accuracy, which were 6.2-15.7% and 6.1-18.9% higher than the other existing tools, respectively. AVAILABILITY AND IMPLEMENTATION The source codes and data used in this study are publicly available at http://github.com/QUST-AIBBDRC/EL-SMURF/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Xiaoying Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China
- School of Mathematics, Shandong University, Jinan, China
- Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, China
| | - Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China
- Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, China
- School of Life Sciences, University of Science and Technology of China, Hefei, China
| | - Anjun Ma
- Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
- Department Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
| | - Cheng Chen
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China
- Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, China
| | - Bingqiang Liu
- School of Mathematics, Shandong University, Jinan, China
| | - Qin Ma
- Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
- Department Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
| |
Collapse
|
27
|
Wu G, Zhang D, Chen W, Zuo W, Xia Z. Robust Deep Softmax Regression Against Label Noise for Unsupervised Domain Adaptation. INT J PATTERN RECOGN 2019. [DOI: 10.1142/s0218001419400020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Domain adaptation aims to generalize the classification model from a source domain to a different but related target domain. Recent studies have revealed the benefit of deep convolutional features trained on a large dataset (e.g. ImageNet) in alleviating domain discrepancy. However, literatures show that the transferability of features decreases as (i) the difference between the source and target domains increases, or (ii) the layers are toward the top layers. Therefore, even with deep features, domain adaptation remains necessary. In this paper, we propose a novel unsupervised domain adaptation (UDA) model for deep neural networks, which is learned with the labeled source samples and the unlabeled target ones simultaneously. For target samples without labels, pseudo labels are assigned to them according to their maximum classification scores during training of the UDA model. However, due to the domain discrepancy, label noise generally is inevitable, which degrades the performance of the domain adaptation model. Thus, to effectively utilize the target samples, three specific robust deep softmax regression (RDSR) functions are performed for them with high, medium and low classification confidence respectively. Extensive experiments show that our method yields the state-of-the-art results, demonstrating the effectiveness of the robust deep softmax regression classifier in UDA.
Collapse
Affiliation(s)
- Guangbin Wu
- State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin, P. R. China
| | - David Zhang
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, P. R. China
| | - Weishan Chen
- State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin, P. R. China
| | - Wangmeng Zuo
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, P. R. China
| | - Zhuang Xia
- State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin, P. R. China
| |
Collapse
|
28
|
Musil M, Konegger H, Hon J, Bednar D, Damborsky J. Computational Design of Stable and Soluble Biocatalysts. ACS Catal 2018. [DOI: 10.1021/acscatal.8b03613] [Citation(s) in RCA: 56] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
Affiliation(s)
- Milos Musil
- Loschmidt Laboratories, Centre for Toxic Compounds in the Environment (RECETOX), and Department of Experimental Biology, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
- IT4Innovations Centre of Excellence, Faculty of Information Technology, Brno University of Technology, 612 66 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital, Pekarska 53, 656 91 Brno, Czech Republic
| | - Hannes Konegger
- Loschmidt Laboratories, Centre for Toxic Compounds in the Environment (RECETOX), and Department of Experimental Biology, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital, Pekarska 53, 656 91 Brno, Czech Republic
| | - Jiri Hon
- Loschmidt Laboratories, Centre for Toxic Compounds in the Environment (RECETOX), and Department of Experimental Biology, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
- IT4Innovations Centre of Excellence, Faculty of Information Technology, Brno University of Technology, 612 66 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital, Pekarska 53, 656 91 Brno, Czech Republic
| | - David Bednar
- Loschmidt Laboratories, Centre for Toxic Compounds in the Environment (RECETOX), and Department of Experimental Biology, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital, Pekarska 53, 656 91 Brno, Czech Republic
| | - Jiri Damborsky
- Loschmidt Laboratories, Centre for Toxic Compounds in the Environment (RECETOX), and Department of Experimental Biology, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital, Pekarska 53, 656 91 Brno, Czech Republic
| |
Collapse
|
29
|
Elbasir A, Moovarkumudalvan B, Kunji K, Kolatkar PR, Mall R, Bensmail H. DeepCrystal: a deep learning framework for sequence-based protein crystallization prediction. Bioinformatics 2018; 35:2216-2225. [DOI: 10.1093/bioinformatics/bty953] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2018] [Revised: 10/31/2018] [Accepted: 11/17/2018] [Indexed: 12/11/2022] Open
Abstract
Abstract
Motivation
Protein structure determination has primarily been performed using X-ray crystallography. To overcome the expensive cost, high attrition rate and series of trial-and-error settings, many in-silico methods have been developed to predict crystallization propensities of proteins based on their sequences. However, the majority of these methods build their predictors by extracting features from protein sequences, which is computationally expensive and can explode the feature space. We propose DeepCrystal, a deep learning framework for sequence-based protein crystallization prediction. It uses deep learning to identify proteins which can produce diffraction-quality crystals without the need to manually engineer additional biochemical and structural features from sequence. Our model is based on convolutional neural networks, which can exploit frequently occurring k-mers and sets of k-mers from the protein sequences to distinguish proteins that will result in diffraction-quality crystals from those that will not.
Results
Our model surpasses previous sequence-based protein crystallization predictors in terms of recall, F-score, accuracy and Matthew’s correlation coefficient (MCC) on three independent test sets. DeepCrystal achieves an average improvement of 1.4, 12.1% in recall, when compared to its closest competitors, Crysalis II and Crysf, respectively. In addition, DeepCrystal attains an average improvement of 2.1, 6.0% for F-score, 1.9, 3.9% for accuracy and 3.8, 7.0% for MCC w.r.t. Crysalis II and Crysf on independent test sets.
Availability and implementation
The standalone source code and models are available at https://github.com/elbasir/DeepCrystal and a web-server is also available at https://deeplearning-protein.qcri.org.
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Abdurrahman Elbasir
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| | | | - Khalid Kunji
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
| | - Prasanna R Kolatkar
- Qatar Biomedical Research Institute and Hamad Bin Khalifa University, Doha, Qatar
| | - Raghvendra Mall
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
| | - Halima Bensmail
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
| |
Collapse
|
30
|
Panda B, Majhi B. A novel improved prediction of protein structural class using deep recurrent neural network. EVOLUTIONARY INTELLIGENCE 2018. [DOI: 10.1007/s12065-018-0171-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
|
31
|
Yu B, Li S, Qiu W, Wang M, Du J, Zhang Y, Chen X. Prediction of subcellular location of apoptosis proteins by incorporating PsePSSM and DCCA coefficient based on LFDA dimensionality reduction. BMC Genomics 2018; 19:478. [PMID: 29914358 PMCID: PMC6006758 DOI: 10.1186/s12864-018-4849-9] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2017] [Accepted: 06/01/2018] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Apoptosis is associated with some human diseases, including cancer, autoimmune disease, neurodegenerative disease and ischemic damage, etc. Apoptosis proteins subcellular localization information is very important for understanding the mechanism of programmed cell death and the development of drugs. Therefore, the prediction of subcellular localization of apoptosis protein is still a challenging task. RESULTS In this paper, we propose a novel method for predicting apoptosis protein subcellular localization, called PsePSSM-DCCA-LFDA. Firstly, the protein sequences are extracted by combining pseudo-position specific scoring matrix (PsePSSM) and detrended cross-correlation analysis coefficient (DCCA coefficient), then the extracted feature information is reduced dimensionality by LFDA (local Fisher discriminant analysis). Finally, the optimal feature vectors are input to the SVM classifier to predict subcellular location of the apoptosis proteins. The overall prediction accuracy of 99.7, 99.6 and 100% are achieved respectively on the three benchmark datasets by the most rigorous jackknife test, which is better than other state-of-the-art methods. CONCLUSION The experimental results indicate that our method can significantly improve the prediction accuracy of subcellular localization of apoptosis proteins, which is quite high to be able to become a promising tool for further proteomics studies. The source code and all datasets are available at https://github.com/QUST-BSBRC/PsePSSM-DCCA-LFDA/ .
Collapse
Affiliation(s)
- Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China. .,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China. .,School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China.
| | - Shan Li
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Wenying Qiu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Minghui Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Junwei Du
- College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Yusen Zhang
- School of Mathematics and Statistics, Shandong University at Weihai, Weihai, 264209, China
| | - Xing Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 21116, China
| |
Collapse
|
32
|
Morshedian A, Razmara J, Lotfi S. A novel approach for protein structure prediction based on an estimation of distribution algorithm. Soft comput 2018. [DOI: 10.1007/s00500-018-3130-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
|
33
|
|
34
|
Lee SJ, Yun JP, Koo G, Kim SW. End-to-end recognition of slab identification numbers using a deep convolutional neural network. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2017.06.017] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|