1. Wang X, Zhou J, Mueller J, Quinn D, Carvalho A, Moody TS, Huang M. BioStructNet: Structure-Based Network with Transfer Learning for Predicting Biocatalyst Functions. J Chem Theory Comput 2025; 21:474-490. [PMID: 39705058 PMCID: PMC11736791 DOI: 10.1021/acs.jctc.4c01391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2024] [Revised: 12/03/2024] [Accepted: 12/12/2024] [Indexed: 12/21/2024]
Abstract
Enzyme-substrate interactions are essential to both biological processes and industrial applications. Advanced machine learning techniques have significantly accelerated biocatalysis research, revolutionizing the prediction of biocatalytic activities and facilitating the discovery of novel biocatalysts. However, the limited availability of data for specific enzyme functions, such as conversion efficiency and stereoselectivity, presents challenges for prediction accuracy. In this study, we developed BioStructNet, a structure-based deep learning network that integrates both protein and ligand structural data to capture the complexity of enzyme-substrate interactions. Benchmarking studies with different algorithms showed the enhanced predictive accuracy of BioStructNet. To further optimize the prediction accuracy for the small data set, we implemented transfer learning in the framework, training a source model on a large data set and fine-tuning it on a small, function-specific data set, using the CalB data set as a case study. The model performance was validated by comparing the attention heat maps generated by the BioStructNet interaction module with the enzyme-substrate interactions revealed from molecular dynamics simulations of enzyme-substrate complexes. BioStructNet would accelerate the discovery of functional enzymes for industrial use, particularly in cases where the training data sets for machine learning are small.
Affiliation(s)
- Xiangwen Wang
  - School of Chemistry and Chemical Engineering, Queen’s University Belfast, BT9 5AG Belfast, Northern Ireland, U.K.
  - Department of Biocatalysis and Isotope Chemistry, Almac Sciences, BT63 5QD Craigavon, Northern Ireland, U.K.
- Jiahui Zhou
  - School of Chemistry and Chemical Engineering, Queen’s University Belfast, BT9 5AG Belfast, Northern Ireland, U.K.
- Jane Mueller
  - Department of Biocatalysis and Isotope Chemistry, Almac Sciences, BT63 5QD Craigavon, Northern Ireland, U.K.
- Derek Quinn
  - Department of Biocatalysis and Isotope Chemistry, Almac Sciences, BT63 5QD Craigavon, Northern Ireland, U.K.
- Alexandra Carvalho
  - Department of Biocatalysis and Isotope Chemistry, Almac Sciences, BT63 5QD Craigavon, Northern Ireland, U.K.
- Thomas S. Moody
  - Department of Biocatalysis and Isotope Chemistry, Almac Sciences, BT63 5QD Craigavon, Northern Ireland, U.K.
  - Arran Chemical Company Limited, Unit 1 Monksland Industrial Estate, Athlone, Co. Roscommon N37 DN24, Ireland
- Meilan Huang
  - School of Chemistry and Chemical Engineering, Queen’s University Belfast, BT9 5AG Belfast, Northern Ireland, U.K.
2. Yao L, Guan J, Xie P, Chung CR, Zhao Z, Dong D, Guo Y, Zhang W, Deng J, Pang Y, Liu Y, Peng Y, Horng JT, Chiang YC, Lee TY. dbAMP 3.0: updated resource of antimicrobial activity and structural annotation of peptides in the post-pandemic era. Nucleic Acids Res 2025; 53:D364-D376. [PMID: 39540425 PMCID: PMC11701527 DOI: 10.1093/nar/gkae1019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2024] [Revised: 10/12/2024] [Accepted: 11/06/2024] [Indexed: 11/16/2024] [Open access]
Abstract
Antimicrobial resistance is one of the most urgent global health threats, especially in the post-pandemic era. Antimicrobial peptides (AMPs) offer a promising alternative to traditional antibiotics, driving growing interest in recent years. dbAMP is a comprehensive database offering extensive annotations on AMPs, including sequence information, functional activity data, physicochemical properties and structural annotations. In this update, dbAMP has curated data from over 5200 publications, encompassing 33,065 AMPs and 2453 antimicrobial proteins from 3534 organisms. Additionally, dbAMP utilizes ESMFold to determine the three-dimensional structures of AMPs, providing over 30,000 structural annotations that facilitate structure-based functional insights for clinical drug development. Furthermore, dbAMP employs molecular docking techniques, providing over 100 docked complexes that contribute useful insights into the potential mechanisms of AMPs. The toxicity and stability of AMPs are critical factors in assessing their potential as clinical drugs. The updated dbAMP introduced an efficient tool for evaluating the hemolytic toxicity and half-life of AMPs, alongside an AMP optimization platform for designing AMPs with high antimicrobial activity, reduced toxicity and increased stability. The updated dbAMP is freely accessible at https://awi.cuhk.edu.cn/dbAMP/. Overall, dbAMP represents a comprehensive and essential resource for AMP analysis and design, poised to advance antimicrobial strategies in the post-pandemic era.
Affiliation(s)
- Lantian Yao
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
- Jiahui Guan
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
- Peilin Xie
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
- Chia-Ru Chung
  - Department of Computer Science and Information Engineering, National Central University, 320317, Taoyuan, Taiwan
- Zhihao Zhao
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
- Danhong Dong
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
- Yilin Guo
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
- Wenyang Zhang
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
- Junyang Deng
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
- Yuxuan Pang
  - Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, 108-8639, Tokyo, Japan
- Yulan Liu
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
- Yunlu Peng
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
- Jorng-Tzong Horng
  - Department of Computer Science and Information Engineering, National Central University, 320317, Taoyuan, Taiwan
- Ying-Chih Chiang
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172, Shenzhen, China
- Tzong-Yi Lee
  - Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, 300093, Hsinchu, Taiwan
  - Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, 300093, Hsinchu, Taiwan
3. Lin B, Luo X, Liu Y, Jin X. A comprehensive review and comparison of existing computational methods for protein function prediction. Brief Bioinform 2024; 25:bbae289. [PMID: 39003530 PMCID: PMC11246557 DOI: 10.1093/bib/bbae289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 05/18/2024] [Indexed: 07/15/2024] [Open access]
Abstract
Protein function prediction is critical for understanding the cellular physiological and biochemical processes, and it opens up new possibilities for advancements in fields such as disease research and drug discovery. During the past decades, with the exponential growth of protein sequence data, many computational methods for predicting protein function have been proposed. Therefore, a systematic review and comparison of these methods are necessary. In this study, we divide these methods into four different categories, including sequence-based methods, 3D structure-based methods, PPI network-based methods and hybrid information-based methods. Furthermore, their advantages and disadvantages are discussed, and then their performance is comprehensively evaluated and compared. Finally, we discuss the challenges and opportunities present in this field.
Affiliation(s)
- Baohui Lin
  - College of Big Data and Internet, Shenzhen Technology University, Shenzhen, Guangdong 518118, China
- Xiaoling Luo
  - Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen, Guangdong, China
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518061, China
- Yumeng Liu
  - College of Big Data and Internet, Shenzhen Technology University, Shenzhen, Guangdong 518118, China
- Xiaopeng Jin
  - College of Big Data and Internet, Shenzhen Technology University, Shenzhen, Guangdong 518118, China
4. Kumar N, Srivastava R. Deep learning in structural bioinformatics: current applications and future perspectives. Brief Bioinform 2024; 25:bbae042. [PMID: 38701422 PMCID: PMC11066934 DOI: 10.1093/bib/bbae042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Revised: 01/05/2024] [Accepted: 01/18/2024] [Indexed: 05/05/2024] [Open access]
Abstract
In this review article, we explore the transformative impact of deep learning (DL) on structural bioinformatics, emphasizing its pivotal role in a scientific revolution driven by extensive data, accessible toolkits and robust computing resources. As big data continue to advance, DL is poised to become an integral component in healthcare and biology, revolutionizing analytical processes. Our comprehensive review provides detailed insights into DL, featuring specific demonstrations of its notable applications in bioinformatics. We address challenges tailored for DL, spotlight recent successes in structural bioinformatics and present a clear exposition of DL, from basic shallow neural networks to advanced models such as convolutional, recurrent, artificial and transformer neural networks. This paper discusses the emerging use of DL for understanding biomolecular structures, anticipating ongoing developments and applications in the realm of structural bioinformatics.
Affiliation(s)
- Niranjan Kumar
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
- Rakesh Srivastava
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, India
5. Wei Q, Wang R, Jiang Y, Wei L, Sun Y, Geng J, Su R. ConPep: Prediction of peptide contact maps with pre-trained biological language model and multi-view feature extracting strategy. Comput Biol Med 2023; 167:107631. [PMID: 37948966 DOI: 10.1016/j.compbiomed.2023.107631] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/04/2023] [Revised: 10/16/2023] [Accepted: 10/23/2023] [Indexed: 11/12/2023]
Abstract
The accurate prediction of peptide contact maps remains a challenging task due to the difficulty in obtaining the interactive information between residues on short sequences. To address this challenge, we propose ConPep, a deep learning framework designed for predicting the contact map of peptides based on sequences only. To sufficiently incorporate the sequential semantic information between residues in peptide sequences, we use a pre-trained biological language model and transfer prior knowledge from large scale databases. Additionally, to extract and integrate sequential local information and residue-based global correlations, our model incorporates Bidirectional Gated Recurrent Unit and attention mechanisms. They can obtain multi-view features and thus enhance the accuracy and robustness of our prediction. Comparative results on independent tests demonstrate that our proposed method significantly outperforms state-of-the-art methods even with short peptides. Notably, our method exhibits superior performance at the sequence level, suggesting the robust ability of our model compared with the multiple sequence alignment (MSA) analysis-based methods. We expect it can be meaningful research for facilitating the wide use of our method.
Affiliation(s)
- Qingxin Wei
  - School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
- Ruheng Wang
  - School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
- Yi Jiang
  - School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
- Leyi Wei
  - School of Software, Shandong University, Jinan, China; Centre for Artificial Intelligence driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao SAR, China
- Yu Sun
  - Beidahuang Industry Group General Hospital, Harbin, China
- Jie Geng
  - Department of Cardiology, Tianjin Chest Hospital, Tianjin, China
- Ran Su
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
6. Stahl K, Graziadei A, Dau T, Brock O, Rappsilber J. Protein structure prediction with in-cell photo-crosslinking mass spectrometry and deep learning. Nat Biotechnol 2023; 41:1810-1819. [PMID: 36941363 PMCID: PMC10713450 DOI: 10.1038/s41587-023-01704-z] [Citation(s) in RCA: 69] [Impact Index Per Article: 34.5] [Received: 09/26/2022] [Accepted: 02/06/2023] [Indexed: 03/23/2023]
Abstract
While AlphaFold2 can predict accurate protein structures from the primary sequence, challenges remain for proteins that undergo conformational changes or for which few homologous sequences are known. Here we introduce AlphaLink, a modified version of the AlphaFold2 algorithm that incorporates experimental distance restraint information into its network architecture. By employing sparse experimental contacts as anchor points, AlphaLink improves on the performance of AlphaFold2 in predicting challenging targets. We confirm this experimentally by using the noncanonical amino acid photo-leucine to obtain information on residue-residue contacts inside cells by crosslinking mass spectrometry. The program can predict distinct conformations of proteins on the basis of the distance restraints provided, demonstrating the value of experimental data in driving protein structure prediction. The noise-tolerant framework for integrating data in protein structure prediction presented here opens a path to accurate characterization of protein structures from in-cell data.
Affiliation(s)
- Kolja Stahl
  - Robotics and Biology Laboratory, Technische Universität Berlin, Berlin, Germany
- Andrea Graziadei
  - Technische Universität Berlin, Chair of Bioanalytics, Berlin, Germany
- Therese Dau
  - Technische Universität Berlin, Chair of Bioanalytics, Berlin, Germany
  - Fritz Lipmann Institute, Leibniz Institute on Aging, Jena, Germany
- Oliver Brock
  - Robotics and Biology Laboratory, Technische Universität Berlin, Berlin, Germany
  - Science of Intelligence, Research Cluster of Excellence, Berlin, Germany
- Juri Rappsilber
  - Technische Universität Berlin, Chair of Bioanalytics, Berlin, Germany
  - Si-M/'Der Simulierte Mensch', a Science Framework of Technische Universität Berlin and Charité - Universitätsmedizin Berlin, Berlin, Germany
  - Wellcome Centre for Cell Biology, University of Edinburgh, Edinburgh, UK
7. Huang B, Kong L, Wang C, Ju F, Zhang Q, Zhu J, Gong T, Zhang H, Yu C, Zheng WM, Bu D. Protein Structure Prediction: Challenges, Advances, and the Shift of Research Paradigms. Genomics, Proteomics & Bioinformatics 2023; 21:913-925. [PMID: 37001856 PMCID: PMC10928435 DOI: 10.1016/j.gpb.2022.11.014] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 10/15/2022] [Revised: 11/23/2022] [Accepted: 11/30/2022] [Indexed: 03/31/2023]
Abstract
Protein structure prediction is an interdisciplinary research topic that has attracted researchers from multiple fields, including biochemistry, medicine, physics, mathematics, and computer science. These researchers adopt various research paradigms to attack the same structure prediction problem: biochemists and physicists attempt to reveal the principles governing protein folding; mathematicians, especially statisticians, usually start from assuming a probability distribution of protein structures given a target sequence and then find the most likely structure, while computer scientists formulate protein structure prediction as an optimization problem - finding the structural conformation with the lowest energy or minimizing the difference between predicted structure and native structure. These research paradigms fall into the two statistical modeling cultures proposed by Leo Breiman, namely, data modeling and algorithmic modeling. Recently, we have also witnessed the great success of deep learning in protein structure prediction. In this review, we present a survey of the efforts for protein structure prediction. We compare the research paradigms adopted by researchers from different fields, with an emphasis on the shift of research paradigms in the era of deep learning. In short, the algorithmic modeling techniques, especially deep neural networks, have considerably improved the accuracy of protein structure prediction; however, theories interpreting the neural networks and knowledge on protein folding are still highly desired.
Affiliation(s)
- Bin Huang
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Lupeng Kong
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Changping Laboratory, Beijing 102206, China
- Chao Wang
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Fusong Ju
  - Microsoft Research AI4Science, Beijing 100080, China
- Qi Zhang
  - Huawei Noah's Ark Lab, Wuhan 430206, China
- Jianwei Zhu
  - Microsoft Research AI4Science, Beijing 100080, China
- Tiansu Gong
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Haicang Zhang
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
- Chungong Yu
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
- Wei-Mou Zheng
  - Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
- Dongbo Bu
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
8. Hu J, Zhao Y, Li Q, Song Y, Dong R, Yang W, Siriwardane EM. Deep Learning-Based Prediction of Contact Maps and Crystal Structures of Inorganic Materials. ACS Omega 2023; 8:26170-26179. [PMID: 37521616 PMCID: PMC10373470 DOI: 10.1021/acsomega.3c02115] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/08/2023] [Accepted: 06/28/2023] [Indexed: 08/01/2023]
Abstract
Crystal structure prediction is one of the major unsolved problems in materials science. Traditionally, this problem is formulated as a global optimization problem for which global search algorithms are combined with first-principles free energy calculations to predict the ground-state crystal structure of a given material composition. These ab initio algorithms are currently too slow for predicting complex material structures. Inspired by the AlphaFold algorithm for protein structure prediction, herein, we propose AlphaCrystal, a crystal structure prediction algorithm that combines a deep residual neural network model for predicting the atomic contact map of a target material followed by three-dimensional (3D) structure reconstruction using genetic algorithms. Extensive experiments on 20 benchmark structures showed that our AlphaCrystal algorithm can predict structures close to the ground truth structures, which can significantly speed up the crystal structure prediction and handle relatively large systems.
Affiliation(s)
- Jianjun Hu
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29201, United States
- Yong Zhao
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29201, United States
- Qin Li
  - College of Big Data and Statistics, Guizhou University of Finance and Economics, Guiyang 550050, China
- Yuqi Song
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29201, United States
- Rongzhi Dong
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29201, United States
- Wenhui Yang
  - School of Mechanical Engineering, Guizhou University, Guiyang 550050, China
9. Manzoori S, Farahani AHK, Moradi MH, Kazemi-Bonchenari M. Detecting SNP markers discriminating horse breeds by deep learning. Sci Rep 2023; 13:11592. [PMID: 37464049 DOI: 10.1038/s41598-023-38601-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2022] [Accepted: 07/11/2023] [Indexed: 07/20/2023] [Open access]
Abstract
The assignment of an individual to its true population of origin using a small panel of discriminant SNP markers is one of the most important practical applications of genomic data. The aim of this study was to evaluate the potential of different artificial neural network (ANN) approaches, comprising deep neural networks (DNN) and the Garson and Olden methods, for selecting informative SNP markers from high-throughput genotyping data that can trace the true breed of unknown samples. A total of 795 animals from 37 breeds, genotyped with the Illumina SNP 50k BeadChip, were used in the current study, and principal component analysis (PCA), log-likelihood ratios (LLR) and Neighbor-Joining (NJ) were applied to assess the performance of the different assignment methods. The results revealed that the DNN, Garson, and Olden methods are able to assign individuals to their true populations with 4270, 4937, and 7999 SNP markers, respectively. PCA was used to determine how the animals were allocated to groups using all genotyped markers available on the 50k BeadChip and the subsets of SNP markers identified by the different methods. The results indicated that all SNP panels are able to assign individuals to their true breeds. The success of genetic assignment, assessed at different LLR thresholds, showed that a success rate of 70% was obtained by the three methods with 110, 208, and 178 markers for the DNN, Garson, and Olden methods, respectively. The results also showed that DNN performed better than the other two approaches, achieving 93% accuracy at the most stringent threshold. Finally, the identified SNPs were successfully used in independent out-group breeds consisting of 120 individuals from eight breeds, and the results indicated that these markers are able to correctly allocate all unknown samples to their true population of origin. Furthermore, the NJ tree of allele-sharing distances on the validation dataset showed that the DNN has a high potential for feature selection. In general, the results of this study indicated that the DNN technique represents an efficient strategy for selecting a reduced pool of highly discriminant markers for assigning individuals to the true population of origin.
Affiliation(s)
- Siavash Manzoori
  - Department of Animal Science, Faculty of Agriculture and Natural Resources, Arak University, Arak, Iran
- Mohammad Hossein Moradi
  - Department of Animal Science, Faculty of Agriculture and Natural Resources, Arak University, Arak, Iran
- Mehdi Kazemi-Bonchenari
  - Department of Animal Science, Faculty of Agriculture and Natural Resources, Arak University, Arak, Iran
10. Ibtehaz N, Sourav SMSH, Bayzid MS, Rahman MS. Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis. Protein J 2023; 42:135-146. [PMID: 36977849 DOI: 10.1007/s10930-023-10096-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Accepted: 02/13/2023] [Indexed: 03/29/2023]
Abstract
The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, in recent years there have been a number of breakthroughs in the domain of Natural Language Processing. Since these methods are capable of performing different tasks when trained with a sufficient amount of data, off-the-shelf models are used to perform various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and made an attempt to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram aid the modeling and training of deep learning models better. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
11. Mufassirin MMM, Newton MAH, Sattar A. Artificial intelligence for template-free protein structure prediction: a comprehensive review. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10350-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
12. Barger J, Adhikari B. New Labeling Methods for Deep Learning Real-Valued Inter-Residue Distance Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3586-3594. [PMID: 34559660 DOI: 10.1109/tcbb.2021.3115053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
BACKGROUND: Much of the recent success in protein structure prediction has been a result of accurate protein contact prediction, a binary classification problem. Dozens of methods, built from various types of machine learning and deep learning algorithms, have been published over the last two decades for predicting contacts. Recently, many groups, including Google DeepMind, have demonstrated that reformulating the problem as a multi-class classification problem is a more promising direction to pursue. As an alternative approach, we recently proposed real-valued distance predictions, formulating the problem as a regression problem. The nuances of protein 3D structures make this formulation appropriate, allowing predictions to reflect inter-residue distances in nature. Despite these promises, the accurate prediction of real-valued distances remains relatively unexplored; possibly due to classification being better suited to machine and deep learning algorithms.
METHODS: Can regression methods be designed to predict real-valued distances as precise as binary contacts? To investigate this, we propose multiple novel methods of input label engineering, which is different from feature engineering, with the goal of optimizing the distribution of distances to cater to the loss function of the deep-learning model. Since an important utility of predicted contacts or distances is to build three-dimensional models, we also tested if predicted distances can reconstruct more accurate models than contacts.
RESULTS: Our results demonstrate, for the first time, that deep learning methods for real-valued protein distance prediction can deliver distances as precise as binary classification methods. When using an optimal distance transformation function on the standard PSICOV dataset consisting of 150 representative proteins, the precision of 'top-all' long-range contacts improves from 60.9% to 61.4% when predicting real-valued distances instead of contacts. When building three-dimensional models we observed an average TM-score increase from 0.61 to 0.72, highlighting the advantage of predicting real-valued distances.
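The comparison in this abstract rests on scoring predicted real-valued distances with a contact-precision metric. A minimal sketch of how such a score could be computed is given below. It is not the authors' code: the 8 Å contact cutoff, the sequence separation of 24 for long-range pairs, and the names `contact_precision` and `long_range_pairs` are assumptions introduced for illustration, and the abstract's 'top-all' selection rule is not reproduced (the cutoff `top_k` is left as a parameter).

```python
from typing import List, Optional, Sequence, Tuple

# Both constants are assumed conventions, not values taken from the abstract.
CONTACT_THRESHOLD = 8.0  # distance cutoff in angstroms for calling a contact
LONG_RANGE_SEP = 24      # minimum sequence separation for a long-range pair

Matrix = Sequence[Sequence[float]]


def long_range_pairs(n: int, min_sep: int) -> List[Tuple[int, int]]:
    """Return all residue pairs (i, j) with j - i >= min_sep."""
    return [(i, j) for i in range(n) for j in range(i + min_sep, n)]


def contact_precision(
    pred_dist: Matrix,
    true_dist: Matrix,
    top_k: Optional[int] = None,
    threshold: float = CONTACT_THRESHOLD,
    min_sep: int = LONG_RANGE_SEP,
) -> float:
    """Rank long-range pairs by predicted distance, shortest first, and
    return the fraction of the top_k ranked pairs that are true contacts."""
    pairs = long_range_pairs(len(true_dist), min_sep)
    ranked = sorted(pairs, key=lambda p: pred_dist[p[0]][p[1]])
    if top_k is not None:
        ranked = ranked[:top_k]
    if not ranked:
        return 0.0
    hits = sum(1 for i, j in ranked if true_dist[i][j] < threshold)
    return hits / len(ranked)


# Toy 4-residue example; min_sep is lowered to 2 so that pairs exist.
true_d = [[0.0, 3.8, 6.0, 12.0],
          [3.8, 0.0, 3.8, 7.5],
          [6.0, 3.8, 0.0, 3.8],
          [12.0, 7.5, 3.8, 0.0]]
pred_d = [[0.0, 3.8, 5.0, 7.0],
          [3.8, 0.0, 3.8, 9.0],
          [5.0, 3.8, 0.0, 3.8],
          [7.0, 9.0, 3.8, 0.0]]
print(contact_precision(pred_d, true_d, top_k=2, min_sep=2))  # prints 0.5
```

Because the ranking uses the predicted distances directly, the same function can score a regression model and, given contact probabilities converted to pseudo-distances, a classification model.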
13
|
DLm6Am: A Deep-Learning-Based Tool for Identifying N6,2′-O-Dimethyladenosine Sites in RNA Sequences. Int J Mol Sci 2022; 23:ijms231911026. [PMID: 36232325 PMCID: PMC9570463 DOI: 10.3390/ijms231911026] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 08/08/2022] [Revised: 09/10/2022] [Accepted: 09/15/2022] [Indexed: 11/25/2022] Open
Abstract
N6,2′-O-dimethyladenosine (m6Am) is a post-transcriptional modification that may be associated with regulatory roles in the control of cellular functions. Therefore, it is crucial to accurately identify transcriptome-wide m6Am sites to understand underlying m6Am-dependent mRNA regulation mechanisms and biological functions. Here, we used three sequence-based feature-encoding schemes, including one-hot, nucleotide chemical property (NCP), and nucleotide density (ND), to represent RNA sequence samples. Additionally, we proposed an ensemble deep learning framework, named DLm6Am, to identify m6Am sites. DLm6Am consists of three similar base classifiers, each of which contains a multi-head attention module, an embedding module with two parallel deep learning sub-modules, a convolutional neural network (CNN) and a Bi-directional long short-term memory (BiLSTM), and a prediction module. To demonstrate the superior performance of our model’s architecture, we compared multiple model frameworks with our method by analyzing the training data and independent testing data. Additionally, we compared our model with the existing state-of-the-art computational methods, m6AmPred and MultiRM. The accuracy (ACC) for the DLm6Am model was improved by 6.45% and 8.42% compared to that of m6AmPred and MultiRM on independent testing data, respectively, while the area under receiver operating characteristic curve (AUROC) for the DLm6Am model was increased by 4.28% and 5.75%, respectively. All the results indicate that DLm6Am achieved the best prediction performance in terms of ACC, Matthews correlation coefficient (MCC), AUROC, and the area under precision and recall curves (AUPR). To further assess the generalization performance of our proposed model, we implemented chromosome-level leave-out cross-validation, and found that the obtained AUROC values were greater than 0.83, indicating that our proposed method is robust and can accurately predict m6Am sites.
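The three sequence encodings named in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DLm6Am implementation: the NCP triples follow the commonly used (ring structure, functional group, hydrogen bonding) convention, ND is taken as the running frequency of a nucleotide up to its position, and concatenating the three encodings per position is an assumption. The names `ONE_HOT`, `NCP`, and `encode_rna` are hypothetical.

```python
# Assumed conventions; the abstract names the schemes but not their exact values.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "U": (0, 0, 0, 1)}
# (ring structure, functional group, hydrogen bonding) as binary flags.
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}


def encode_rna(seq):
    """Return one 8-dimensional feature tuple per nucleotide of an RNA sequence."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    counts = {}
    rows = []
    for pos, nt in enumerate(seq, start=1):
        counts[nt] = counts.get(nt, 0) + 1
        # Nucleotide density: occurrences of nt in seq[:pos], divided by pos.
        density = counts[nt] / pos
        rows.append(ONE_HOT[nt] + NCP[nt] + (density,))
    return rows
```

For example, `encode_rna("ACA")` yields three rows of length 8, and the density of the second `A` is 2/3 because two of the first three nucleotides are `A`.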
|
14
|
Gu J, Zhang T, Wu C, Liang Y, Shi X. Refined Contact Map Prediction of Peptides Based on GCN and ResNet. Front Genet 2022; 13:859626. [PMID: 35571037 PMCID: PMC9092020 DOI: 10.3389/fgene.2022.859626] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/21/2022] [Accepted: 03/23/2022] [Indexed: 11/13/2022] Open
Abstract
Predicting peptide inter-residue contact maps, which determine the topology of the peptide structure, plays an important role in computational biology. However, due to the limited number of known homologous structures, there is still much room for improvement in inter-residue contact map prediction. Current models are not sufficient for accurately capturing the relationships between residues, especially those separated by a long-range distance. In this article, we developed a novel deep neural network framework to refine the rough contact map produced by the existing methods. The rough contact map is used to construct the residue graph that is processed by the graph convolutional neural network (GCN). GCN can better capture the global information and is therefore used to grasp the long-range contact relationship. The residual convolutional neural network is also applied in the framework for learning local information. We conducted the experiments on four different test datasets, and the inter-residue long-range contact map prediction accuracy demonstrates the effectiveness of our proposed method.
Affiliation(s)
- Jiawei Gu
- College of Computer Science and Technology, University of Jilin, Changchun, China
- Tianhao Zhang
- College of Computer Science and Technology, University of Jilin, Changchun, China
- Chunguo Wu
- College of Computer Science and Technology, University of Jilin, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Changchun, China
- Yanchun Liang
- College of Computer Science and Technology, University of Jilin, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Changchun, China
- School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China
- Xiaohu Shi
- College of Computer Science and Technology, University of Jilin, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Changchun, China
- School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China
- *Correspondence: Xiaohu Shi,
|
15
|
Santra S, Jana M. Predicting the evolution of number of native contacts of a small protein by using deep learning approach. Comput Biol Chem 2022; 97:107625. [DOI: 10.1016/j.compbiolchem.2022.107625] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/14/2021] [Revised: 01/07/2022] [Accepted: 01/09/2022] [Indexed: 11/28/2022]
|
16
|
Lee D, Xiong D, Wierbowski S, Li L, Liang S, Yu H. Deep learning methods for 3D structural proteome and interactome modeling. Curr Opin Struct Biol 2022; 73:102329. [PMID: 35139457 PMCID: PMC8957610 DOI: 10.1016/j.sbi.2022.102329] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 09/24/2021] [Revised: 12/05/2021] [Accepted: 12/31/2021] [Indexed: 12/19/2022]
Abstract
Bolstered by recent methodological and hardware advances, deep learning has increasingly been applied to biological problems and structural proteomics. Such approaches have achieved remarkable improvements over traditional machine learning methods in tasks ranging from protein contact map prediction to protein folding, prediction of protein-protein interaction interfaces, and characterization of protein-drug binding pockets. In particular, emergence of ab initio protein structure prediction methods including AlphaFold2 has revolutionized protein structural modeling. From a protein function perspective, numerous deep learning methods have facilitated deconvolution of the exact amino acid residues and protein surface regions responsible for binding other proteins or small molecule drugs. In this review, we provide a comprehensive overview of recent deep learning methods applied in structural proteomics.
Affiliation(s)
- Dongjin Lee
- Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Dapeng Xiong
- Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Shayne Wierbowski
- Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Le Li
- Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Siqi Liang
- Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Haiyuan Yu
- Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA.
|
17
|
Abstract
Compensatory substitutions happen when one mutation is advantageously selected because it restores the loss of fitness induced by a previous deleterious mutation. How frequently such mutations occur in evolution and what structural and functional context permits their emergence remain open questions. We built an atlas of intra-protein compensatory substitutions using a phylogenetic approach and a dataset of 1,630 bacterial protein families for which high-quality sequence alignments and experimentally derived protein structures were available. We identified more than 51,000 positions coevolving by means of predicted compensatory mutations. Using the evolutionary and structural properties of the analyzed positions, we demonstrate that compensatory mutations are scarce (typically only a few in the protein history) but widespread (the majority of proteins experienced at least one). Typical coevolving residues are evolving slowly, are located in the protein core outside secondary structure motifs, and are more often in contact than expected by chance, even after accounting for their evolutionary rate and solvent exposure. An exception to this general scheme is residues coevolving for charge compensation, which are evolving faster than noncoevolving sites, in contradiction with predictions from simple coevolutionary models, but similar to stem pairs in RNA. While sites with a significant pattern of coevolution by compensatory mutations are rare, the comparative analysis of hundreds of structures ultimately permits a better understanding of the link between the three-dimensional structure of a protein and its fitness landscape.
Affiliation(s)
- Shilpi Chaurasia
- RG Molecular Systems Evolution, Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, August-Thienemann-Straße 2, 24306 Plön, Germany; Excelra Knowledge Solutions Pvt Ltd, Hyderabad, India
- Julien Y Dutheil
- RG Molecular Systems Evolution, Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, August-Thienemann-Straße 2, 24306 Plön, Germany; Institute of Evolution Sciences of Montpellier (ISEM), CNRS, University of Montpellier, IRD, EPHE, 34095 Montpellier, France
|
18
|
Yu CH, Chen W, Chiang YH, Guo K, Martin Moldes Z, Kaplan DL, Buehler MJ. End-to-End Deep Learning Model to Predict and Design Secondary Structure Content of Structural Proteins. ACS Biomater Sci Eng 2022; 8:1156-1165. [PMID: 35129957 PMCID: PMC9347213 DOI: 10.1021/acsbiomaterials.1c01343] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Indexed: 11/29/2022]
Abstract
Structural proteins are the basis of many biomaterials and key construction and functional components of all life. Further, it is well-known that the diversity of proteins' function relies on their local structures derived from their primary amino acid sequences. Here, we report a deep learning model to predict the secondary structure content of proteins directly from primary sequences, with high computational efficiency. Understanding the secondary structure content of proteins is crucial to designing proteins with targeted material functions, especially mechanical properties. Using convolutional and recurrent architectures and natural language models, our deep learning model predicts the content of two essential types of secondary structures, the α-helix and the β-sheet. The training data are collected from the Protein Data Bank and contain many existing protein geometries. We find that our model can learn the hidden features as patterns of input sequences that can then be directly related to secondary structure content. The α-helix and β-sheet content predictions show excellent agreement with training data and newly deposited protein structures that were recently identified and that were not included in the original training set. We further demonstrate the features of the model by a search for de novo protein sequences that optimize max/min α-helix/β-sheet content and compare the predictions with folded models of these sequences based on AlphaFold2. Excellent agreement is found, underscoring that our model has predictive potential for rapidly designing proteins with specific secondary structures and could be widely applied to biomedical industries, including protein biomaterial designs and regenerative medicine applications.
Affiliation(s)
- Chi-Hua Yu
- Laboratory for Atomistic and Molecular Mechanics (LAMM), Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States; Department of Engineering Science, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan
- Wei Chen
- Department of Engineering Science, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan
- Yu-Hsuan Chiang
- Department of Civil Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan
- Kai Guo
- Laboratory for Atomistic and Molecular Mechanics (LAMM), Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Zaira Martin Moldes
- Department of Biomedical Engineering, Tufts University, Medford, Massachusetts 02155, United States
- David L Kaplan
- Department of Biomedical Engineering, Tufts University, Medford, Massachusetts 02155, United States
- Markus J Buehler
- Laboratory for Atomistic and Molecular Mechanics (LAMM), Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States; Center for Computational Science and Engineering, Schwarzman College of Computing, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States; Center for Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
|
19
|
Recognition of mRNA N4 Acetylcytidine (ac4C) by Using Non-Deep vs. Deep Learning. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12031344] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 01/04/2023]
Abstract
Deep learning models have been successfully applied in a wide range of fields. The creation of a deep learning framework for analyzing high-performance sequence data has piqued the research community’s interest. N4 acetylcytidine (ac4C) is a post-transcriptional modification in mRNA that plays an important role in mRNA stability control and translation. Identifying ac4C modifications in mRNA with conventional laboratory experiments is still not simple, fast, or cost-effective. As a result, we developed DL-ac4C, a CNN-based deep learning model for ac4C recognition. In the alternative scenario, the model families are well suited to working in large datasets with a large number of available samples, especially in biological domains. In this study, the DL-ac4C method (deep learning) is compared with non-deep learning (machine learning) methods, including regression and support vector machines. The results show that DL-ac4C is more advanced than previously used approaches. The proposed model improves the accuracy recall area by 9.6 percent and 9.8 percent, respectively, for cross-validation and independent tests. More nuanced methods of incorporating prior biological knowledge into the estimation procedure of deep learning models are required to achieve better results in terms of predictive efficiency and cost-effectiveness. Based on an experiment’s acetylated dataset, the DL-ac4C sequence-based predictor for acetylation sites in mRNA can predict whether query sequences have potential acetylation motifs.
|
20
|
Zhu F, Li F, Deng L, Meng F, Liang Z. Protein Interaction Network Reconstruction with a Structural Gated Attention Deep Model by Incorporating Network Structure Information. J Chem Inf Model 2022; 62:258-273. [PMID: 35005980 DOI: 10.1021/acs.jcim.1c00982] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/22/2022]
Abstract
Protein-protein interactions (PPIs) provide a physical basis of molecular communications for a wide range of biological processes in living cells. Establishing the PPI network has become a fundamental but essential task for a better understanding of biological events and disease pathogenesis. Although many machine learning algorithms have been employed to predict PPIs, with only protein sequence information as the training features, these models suffer from low robustness and prediction accuracy. In this study, a new deep-learning-based framework named the Structural Gated Attention Deep (SGAD) model was proposed to improve the performance of PPI network reconstruction (PINR). The improved predictive performances were achieved by augmenting multiple protein sequence descriptors, the topological features and information flow of the PPI network, which were further implemented with a gating mechanism to improve its robustness to noise. On 11 independent test data sets and one combined data set, SGAD yielded area under the curve values of approximately 0.83-0.93, outperforming other models. Furthermore, the SGAD ensemble can learn more characteristics information on protein pairs through a two-layer neural network, serving as a powerful tool in the exploration of PPI biological space.
Affiliation(s)
- Fei Zhu
- School of Computer Science and Technology, Soochow University, Suzhou 215 006, China
- Feifei Li
- School of Computer Science and Technology, Soochow University, Suzhou 215 006, China
- Lei Deng
- School of Computer Science and Technology, Soochow University, Suzhou 215 006, China
- Fanwang Meng
- Department of Chemistry and Chemical Biology, McMaster University, Hamilton, Ontario L8S 4L8, Canada
- Zhongjie Liang
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215 006, China
|
21
|
Sharma A, Kumar R, Semwal R, Aier I, Tyagi P, Varadwaj PK. DeepOlf: Deep Neural Network Based Architecture for Predicting Odorants and Their Interacting Olfactory Receptors. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:418-428. [PMID: 32750862 DOI: 10.1109/tcbb.2020.3002154] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 05/22/2023]
Abstract
The olfactory transduction mechanism is triggered by the binding of odorants to specific olfactory receptors (ORs) present in the nasal cavity. Different odorants stimulate different ORs owing to differences in shape and in physical and chemical properties. In this paper, DeepOlf, a deep neural network architecture based on molecular features and fingerprints of odorants and ORs, is proposed to predict whether a chemical compound is a potential odorant, along with its interacting OR. Odorant identification and odorant-OR interaction were modeled as binary classification problems through multiple classifiers. The evaluation of these classifiers' performance showed that the deep neural network framework not only fits the data with better accuracy than other classical methods (SVM, RF, k-NN) but is also able to predict odorant-OR interactions more accurately. To our knowledge, this study is the first realization of deep learning ideas for the problem of odorant and interacting OR prediction. The accuracy of DeepOlf was found to be 94.83 and 99.92 percent for the prediction of odorants and odorant-OR interactions, respectively. Comparison of DeepOlf predictions with the existing SVM-based prediction server, ODORactor, showed that better performance can be achieved with the proposed deep learning approach. The DeepOlf tool can be accessed at https://bioserver.iiita.ac.in/deepolf/.
|
22
|
Kumar P, Hati AS. Transfer learning-based deep CNN model for multiple faults detection in SCIM. Neural Comput Appl 2021. [DOI: 10.1007/s00521-021-06205-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/21/2022]
|
23
|
Kim CA, Ricke ND, Van Voorhis T. Machine learning dynamic correlation in chemical kinetics. J Chem Phys 2021; 155:144107. [PMID: 34654306 DOI: 10.1063/5.0065874] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/17/2022] Open
Abstract
Lattice models are a useful tool to simulate the kinetics of surface reactions. Since it is expensive to propagate the probabilities of the entire lattice configurations, it is practical to consider the occupation probabilities of a typical site or a cluster of sites instead. This amounts to a moment closure approximation of the chemical master equation. Unfortunately, simple closures, such as the mean-field and the pair approximation (PA), exhibit weaknesses in systems with significant long-range correlation. In this paper, we show that machine learning (ML) can be used to construct accurate moment closures in chemical kinetics using the lattice Lotka-Volterra model as a model system. We trained feedforward neural networks on kinetic Monte Carlo (KMC) results at select values of rate constants and initial conditions. Given the same level of input as PA, the ML moment closure (MLMC) gave accurate predictions of the instantaneous three-site occupation probabilities. Solving the kinetic equations in conjunction with MLMC gave drastic improvements in the simulated dynamics and descriptions of the dynamical regimes throughout the parameter space. In this way, MLMC is a promising tool to interpolate KMC simulations or construct pretrained closures that would enable researchers to extract useful insight at a fraction of the computational cost.
Affiliation(s)
- Changhae Andrew Kim
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Nathan D Ricke
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Troy Van Voorhis
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
|
24
|
Wei H, Zhao Z, Luo R. Machine-Learned Molecular Surface and Its Application to Implicit Solvent Simulations. J Chem Theory Comput 2021; 17:6214-6224. [PMID: 34516109 DOI: 10.1021/acs.jctc.1c00492] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
Implicit solvent models, such as Poisson-Boltzmann models, play important roles in computational studies of biomolecules. A vital step in almost all implicit solvent models is to determine the solvent-solute interface, and the solvent excluded surface (SES) is the most widely used interface definition in these models. However, classical algorithms used for computing SES are geometry-based, so that they are neither suitable for parallel implementations nor convenient for obtaining surface derivatives. To address the limitations, we explored a machine learning strategy to obtain a level set formulation for the SES. The training process was conducted in three steps, eventually leading to a model with over 95% agreement with the classical SES. Visualization of tested molecular surfaces shows that the machine-learned SES overlaps with the classical SES in almost all situations. Further analyses show that the machine-learned SES is incredibly stable in terms of rotational variation of tested molecules. Our timing analysis shows that the machine-learned SES is roughly 2.5 times as efficient as the classical SES routine implemented in Amber/PBSA on a tested central processing unit (CPU) platform. We expect further performance gain on massively parallel platforms such as graphics processing units (GPUs) given the ease in converting the machine-learned SES to a parallel procedure. We also implemented the machine-learned SES into the Amber/PBSA program to study its performance on reaction field energy calculation. The analysis shows that the two sets of reaction field energies are highly consistent with a 1% deviation on average. Given its level set formulation, we expect the machine-learned SES to be applied in molecular simulations that require either surface derivatives or high efficiency on parallel computing platforms.
Affiliation(s)
- Haixin Wei
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
- Zekai Zhao
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
- Ray Luo
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
|
25
|
Wang S, He Y, Chen Z, Zhang Q. FCNGRU: Locating Transcription Factor Binding Sites by combing Fully Convolutional Neural Network with Gated Recurrent Unit. IEEE J Biomed Health Inform 2021; 26:1883-1890. [PMID: 34613923 DOI: 10.1109/jbhi.2021.3117616] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/08/2022]
Abstract
Deciphering the relationship between transcription factors (TFs) and DNA sequences is very helpful for computational inference of gene regulation and a comprehensive understanding of gene regulation mechanisms. Transcription factor binding sites (TFBSs) are specific DNA short sequences that play a pivotal role in controlling gene expression through interaction with TF proteins. Although recently many computational and deep learning methods have been proposed to predict TFBSs aiming to predict sequence specificity of TF-DNA binding, there is still a lack of effective methods to directly locate TFBSs. In order to address this problem, we propose FCNGRU, combining a fully convolutional neural network (FCN) with the gated recurrent unit (GRU), to directly locate TFBSs in this paper. Furthermore, we present a two-task framework (FCNGRU-double): one is a classification task at nucleotide level which predicts the probability of each nucleotide and locates TFBSs, and the other is a regression task at sequence level which predicts the intensity of each sequence. A series of experiments are conducted on 45 in-vitro datasets collected from the UniPROBE database derived from universal protein binding microarrays (uPBMs). Compared with competing methods, FCNGRU-double achieves much better results on these datasets. Moreover, FCNGRU-double has an advantage over a single-task framework, FCNGRU-single, which only contains the branch of locating TFBSs. In addition, we combine with in vivo datasets to make a further analysis and discussion. The source codes are available at https://github.com/wangguoguoa/FCNGRU.
|
26
|
Pearce R, Zhang Y. Toward the solution of the protein structure prediction problem. J Biol Chem 2021; 297:100870. [PMID: 34119522 PMCID: PMC8254035 DOI: 10.1016/j.jbc.2021.100870] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Received: 04/20/2021] [Revised: 06/07/2021] [Accepted: 06/09/2021] [Indexed: 11/20/2022] Open
Abstract
Since Anfinsen demonstrated that the information encoded in a protein's amino acid sequence determines its structure in 1973, solving the protein structure prediction problem has been the Holy Grail of structural biology. The goal of protein structure prediction approaches is to utilize computational modeling to determine the spatial location of every atom in a protein molecule starting from only its amino acid sequence. Depending on whether homologous structures can be found in the Protein Data Bank (PDB), structure prediction methods have been historically categorized as template-based modeling (TBM) or template-free modeling (FM) approaches. Until recently, TBM has been the most reliable approach to predicting protein structures, and in the absence of reliable templates, the modeling accuracy sharply declines. Nevertheless, the results of the most recent community-wide assessment of protein structure prediction experiment (CASP14) have demonstrated that the protein structure prediction problem can be largely solved through the use of end-to-end deep machine learning techniques, where correct folds could be built for nearly all single-domain proteins without using the PDB templates. Critically, the model quality exhibited little correlation with the quality of available template structures, as well as the number of sequence homologs detected for a given target protein. Thus, the implementation of deep-learning techniques has essentially broken through the 50-year-old modeling border between TBM and FM approaches and has made the success of high-resolution structure prediction significantly less dependent on template availability in the PDB library.
Affiliation(s)
- Robin Pearce
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan, USA.
|
27
|
Di Lena P, Baldi P. Fold recognition by scoring protein maps using the congruence coefficient. Bioinformatics 2021; 37:506-513. [PMID: 32976564 DOI: 10.1093/bioinformatics/btaa833] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/11/2020] [Revised: 09/07/2020] [Accepted: 09/10/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Protein fold recognition is a key step for template-based modeling approaches to protein structure prediction. Although closely related folds can be easily identified by sequence homology search in sequence databases, fold recognition is notoriously more difficult when it involves the identification of distantly related homologs. Recent progress in residue-residue contact and distance prediction opens up the possibility of improving fold recognition by using structural information contained in predicted distance and contact maps. RESULTS Here we propose to use the congruence coefficient as a metric of similarity between maps. We prove that this metric has several interesting mathematical properties which allow one to compute in polynomial time its exact mean and variance over all possible (exponentially many) alignments between two symmetric matrices, and assess the statistical significance of similarity between aligned maps. We perform fold recognition tests by recovering predicted target contact/distance maps from the two most recent Critical Assessment of Structure Prediction editions and over 27 000 non-homologous structural templates from the ECOD database. On this large benchmark, we compare fold recognition performances of different alignment tools with their own similarity scores against those obtained using the congruence coefficient. We show that the congruence coefficient overall improves fold recognition over other methods, proving its effectiveness as a general similarity metric for protein map comparison. AVAILABILITY AND IMPLEMENTATION The congruence coefficient software CCpro is available as part of the SCRATCH suite at: http://scratch.proteomics.ics.uci.edu/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
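For two maps that are already aligned and of equal size, the congruence coefficient can be sketched as a normalized inner product of the two matrices. The Python below is a minimal illustration under that assumption; it does not implement the alignment search or the mean and variance computations described in the abstract, and the function name `congruence_coefficient` is hypothetical rather than part of CCpro.

```python
import math


def congruence_coefficient(a, b):
    """Congruence coefficient of two equally sized matrices given as lists of lists.

    Computed as sum(a_ij * b_ij) / sqrt(sum(a_ij^2) * sum(b_ij^2)).
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    dot = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    norm_a = math.sqrt(sum(x * x for row in a for x in row))
    norm_b = math.sqrt(sum(y * y for row in b for y in row))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("coefficient is undefined for an all-zero matrix")
    return dot / (norm_a * norm_b)
```

Unlike a Pearson correlation, this quantity is computed without centering the entries, so it is invariant to rescaling either map but not to adding a constant offset.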
Affiliation(s)
- Pietro Di Lena
- Department of Computer Science and Engineering, University of Bologna, Bologna 40126, Italy
| | - Pierre Baldi
- Department of Computer Science, University of California, Irvine, CA 92697, USA; Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA
| |
|
28
|
Suh D, Lee JW, Choi S, Lee Y. Recent Applications of Deep Learning Methods on Evolution- and Contact-Based Protein Structure Prediction. Int J Mol Sci 2021; 22:6032. [PMID: 34199677 PMCID: PMC8199773 DOI: 10.3390/ijms22116032] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 05/16/2021] [Revised: 05/29/2021] [Accepted: 05/29/2021] [Indexed: 01/23/2023]
Abstract
The new advances in deep learning methods have influenced many aspects of scientific research, including the study of the protein system. The prediction of proteins' 3D structural components is now heavily dependent on machine learning techniques that interpret how protein sequences and their homology govern the inter-residue contacts and structural organization. In particular, methods employing deep neural networks have had a significant impact on the recent CASP13 and CASP14 competitions. Here, we explore the recent applications of deep learning methods in the protein structure prediction area. We also look at the potential opportunities for deep learning methods to identify unknown protein structures and functions to be discovered and help guide drug-target interactions. Although significant problems still need to be addressed, we expect these techniques in the near future to play crucial roles in protein structural bioinformatics as well as in drug discovery.
Affiliation(s)
- Donghyuk Suh
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea; (D.S.); (J.W.L.); (S.C.)
| | - Jai Woo Lee
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea; (D.S.); (J.W.L.); (S.C.)
| | - Sun Choi
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea; (D.S.); (J.W.L.); (S.C.)
| | - Yoonji Lee
- College of Pharmacy, Chung-Ang University, Seoul 06974, Korea
| |
|
29
|
Kumar P, Hati AS. Deep convolutional neural network based on adaptive gradient optimizer for fault detection in SCIM. ISA TRANSACTIONS 2021; 111:350-359. [PMID: 33121731 DOI: 10.1016/j.isatra.2020.10.052] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 03/16/2020] [Revised: 09/28/2020] [Accepted: 10/20/2020] [Indexed: 06/11/2023]
Abstract
Early fault detection in a squirrel cage induction motor (SCIM) can minimize downtime and maximize production. This paper presents an adaptive gradient optimizer based deep convolutional neural network (ADG-dCNN) technique for bearing and rotor fault detection in squirrel cage induction motors. Multiple MEMS accelerometers have been used for vibration data collection, and sensor data fusion is employed in the model training and testing. ADG-dCNN allows automatic feature extraction from the vibration data, minimizing the need for human expertise and reducing human intervention. It eliminates the error caused by manual feature extraction and selection, which is dependent on prior knowledge of fault types. This paper presents an end-to-end learning fault detection system based on deep CNN. The dataset for training and testing of the proposed method is generated from the test set-up. The proposed classifier attained an average accuracy of 99.70%. This paper also presents the recently developed SHapley Additive exPlanations (SHAP) methodology for evaluation of fault classification from the proposed model. The proposed technique can also be extended to other machinery with multiple sensors owing to its end-to-end learning abilities.
Affiliation(s)
- Prashant Kumar
- Department of Mining Machinery Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad, Jharkhand, 826004, India.
| | - Ananda Shankar Hati
- Department of Mining Machinery Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad, Jharkhand, 826004, India.
| |
|
30
|
Kumar R, Khan FU, Sharma A, Siddiqui MH, Aziz IB, Kamal MA, Ashraf GM, Alghamdi BS, Uddin MS. A deep neural network-based approach for prediction of mutagenicity of compounds. ENVIRONMENTAL SCIENCE AND POLLUTION RESEARCH INTERNATIONAL 2021; 28:47641-47650. [PMID: 33895950 DOI: 10.1007/s11356-021-14028-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/17/2021] [Accepted: 04/16/2021] [Indexed: 02/05/2023]
Abstract
We are exposed to various chemical compounds present in the environment, cosmetics, and drugs almost every day. Mutagenicity is a valuable property that plays a significant role in establishing a chemical compound's safety. Exposure and handling of mutagenic chemicals in the environment pose a high health risk; therefore, identification and screening of these chemicals are essential. Considering the time constraints and the pressure to avoid laboratory animals' use, the shift to alternative methodologies that can establish a rapid and cost-effective detection without undue over-conservation seems critical. In this regard, computational detection and identification of the mutagens in environmental samples like drugs, pesticides, dyes, reagents, wastewater, cosmetics, and other substances is vital. Over the last two decades, there have been numerous efforts to develop prediction models for mutagenicity, and by far, machine learning methods have demonstrated some noteworthy performance and reliability. However, the accuracy of such prediction models has always been one of the major concerns for the researchers working in this area. The mutagenicity prediction models were developed using deep neural network (DNN), support vector machine, k-nearest neighbor, and random forest. The developed classifiers were based on 3039 compounds and validated on 1014 compounds; each of them encoded with 1597 molecular feature vectors. The DNN-based prediction model yielded the highest prediction accuracy of 92.95% and 83.81% with the training and test data, respectively. The area under the receiver operating characteristic curve and precision-recall curve values were found to be 0.894 and 0.838, respectively. The DNN-based classifier not only fits the data with better performance as compared to traditional machine learning algorithms, viz., support vector machine, k-nearest neighbor, and random forest (with and without feature reduction) but also yields better performance metrics.
In the current work, we propose a DNN-based model to predict mutagenicity of compounds.
Affiliation(s)
- Rajnish Kumar
- Amity Institute of Biotechnology, Amity University Uttar Pradesh, Lucknow Campus, Lucknow, Uttar Pradesh, India.
| | - Farhat Ullah Khan
- Computer and Information Sciences Department, Universiti Teknologi Petronas, 32610, Seri Iskander, Perak, Malaysia
| | - Anju Sharma
- Department of Applied Science, Indian Institute of Information Technology, Allahabad, Uttar Pradesh, India
| | - Mohammed Haris Siddiqui
- Department of Bioengineering, Integral University, Dasauli, P.O. Basha, Kursi Road, Lucknow, Uttar Pradesh, India
| | - Izzatdin Ba Aziz
- Computer and Information Sciences Department, Universiti Teknologi Petronas, 32610, Seri Iskander, Perak, Malaysia
| | - Mohammad Amjad Kamal
- West China School of Nursing / Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu 610041, Sichuan, China
- King Fahd Medical Research Center, King Abdulaziz University, P. O. Box 80216, Jeddah 21589, Saudi Arabia
- Enzymoics, Novel Global Community Educational Foundation, Hebersham, New South Wales, Australia
| | - Ghulam Md Ashraf
- Pre-Clinical Research Unit, King Fahd Medical Research Center, King Abdulaziz University, Jeddah, Saudi Arabia.
- Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia.
| | - Badrah S Alghamdi
- Pre-Clinical Research Unit, King Fahd Medical Research Center, King Abdulaziz University, Jeddah, Saudi Arabia
- Department of Physiology, Neuroscience Unit, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Md Sahab Uddin
- Department of Pharmacy, Southeast University, Dhaka, Bangladesh.
- Pharmakon Neuroscience Research Network, Dhaka, Bangladesh.
| |
|
31
|
Wang Y, Yang Y, Chen S, Wang J. DeepDRK: a deep learning framework for drug repurposing through kernel-based multi-omics integration. Brief Bioinform 2021; 22:6210072. [PMID: 33822890 DOI: 10.1093/bib/bbab048] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 08/28/2020] [Revised: 01/16/2021] [Accepted: 01/30/2021] [Indexed: 12/11/2022]
Abstract
Recent pharmacogenomic studies that generate sequencing data coupled with pharmacological characteristics for patient-derived cancer cell lines led to large amounts of multi-omics data for precision cancer medicine. Among various obstacles hindering clinical translation, the lack of effective methods for multimodal and multisource data integration is becoming a bottleneck. Here we proposed DeepDRK, a machine learning framework for deciphering drug response through kernel-based data integration. To transfer information among different drugs and cancer types, we trained deep neural networks on more than 20 000 pan-cancer cell line-anticancer drug pairs. These pairs were characterized by kernel-based similarity matrices integrating multisource and multi-omics data including genomics, transcriptomics, epigenomics, chemical properties of compounds and known drug-target interactions. Applied to benchmark cancer cell line datasets, our model surpassed previous approaches with higher accuracy and better robustness. Then we applied our model on newly established patient-derived cancer cell lines and achieved satisfactory performance with AUC of 0.84 and AUPRC of 0.77. Moreover, DeepDRK was used to predict clinical response of cancer patients. Notably, the prediction of DeepDRK correlated well with clinical outcome of patients and revealed multiple drug repurposing candidates. In sum, DeepDRK provided a computational method to predict drug response of cancer cells from integrating pharmacogenomic datasets, offering an alternative way to prioritize repurposing drugs in precision cancer treatment. DeepDRK is freely available via https://github.com/wangyc82/DeepDRK.
Affiliation(s)
- Yongcui Wang
- Key Laboratory of Adaptation and Evolution of Plateau Biota at Northwest Institute of Plateau Biology, Chinese Academy of Sciences, China
| | - Yingxi Yang
- Department of Chemical and Biological Engineering at The Hong Kong University of Science and Technology, China
| | - Shilong Chen
- Key Laboratory of Adaptation and Evolution of Plateau Biota at Institute of Sanjiangyuan National Park, Chinese Academy of Sciences, China
| | - Jiguang Wang
- Division of Life Science, Department of Chemical and Biological Engineering, and State Key Laboratory of Molecular Neuroscience at The Hong Kong University of Science and Technology, China
| |
|
32
|
Deducing high-accuracy protein contact-maps from a triplet of coevolutionary matrices through deep residual convolutional networks. PLoS Comput Biol 2021; 17:e1008865. [PMID: 33770072 PMCID: PMC8026059 DOI: 10.1371/journal.pcbi.1008865] [Citation(s) in RCA: 55] [Impact Index Per Article: 13.8] [Received: 09/21/2020] [Revised: 04/07/2021] [Accepted: 03/10/2021] [Indexed: 12/24/2022]
Abstract
The topology of protein folds can be specified by the inter-residue contact-maps and accurate contact-map prediction can help ab initio structure folding. We developed TripletRes to deduce protein contact-maps from discretized distance profiles by end-to-end training of deep residual neural-networks. Compared to previous approaches, the major advantage of TripletRes is in its ability to learn and directly fuse a triplet of coevolutionary matrices extracted from the whole-genome and metagenome databases and therefore minimize the information loss during the course of contact model training. TripletRes was tested on a large set of 245 non-homologous proteins from CASP 11&12 and CAMEO experiments and outperformed other top methods from CASP12 by at least 58.4% for the CASP 11&12 targets and 44.4% for the CAMEO targets in the top-L long-range contact precision. On the 31 FM targets from the latest CASP13 challenge, TripletRes achieved the highest precision (71.6%) for the top-L/5 long-range contact predictions. It was also shown that a simple re-training of the TripletRes model with more proteins can lead to further improvement with precisions comparable to state-of-the-art methods developed after CASP13. These results demonstrate a novel efficient approach to extend the power of deep convolutional networks for high-accuracy medium- and long-range protein contact-map predictions starting from primary sequences, which are critical for constructing 3D structure of proteins that lack homologous templates in the PDB library. Ab initio protein folding has been a major unsolved problem in computational biology for more than half a century. Recent community-wide Critical Assessment of Structure Prediction (CASP) experiments have witnessed exciting progress on ab initio structure prediction, which was mainly powered by the boosting of contact-map prediction as the latter can be used as constraints to guide ab initio folding simulations. 
In this work, we proposed a new open-source deep-learning architecture, TripletRes, built on the residual convolutional neural networks for high-accuracy contact prediction. The large-scale benchmark and blind test results demonstrate competitive performance of the proposed methods to other top approaches in predicting medium- and long-range contact-maps that are critical for guiding protein folding simulations. Detailed data analyses showed that the major advantage of TripletRes lies in the unique protocol to fuse multiple evolutionary feature matrices which are directly extracted from whole-genome and metagenome databases and therefore minimize the information loss during the contact model training.
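The top-L and top-L/5 long-range precision figures quoted in this abstract follow a common evaluation convention, which can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' evaluation script: long-range pairs are taken to be those with sequence separation of at least 24 residues (a widely used CASP convention that the abstract does not spell out), and the function name, arguments, and toy data are hypothetical.

```python
def top_contact_precision(pred, true_contacts, seq_len, k=1, min_sep=24):
    """Precision of the top L/k predicted long-range contacts.

    pred: dict mapping residue pairs (i, j) with i < j to predicted probability.
    true_contacts: set of (i, j) pairs that are contacts in the native structure.
    seq_len: sequence length L.
    min_sep: assumed minimum |i - j| for a pair to count as long-range.
    """
    # Keep only long-range pairs, then rank them by predicted probability.
    candidates = [(p, pair) for pair, p in pred.items()
                  if pair[1] - pair[0] >= min_sep]
    candidates.sort(key=lambda item: item[0], reverse=True)
    # Evaluate the top L/k predictions (at least one).
    top = candidates[:max(1, seq_len // k)]
    if not top:
        return 0.0
    hits = sum(1 for _, pair in top if pair in true_contacts)
    return hits / len(top)


# Hypothetical predictions for a 60-residue protein; (1, 3) is short-range
# and is therefore ignored by the long-range filter.
pred = {(0, 30): 0.9, (5, 40): 0.8, (10, 50): 0.7, (2, 55): 0.6, (1, 3): 0.99}
true_contacts = {(0, 30), (10, 50), (1, 3)}
precision = top_contact_precision(pred, true_contacts, seq_len=60, k=20)
print(precision)
```

With `k=1` the function gives top-L precision and with `k=5` top-L/5 precision; the toy call uses `k=20` only to keep the example small.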
|
33
|
Junqueira Alves C, Silva Ladeira J, Hannah T, Pedroso Dias RJ, Zabala Capriles PV, Yotoko K, Zou H, Friedel RH. Evolution and Diversity of Semaphorins and Plexins in Choanoflagellates. Genome Biol Evol 2021; 13:6149127. [PMID: 33624753 PMCID: PMC8011033 DOI: 10.1093/gbe/evab035] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Accepted: 02/21/2021] [Indexed: 12/22/2022]
Abstract
Semaphorins and plexins are cell surface ligand/receptor proteins that affect cytoskeletal dynamics in metazoan cells. Interestingly, they are also present in Choanoflagellata, a class of unicellular heterotrophic flagellates that forms the phylogenetic sister group to Metazoa. Several members of choanoflagellates are capable of forming transient colonies, whereas others reside solitary inside exoskeletons; their molecular diversity is only beginning to emerge. Here, we surveyed genomics data from 22 choanoflagellate species and detected semaphorin/plexin pairs in 16 species. Choanoflagellate semaphorins (Sema-FN1) contain several domain features distinct from metazoan semaphorins, including an N-terminal Reeler domain that may facilitate dimer stabilization, an array of fibronectin type III domains, a variable serine/threonine-rich domain that is a potential site for O-linked glycosylation, and a SEA domain that can undergo autoproteolysis. In contrast, choanoflagellate plexins (Plexin-1) harbor a domain arrangement that is largely identical to metazoan plexins. Both Sema-FN1 and Plexin-1 also contain a short homologous motif near the C-terminus, likely associated with a shared function. Three-dimensional molecular models revealed a highly conserved structural architecture of choanoflagellate Plexin-1 as compared to metazoan plexins, including similar predicted conformational changes in a segment that is involved in the activation of the intracellular Ras-GAP domain. The absence of semaphorins and plexins in several choanoflagellate species did not appear to correlate with unicellular versus colonial lifestyle or ecological factors such as fresh versus salt water environment. Together, our findings support a conserved mechanism of semaphorin/plexin proteins in regulating cytoskeletal dynamics in unicellular and multicellular organisms.
Affiliation(s)
- Chrystian Junqueira Alves
- Friedman Brain Institute, Nash Family Department of Neuroscience and Department of Neurosurgery, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Júlia Silva Ladeira
- Programa de Pós-graduação em Modelagem Computacional, Universidade Federal de Juiz de Fora, Minas Gerais, Brazil
| | - Theodore Hannah
- Friedman Brain Institute, Nash Family Department of Neuroscience and Department of Neurosurgery, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Roberto J Pedroso Dias
- Departamento de Zoologia, Instituto de Ciências Biológicas, Universidade Federal de Juiz de Fora, Minas Gerais, Brazil
| | - Priscila V Zabala Capriles
- Programa de Pós-graduação em Modelagem Computacional, Universidade Federal de Juiz de Fora, Minas Gerais, Brazil
| | - Karla Yotoko
- Departamento de Biologia Geral, Universidade Federal de Viçosa, Minas Gerais, Brazil
| | - Hongyan Zou
- Friedman Brain Institute, Nash Family Department of Neuroscience and Department of Neurosurgery, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Roland H Friedel
- Friedman Brain Institute, Nash Family Department of Neuroscience and Department of Neurosurgery, Icahn School of Medicine at Mount Sinai, New York, New York
| |
|
34
|
Ibrahim MA, Ghani Khan MU, Mehmood F, Asim MN, Mahmood W. GHS-NET a generic hybridized shallow neural network for multi-label biomedical text classification. J Biomed Inform 2021; 116:103699. [PMID: 33601013 DOI: 10.1016/j.jbi.2021.103699] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 04/25/2020] [Revised: 11/30/2020] [Accepted: 02/02/2021] [Indexed: 01/16/2023]
Abstract
Exponential growth of biomedical literature and clinical data demands more robust yet precise computational methodologies to extract useful insights from biomedical literature and to perform accurate assignment of disease-specific codes. Such approaches can largely enhance the effectiveness of diverse biomedicine and bioinformatics applications. State-of-the-art computational biomedical text classification methodologies leverage either discriminative features extracted through convolution operations performed by a deep convolutional neural network or contextual information extracted by a recurrent neural network. However, none of these methodologies takes advantage of both convolutional and recurrent neural networks. Further, existing methodologies fail to perform well on biomedical text of different genres, such as biomedical literature or clinical notes. We, for the very first time, present a generic deep learning based hybrid multi-label classification methodology, namely GHS-NET, which can be utilized to accurately classify biomedical text of diverse genres. GHS-NET makes use of a convolutional neural network to extract the most discriminative features and a bi-directional Long Short-Term Memory to acquire contextual information. GHS-NET effectiveness is evaluated for extreme multi-label biomedical literature classification and assignment of ICD-9 codes to clinical notes. For the task of extreme multi-label biomedical literature classification, a performance comparison of GHS-NET and the state-of-the-art deep learning based methodology reveals that GHS-NET achieves increments of 1%, 6%, and 1% for the hallmarks of cancer dataset and 10%, 16%, and 11% for the chemical exposure dataset in terms of precision, recall, and F1-score.
For the task of clinical notes classification, GHS-NET outperforms the previous best deep learning based methodology on the Medical Information Mart for Intensive Care dataset (MIMIC-III) by significant margins of 6% and 8% in terms of recall and F1-score. GHS-NET is available as a web service at1 and can potentially be used to accurately classify multi-variate disease and chemical exposure specific text.
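For readers unfamiliar with how precision, recall, and F1-score are computed in a multi-label setting, the sketch below shows one common variant. It assumes micro-averaging over all document-label decisions; the abstract does not state which averaging scheme the authors used, and the function name and toy labels are hypothetical.

```python
def micro_precision_recall_f1(true_sets, pred_sets):
    """Micro-averaged precision, recall and F1 for multi-label predictions.

    true_sets, pred_sets: parallel lists of label sets, one per document.
    """
    # Pool true positives, false positives and false negatives over documents.
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Two hypothetical documents with gold and predicted label sets.
gold = [{"a", "b"}, {"c"}]
predicted = [{"a"}, {"c", "d"}]
precision, recall, f1 = micro_precision_recall_f1(gold, predicted)
print(precision, recall, f1)
```

Macro-averaging (computing the metrics per label and then averaging) would weight rare labels more heavily, which matters for extreme multi-label tasks with long-tailed label distributions.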
Affiliation(s)
- Muhammad Ali Ibrahim
- Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan; German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
| | - Muhammad Usman Ghani Khan
- Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan; Department of Computer Science, University of Engineering and Technology (UET), Lahore, Pakistan
| | - Faiza Mehmood
- Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan
| | - Muhammad Nabeel Asim
- Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan; German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany.
| | - Waqar Mahmood
- Intelligent Criminology Research Lab, National Center of Artificial Intelligence, Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan
| |
|
35
|
Sharma A, Kumar R, Ranjta S, Varadwaj PK. SMILES to Smell: Decoding the Structure-Odor Relationship of Chemical Compounds Using the Deep Neural Network Approach. J Chem Inf Model 2021; 61:676-688. [PMID: 33449694 DOI: 10.1021/acs.jcim.0c01288] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Indexed: 11/29/2022]
Abstract
Finding the relationship between the structure of an odorant molecule and its associated smell has always been an extremely challenging task. The major limitation in establishing the structure-odor relation is the vague and ambiguous nature of the descriptor-labeling, especially when the sources of odorant molecules are different. With the advent of deep networks, data-driven approaches have been shown to achieve more accurate linkages between the chemical structure and its smell. In this study, the deep neural network (DNN) with physicochemical properties and molecular fingerprints (PPMF) and the convolutional neural network (CNN) with chemical-structure images (IMG) are developed to predict the smells of chemicals using their SMILES notations. A data set of 5185 chemical compounds with 104 smell percepts was used to develop the multilabel prediction models. The accuracies of smell prediction from DNN + PPMF and CNN + IMG (Xception based) were found to be 97.3 and 98.3%, respectively, when applied on an independent test set of chemicals. The deep learning architecture combining both DNN + PPMF and CNN + IMG prediction models is proposed, which classifies smells and may help understand the generic mechanism underlying the relationship between chemical structure and smell perception.
Affiliation(s)
- Anju Sharma
- Department of Applied Science, Indian Institute of Information Technology, Allahabad 211012, Uttar Pradesh, India; Amity Institute of Biotechnology, Amity University Uttar Pradesh, Lucknow Campus 226010, Uttar Pradesh, India
| | - Rajnish Kumar
- Amity Institute of Biotechnology, Amity University Uttar Pradesh, Lucknow Campus 226010, Uttar Pradesh, India
| | - Shabnam Ranjta
- Department of Chemistry, SGGS College, Chandigarh 160019, India
| | - Pritish Kumar Varadwaj
- Department of Applied Science, Indian Institute of Information Technology, Allahabad 211012, Uttar Pradesh, India
| |
|
36
|
Rodrigues JF, Florea L, de Oliveira MCF, Diamond D, Oliveira ON. Big data and machine learning for materials science. DISCOVER MATERIALS 2021; 1:12. [PMID: 33899049 PMCID: PMC8054236 DOI: 10.1007/s43939-021-00012-0] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 02/09/2021] [Accepted: 04/01/2021] [Indexed: 05/11/2023]
Abstract
Herein, we review aspects of leading-edge research and innovation in materials science that exploit big data and machine learning (ML), two computer science concepts that combine to yield computational intelligence. ML can accelerate the solution of intricate chemical problems and even solve problems that otherwise would not be tractable. However, the potential benefits of ML come at the cost of big data production; that is, the algorithms demand large volumes of data of various natures and from different sources, from material properties to sensor data. In the survey, we propose a roadmap for future developments with emphasis on computer-aided discovery of new materials and analysis of chemical sensing compounds, both prominent research fields for ML in the context of materials science. In addition to providing an overview of recent advances, we elaborate upon the conceptual and practical limitations of big data and ML applied to materials science, outlining processes, discussing pitfalls, and reviewing cases of success and failure.
Affiliation(s)
- Jose F. Rodrigues
- Institute of Mathematical Sciences and Computing, University of São Paulo (USP), São Carlos, SP Brazil
| | - Larisa Florea
- SFI Research Centre for Advanced Materials and BioEngineering Research Trinity College Dublin, The University of Dublin, Dublin, Ireland
| | - Maria C. F. de Oliveira
- Institute of Mathematical Sciences and Computing, University of São Paulo (USP), São Carlos, SP Brazil
| | - Dermot Diamond
- Insight Centre for Data Analytics, National Centre for Sensor Research, Dublin City University, Dublin 9, Dublin, Ireland
| | - Osvaldo N. Oliveira
- São Carlos Institute of Physics, University of São Paulo (USP), São Carlos, SP Brazil
| |
|
37
|
Gao W, Mahajan SP, Sulam J, Gray JJ. Deep Learning in Protein Structural Modeling and Design. PATTERNS (NEW YORK, N.Y.) 2020; 1:100142. [PMID: 33336200 PMCID: PMC7733882 DOI: 10.1016/j.patter.2020.100142] [Citation(s) in RCA: 100] [Impact Index Per Article: 20.0] [Indexed: 12/13/2022]
Abstract
Deep learning is catalyzing a scientific revolution fueled by big data, accessible toolkits, and powerful computational resources, impacting many fields, including protein structural modeling. Protein structural modeling, such as predicting structure from amino acid sequence and evolutionary information, designing proteins toward desirable functionality, or predicting properties or behavior of a protein, is critical to understand and engineer biological systems at the molecular level. In this review, we summarize the recent advances in applying deep learning techniques to tackle problems in protein structural modeling and design. We dissect the emerging approaches using deep learning techniques for protein structural modeling and discuss advances and challenges that must be addressed. We argue for the central importance of structure, following the "sequence → structure → function" paradigm. This review is directed to help both computational biologists to gain familiarity with the deep learning methods applied in protein modeling, and computer scientists to gain perspective on the biologically meaningful problems that may benefit from deep learning techniques.
Affiliation(s)
- Wenhao Gao
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Sai Pooja Mahajan
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Jeremias Sulam
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Jeffrey J. Gray
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| |
|
38
|
Alam T, Al-Absi HRH, Schmeier S. Deep Learning in LncRNAome: Contribution, Challenges, and Perspectives. Noncoding RNA 2020; 6:E47. [PMID: 33266128 PMCID: PMC7711891 DOI: 10.3390/ncrna6040047] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/25/2020] [Revised: 10/27/2020] [Accepted: 11/06/2020] [Indexed: 12/11/2022]
Abstract
Long non-coding RNAs (lncRNA), the pervasively transcribed part of the mammalian genome, have played a significant role in changing our protein-centric view of genomes. The abundance of lncRNAs and their diverse roles across cell types have opened numerous avenues for the research community regarding lncRNAome. To discover and understand lncRNAome, many sophisticated computational techniques have been leveraged. Recently, deep learning (DL)-based modeling techniques have been successfully used in genomics due to their capacity to handle large amounts of data and produce relatively better results than traditional machine learning (ML) models. DL-based modeling techniques have now become a choice for many modeling tasks in the field of lncRNAome as well. In this review article, we summarized the contribution of DL-based methods in nine different lncRNAome research areas. We also outlined DL-based techniques leveraged in lncRNAome, highlighting the challenges computational scientists face while developing DL-based models for lncRNAome. To the best of our knowledge, this is the first review article that summarizes the role of DL-based techniques in multiple areas of lncRNAome.
Affiliation(s)
- Tanvir Alam
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar;
| | - Hamada R. H. Al-Absi
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar;
| | - Sebastian Schmeier
- School of Natural and Computational Sciences, Massey University, Auckland 0632, New Zealand;
| |
|
39
|
Liu GH, Zhang BW, Qian G, Wang B, Mao B, Bichindaritz I. Bioimage-Based Prediction of Protein Subcellular Location in Human Tissue with Ensemble Features and Deep Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1966-1980. [PMID: 31107658 DOI: 10.1109/tcbb.2019.2917429] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 06/09/2023]
Abstract
Prediction of protein subcellular location has become a hot topic because it has proven useful for understanding disease mechanisms and for novel drug design. With the rapid development of automated microscopic imaging technology in recent years, bioimage-based classification methods for protein subcellular location have attracted considerable attention, because images can describe the protein distribution intuitively and in detail. In the current study, a prediction method for protein subcellular location was proposed based on multi-view image features extracted from three different views: the four texture features of the original image, the global and local features of the protein extracted from the protein channel images after color segmentation, and the global features of DNA extracted from the DNA channel image. The extracted features were then combined to improve the performance of subcellular localization prediction. By comparing the performance of different feature combinations under the same classifier, the best ensemble features were obtained. In this work, a classifier based on stacked auto-encoders and random forest was also put forward; to improve the prediction results, the deep network was combined with traditional statistical classification methods. Stringent cross-validation and independent validation tests on the benchmark dataset demonstrated the efficacy of the proposed method.
|
40
|
Liu J, Zhou XG, Zhang Y, Zhang GJ. CGLFold: a contact-assisted de novo protein structure prediction using global exploration and loop perturbation sampling algorithm. Bioinformatics 2020; 36:2443-2450. [PMID: 31860059 DOI: 10.1093/bioinformatics/btz943] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 07/03/2019] [Revised: 12/10/2019] [Accepted: 12/18/2019] [Indexed: 12/27/2022] Open
Abstract
MOTIVATION: Regions that connect secondary structure elements in a protein are known as loops, and a slight change in a loop can have a dramatic effect on the entire topology. This study investigates whether the accuracy of protein structure prediction can be improved using a loop-specific sampling strategy. RESULTS: A novel de novo protein structure prediction method that combines global exploration and loop perturbation is proposed in this study. In the global exploration phase, fragment recombination and assembly are used to explore the massive conformational space and generate native-like topology. In the loop perturbation phase, a loop-specific local perturbation model is designed to improve the accuracy of the conformation and is solved by a differential evolution algorithm. These two phases enable cooperation between global exploration and local exploitation. The filtered contact information is used to construct the conformation selection model for guiding the sampling. The proposed CGLFold is tested on 145 benchmark proteins, 14 free modeling (FM) targets of CASP13 and 29 FM targets of CASP12. The experimental results show that the loop-specific local perturbation can increase the structure diversity and success rate of conformational update and gradually improve conformation accuracy. CGLFold obtains template modeling score ≥ 0.5 models on 95 standard test proteins, 7 FM targets of CASP13 and 9 FM targets of CASP12. AVAILABILITY AND IMPLEMENTATION: The source code and executable versions are freely available at https://github.com/iobio-zjut/CGLFold. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jun Liu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Xiao-Gen Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Gui-Jun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
|
41
|
Lopez-Del Rio A, Martin M, Perera-Lluna A, Saidi R. Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction. Sci Rep 2020; 10:14634. [PMID: 32884053 PMCID: PMC7471694 DOI: 10.1038/s41598-020-71450-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 03/09/2020] [Accepted: 08/06/2020] [Indexed: 11/08/2022] Open
Abstract
The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. This scheme requires handling proteins of different lengths, while deep learning models require same-shape input. To accomplish this, zeros are usually added to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is not yet known. We propose and implement four novel types of padding for amino acid sequences. We then analysed the impact of the different ways of padding amino acid sequences in a hierarchical Enzyme Commission number prediction problem. Results show that padding has an effect on model performance even when convolutional layers are involved. In contrast to most deep learning works, which focus mainly on architectures, this study highlights the relevance of the often-overlooked process of padding and raises awareness of the need to refine it for better performance. The code of this analysis is publicly available at https://github.com/b2slab/padding_benchmark.
Affiliation(s)
- Angela Lopez-Del Rio
- B2SLab, Department d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028, Barcelona, Spain
- Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Dèu, 08950, Esplugues de Llobregat, Spain
- Maria Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
- Alexandre Perera-Lluna
- B2SLab, Department d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028, Barcelona, Spain
- Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Dèu, 08950, Esplugues de Llobregat, Spain
- Rabie Saidi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
|
42
|
Sun D, Gong X. Tetramer protein complex interface residue pairs prediction with LSTM combined with graph representations. Biochimica et Biophysica Acta - Proteins and Proteomics 2020; 1868:140504. [PMID: 32717382 DOI: 10.1016/j.bbapap.2020.140504] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 03/28/2020] [Revised: 06/30/2020] [Accepted: 07/16/2020] [Indexed: 10/23/2022]
Abstract
MOTIVATION: Protein-protein interactions are important for many biological processes. A theoretical understanding of the structural factors that determine interaction sites helps explain the underlying mechanism of protein-protein interactions, and advanced mathematical methods that correctly predict interaction sites are therefore useful. Although some previous studies have addressed the interaction interface of protein monomers and the interface residues between chains of protein dimers, very few have addressed interface residue prediction for protein multimers, including trimers, tetramers and larger complexes with even more monomers. A large number of proteins function as multibody protein complexes, and the structural complexity of protein multimers makes interface residue prediction difficult. We therefore aim to build a method for predicting protein tetramer interface residue pairs. RESULTS: Here, we developed a new deep network that combines an LSTM network with graph representations to predict interface residue pairs in protein tetramers. Protein structure data differ from image or video data, which are well-arranged matrices (the Euclidean structure mentioned in many studies). Because non-Euclidean structure data cannot keep translation invariance, and we want to extract spatial features from this kind of data for deep learning, we developed an algorithm that predicts the interface residue pairs of protein interactions from a topological graph, which builds relationships between vertices and edges as in graph theory, combined with a multilayer Long Short-Term Memory network. First, we select training and test samples from the Protein Data Bank and extract the physicochemical property features and the geometric features of surface residues associated with interfacial properties. Subsequently, we transform the protein multimer data into topological graphs and predict protein interaction interface residue pairs using the model. In addition, different types of evaluation indicators verified its validity.
Affiliation(s)
- Daiwen Sun
- Mathematics Intelligence Application LAB, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, PR China
- Xinqi Gong
- Mathematics Intelligence Application LAB, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, PR China; Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing 100091, PR China
|
43
|
Xie R, Li J, Wang J, Dai W, Leier A, Marquez-Lago TT, Akutsu T, Lithgow T, Song J, Zhang Y. DeepVF: a deep learning-based hybrid framework for identifying virulence factors using the stacking strategy. Brief Bioinform 2020; 22:5864586. [PMID: 32599617 DOI: 10.1093/bib/bbaa125] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 10/18/2019] [Revised: 05/22/2020] [Accepted: 05/22/2020] [Indexed: 12/14/2022] Open
Abstract
Virulence factors (VFs) enable pathogens to infect their hosts. A wealth of individual, disease-focused studies has identified a wide variety of VFs, and the growing mass of bacterial genome sequence data provides an opportunity for computational methods aimed at predicting VFs. Despite their attractive advantages and performance improvements, the existing methods have some limitations and drawbacks. Firstly, as the characteristics and mechanisms of VFs are continually evolving with the emergence of antibiotic resistance, it is more and more difficult to identify novel VFs using existing tools that were previously developed based on the outdated data sets; secondly, few systematic feature engineering efforts have been made to examine the utility of different types of features for model performances, as the majority of tools only focused on extracting very few types of features. By addressing the aforementioned issues, the accuracy of VF predictors can likely be significantly improved. This, in turn, would be particularly useful in the context of genome wide predictions of VFs. In this work, we present a deep learning (DL)-based hybrid framework (termed DeepVF) that is utilizing the stacking strategy to achieve more accurate identification of VFs. Using an enlarged, up-to-date dataset, DeepVF comprehensively explores a wide range of heterogeneous features with popular machine learning algorithms. Specifically, four classical algorithms, including random forest, support vector machines, extreme gradient boosting and multilayer perceptron, and three DL algorithms, including convolutional neural networks, long short-term memory networks and deep neural networks are employed to train 62 baseline models using these features. In order to integrate their individual strengths, DeepVF effectively combines these baseline models to construct the final meta model using the stacking strategy. 
Extensive benchmarking experiments demonstrate the effectiveness of DeepVF: it achieves a more accurate and stable performance compared with baseline models on the benchmark dataset and clearly outperforms state-of-the-art VF predictors on the independent test. Using the proposed hybrid ensemble model, a user-friendly online predictor of DeepVF (http://deepvf.erc.monash.edu/) is implemented. Furthermore, its utility, from the user's viewpoint, is compared with that of existing toolkits. We believe that DeepVF will be exploited as a useful tool for screening and identifying potential VFs from protein-coding gene sequences in bacterial genomes.
Affiliation(s)
- Ruopeng Xie
- Bioinformatics Lab at Guilin University of Electronic Technology
- Jiahui Li
- Bioinformatics Lab at Guilin University of Electronic Technology
- Jiawei Wang
- Biomedicine Discovery Institute and the Department of Microbiology at Monash University, Australia
- Wei Dai
- School of Computer Science and Information Security, Guilin University of Electronic Technology, China
- André Leier
- Department of Genetics and the Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham (UAB) School of Medicine, USA
- Tatiana T Marquez-Lago
- Department of Genetics and the Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham (UAB) School of Medicine, USA
- Trevor Lithgow
- Biomedicine Discovery Institute and the Director of the Centre to Impact AMR at Monash University, Australia
- Jiangning Song
- Group Leader in the Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
- Yanju Zhang
- Leiden Institute of Advanced Computer Science, Leiden University
|
44
|
Getting to Know Your Neighbor: Protein Structure Prediction Comes of Age with Contextual Machine Learning. J Comput Biol 2020; 27:796-814. [DOI: 10.1089/cmb.2019.0193] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 12/12/2022] Open
|
45
|
Zhang GJ, Ma LF, Wang XQ, Zhou XG. Secondary Structure and Contact Guided Differential Evolution for Protein Structure Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1068-1081. [PMID: 30295627 DOI: 10.1109/tcbb.2018.2873691] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Indexed: 06/08/2023]
Abstract
Ab initio protein tertiary structure prediction is one of the long-standing problems in structural bioinformatics. With the help of residue-residue contact and secondary structure prediction information, the accuracy of ab initio structure prediction can be enhanced. In this study, an improved differential evolution with secondary structure and residue-residue contact information referred to as SCDE is proposed for protein structure prediction. In SCDE, two score models based on secondary structure and contact information are proposed, and two selection strategies, namely, secondary structure-based selection strategy and contact-based selection strategy, are designed to guide conformation space search. A probability distribution function is designed to balance these two selection strategies. Experimental results on a benchmark dataset with 28 proteins and four free model targets in CASP12 demonstrate that the proposed SCDE is effective and efficient.
|
46
|
Protein Contact Map Prediction Based on ResNet and DenseNet. BioMed Research International 2020; 2020:7584968. [PMID: 32337273 PMCID: PMC7165324 DOI: 10.1155/2020/7584968] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 02/04/2020] [Accepted: 03/05/2020] [Indexed: 11/18/2022]
Abstract
Residue-residue contact prediction has become an increasingly important tool for modeling the three-dimensional structure of a protein when no homologous structure is available. Ultradeep residual neural network (ResNet) has become the most popular method for making contact predictions because it captures the contextual information between residues. In this paper, we propose a novel deep neural network framework for contact prediction which combines ResNet and DenseNet. This framework uses 1D ResNet to process sequential features, and besides PSSM, SS3, and solvent accessibility, we have introduced a new feature, position-specific frequency matrix (PSFM), as an input. Using ResNet's residual module and identity mapping, it can effectively process sequential features after which the outer concatenation function is used for sequential and pairwise features. Prediction accuracy is improved following a final processing step using the dense connection of DenseNet. The prediction accuracy of the protein contact map shows that our method is more effective than other popular methods due to the new network architecture and the added feature input.
|
47
|
Fang C, Jia Y, Hu L, Lu Y, Wang H. IMPContact: An Interhelical Residue Contact Prediction Method. BioMed Research International 2020; 2020:4569037. [PMID: 32309431 PMCID: PMC7140131 DOI: 10.1155/2020/4569037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2020] [Accepted: 03/09/2020] [Indexed: 11/17/2022]
Abstract
As an important category of proteins, alpha-helix transmembrane proteins (αTMPs) play an important role in various biological activities. Because few αTMP structures have been solved, predicting the residue contacts among the transmembrane segments of an αTMP helps reveal the basis of the protein fold, which can in turn be used to discover more protein functions. A few efforts have been devoted to predicting interhelical residue contacts using machine learning methods based on prior knowledge of transmembrane protein structure. However, improving the prediction accuracy remains a challenge, and deep learning provides an opportunity to use this structural knowledge in a different way. For this purpose, we proposed a novel αTMP residue-residue contact prediction method, IMPContact, in which a convolutional neural network (CNN) was applied to recognize the interhelical contacts in a TMP using its specific structural features. Four sequence-based TMP-specific features were selected to describe a pair of residues, namely, evolutionary covariation, predicted topology structure, residue relative position, and evolutionary conservation. An up-to-date dataset was used to train and test IMPContact; our method achieved better performance compared to peer methods. In the case studies, interhelical residue contacts (IHRCs) in the regular transmembrane helices were better predicted than those in the irregular ones.
Affiliation(s)
- Chao Fang
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Yajie Jia
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
- Lihong Hu
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Yinghua Lu
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Department of Computer Science, College of Humanities & Sciences of Northeast Normal University, Changchun 130117, China
- Han Wang
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
- Department of Computer Science, College of Humanities & Sciences of Northeast Normal University, Changchun 130117, China
|
48
|
Smolarczyk T, Roterman-Konieczna I, Stapor K. Protein Secondary Structure Prediction: A Review of Progress and Directions. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191017104639] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Indexed: 11/22/2022]
Abstract
Background: Over the last few decades, a search for the theory of protein folding has grown into a full-fledged research field at the intersection of biology, chemistry and informatics. Despite enormous effort, there are still open questions and challenges, like understanding the rules by which amino acid sequence determines protein secondary structure.
Objective: In this review, we depict the progress of the prediction methods over the years and identify sources of improvement.
Methods: The protein secondary structure prediction problem is described followed by the discussion on theoretical limitations, description of the commonly used data sets, features and a review of three generations of methods with the focus on the most recent advances. Additionally, methods with available online servers are assessed on the independent data set.
Results: The state-of-the-art methods are currently reaching almost 88% for 3-class prediction and 76.5% for an 8-class prediction.
Conclusion: This review summarizes recent advances and outlines further research directions.
Affiliation(s)
- Tomasz Smolarczyk
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
- Irena Roterman-Konieczna
- Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Krakow, Poland
- Katarzyna Stapor
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
|
49
|
Torrisi M, Pollastri G, Le Q. Deep learning methods in protein structure prediction. Comput Struct Biotechnol J 2020; 18:1301-1310. [PMID: 32612753 PMCID: PMC7305407 DOI: 10.1016/j.csbj.2019.12.011] [Citation(s) in RCA: 132] [Impact Index Per Article: 26.4] [Received: 10/15/2019] [Revised: 12/19/2019] [Accepted: 12/20/2019] [Indexed: 01/01/2023] Open
Abstract
Protein Structure Prediction is a central topic in Structural Bioinformatics. Since the '60s statistical methods, followed by increasingly complex Machine Learning and recently Deep Learning methods, have been employed to predict protein structural information at various levels of detail. In this review, we briefly introduce the problem of protein structure prediction and essential elements of Deep Learning (such as Convolutional Neural Networks, Recurrent Neural Networks and basic feed-forward Neural Networks they are founded on), after which we discuss the evolution of predictive methods for one-dimensional and two-dimensional Protein Structure Annotations, from the simple statistical methods of the early days, to the computationally intensive highly-sophisticated Deep Learning algorithms of the last decade. In the process, we review the growth of the databases these algorithms are based on, and how this has impacted our ability to leverage knowledge about evolution and co-evolution to achieve improved predictions. We conclude this review outlining the current role of Deep Learning techniques within the wider pipelines to predict protein structures and trying to anticipate what challenges and opportunities may arise next.
Affiliation(s)
- Mirko Torrisi
- School of Computer Science, University College Dublin, Ireland
- Quan Le
- Centre for Applied Data Analytics Research, University College Dublin, Ireland
|
50
|
Xu G, Wang Q, Ma J. OPUS-Refine: A Fast Sampling-Based Framework for Refining Protein Backbone Torsion Angles and Global Conformation. J Chem Theory Comput 2020; 16:1359-1366. [DOI: 10.1021/acs.jctc.9b01054] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 12/22/2022]
Affiliation(s)
- Gang Xu
- Multiscale Research Institute of Complex Systems, Fudan University, Shanghai 200433, China
- Qinghua Wang
- Verna and Marrs Mclean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, One Baylor Plaza, BCM-125, Houston, Texas 77030, United States
- Jianpeng Ma
- Multiscale Research Institute of Complex Systems, Fudan University, Shanghai 200433, China
- Verna and Marrs Mclean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, One Baylor Plaza, BCM-125, Houston, Texas 77030, United States
- Department of Bioengineering, Rice University, Houston, Texas 77005, United States
|