51
ActTRANS: Functional classification in active transport proteins based on transfer learning and contextual representations. Comput Biol Chem 2021; 93:107537. [PMID: 34217007 DOI: 10.1016/j.compbiolchem.2021.107537] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/14/2020] [Revised: 05/09/2021] [Accepted: 06/26/2021] [Indexed: 01/08/2023]
Abstract
MOTIVATION: Primary and secondary active transport are two types of active transport that use energy to move substances. Active transport mechanisms rely on proteins to assist in transport, and these proteins play essential roles in regulating the traffic of ions or small molecules across a cell membrane against the concentration gradient. In this study, the two main types of proteins involved in such transport are classified from transmembrane transport proteins. We propose a Support Vector Machine (SVM) with contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) to represent protein sequences. BERT, a deep learning language representation model developed by Google, is a powerful model for transfer learning and one of the highest-performing pre-trained models for Natural Language Processing (NLP) tasks. Transfer learning with a pre-trained BERT model is applied to extract fixed feature vectors from the hidden layers and learn contextual relations between amino acids in the protein sequence. The contextualized word representations of proteins are thus introduced to effectively model complex structures of amino acids in the sequence and the variations of these amino acids in context. By generating context information, we capture multiple meanings for the same amino acid to reveal the importance of specific residues in the protein sequence. RESULTS: The performance of the proposed method is evaluated using five-fold cross-validation and an independent test. The proposed method achieves accuracies of 85.44%, 88.74%, and 92.84% for Class-1, Class-2, and Class-3, respectively. Experimental results show that, by using context information, this approach can outperform other feature extraction methods, effectively classify the two types of active transport, and improve overall performance.
52
Littmann M, Bordin N, Heinzinger M, Schütze K, Dallago C, Orengo C, Rost B. Clustering FunFams using sequence embeddings improves EC purity. Bioinformatics 2021; 37:3449-3455. [PMID: 33978744 PMCID: PMC8545299 DOI: 10.1093/bioinformatics/btab371] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 01/25/2021] [Revised: 04/02/2021] [Accepted: 05/11/2021] [Indexed: 12/05/2022] Open
Abstract
Motivation: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be ‘pure’, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations. Results: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes. Availability and implementation: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Maria Littmann
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Michael Heinzinger
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Konstantin Schütze
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
- Christian Dallago
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Burkhard Rost
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany & TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
53
Sharma R, Shrivastava S, Kumar Singh S, Kumar A, Saxena S, Kumar Singh R. Deep-ABPpred: identifying antibacterial peptides in protein sequences using bidirectional LSTM with word2vec. Brief Bioinform 2021; 22:6204762. [PMID: 33784381 DOI: 10.1093/bib/bbab065] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Received: 11/23/2020] [Revised: 02/04/2021] [Indexed: 12/13/2022] Open
Abstract
The overuse of antibiotics has led to the emergence of antimicrobial resistance, and as a result, antibacterial peptides (ABPs) are receiving significant attention as an alternative. Identifying effective ABPs from natural sources in the lab is a cost-intensive and time-consuming process. Therefore, there is a need for in silico models that can identify novel ABPs in protein sequences for chemical synthesis and testing. In this study, we propose a deep learning classifier named Deep-ABPpred that can identify ABPs in protein sequences. We developed Deep-ABPpred using a bidirectional long short-term memory algorithm with amino acid level features from word2vec. The results show that Deep-ABPpred outperforms other state-of-the-art ABP classifiers on both test and independent datasets. Our proposed model achieved a precision of approximately 97% and 94% on the test dataset and the independent dataset, respectively. The high precision suggests the applicability of Deep-ABPpred in proposing novel ABPs for synthesis and experimentation. Using Deep-ABPpred, we identified ABPs in the tail protein sequences of Streptococcus bacteriophages, chemically synthesized the identified peptides in the lab, and tested their activity in vitro. These ABPs showed potent antibacterial activity against selected Gram-positive and Gram-negative bacteria, which confirms the capability of Deep-ABPpred in identifying novel ABPs in protein sequences. Based on the proposed approach, an online prediction server was also developed, which is freely accessible at https://abppred.anvil.app/. This web server takes a protein sequence as input and provides ABPs with high probability (>0.95) as output.
Affiliation(s)
- Ritesh Sharma
  - Department of Computer Science and Engineering at IIT (BHU), Varanasi, India
- Sanjay Kumar Singh
  - Department of Computer Science and Engineering at IIT (BHU), Varanasi, India
- Abhinav Kumar
  - Department of Computer Science and Engineering at IIT (BHU), Varanasi, India
- Sonal Saxena
  - Division of Veterinary Biotechnology, IVRI, Izatnagar, India
54
Khanal J, Tayara H, Zou Q, Chong KT. Identifying DNA N4-methylcytosine sites in the rosaceae genome with a deep learning model relying on distributed feature representation. Comput Struct Biotechnol J 2021; 19:1612-1619. [PMID: 33868598 PMCID: PMC8042287 DOI: 10.1016/j.csbj.2021.03.015] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 01/09/2021] [Revised: 03/12/2021] [Accepted: 03/13/2021] [Indexed: 12/11/2022] Open
Abstract
DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focused on finding hand-crafted features. This area of research, therefore, would benefit from the development of a computational approach that relies on automatic feature selection to identify relevant sites. We here report 4mC-w2vec, a computational method that learned automatic feature discrimination in the Rosaceae genomes, especially in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation and through the word embedding technique ‘word2vec’. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processed 4mC and non-4mC sites through a word embedding process, including sub-word information of its biological words through k-mers, which then served as features that were fed into a two-layer convolutional neural network (CNN) to classify whether the sample sequences contained 4mC or non-4mC sites. Our tool demonstrated performance superior to current tools that use the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.
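As an illustration of the "biological words" idea in this abstract, the sketch below splits a DNA sequence into overlapping k-mers, the kind of token list that could be passed to a word-embedding trainer. It is a minimal example, not the authors' pipeline: the function name `kmer_tokens`, the default `k=3`, and the `stride` parameter are assumptions made for illustration.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words'.

    Hypothetical helper for illustration; k and stride are assumed values.
    Sequences shorter than k yield an empty list.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Each sequence becomes a "sentence" of k-mer words.
tokens = kmer_tokens("ACGTAC", k=3)
print(tokens)  # ['ACG', 'CGT', 'GTA', 'TAC']
```

A list of such token lists (one per sequence) is the usual input shape for word2vec-style training; the choice of k and stride is a tunable design decision.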
Affiliation(s)
- Jhabindra Khanal
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
55
Auslander N, Gussow AB, Koonin EV. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int J Mol Sci 2021; 22:2903. [PMID: 33809353 PMCID: PMC8000113 DOI: 10.3390/ijms22062903] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 02/15/2021] [Revised: 03/08/2021] [Accepted: 03/10/2021] [Indexed: 12/23/2022] Open
Abstract
The exponential growth of biomedical data in recent years has urged the application of numerous machine learning techniques to address emerging problems in biology and clinical research. By enabling the automatic feature extraction, selection, and generation of predictive models, these methods can be used to efficiently study complex biological systems. Machine learning techniques are frequently integrated with bioinformatic methods, as well as curated databases and biological networks, to enhance training and validation, identify the best interpretable features, and enable feature and model investigation. Here, we review recently developed methods that incorporate machine learning within the same framework with techniques from molecular evolution, protein structure analysis, systems biology, and disease genomics. We outline the challenges posed for machine learning, and, in particular, deep learning in biomedicine, and suggest unique opportunities for machine learning techniques integrated with established bioinformatics approaches to overcome some of these challenges.
Affiliation(s)
- Eugene V. Koonin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
56
Affiliation(s)
- Annapaula Correia
  - Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK; Department of Veterinary Medicine, University of Cambridge, Cambridge, UK; Department of Zoology, University of Oxford, Oxford, UK
- Aaron Weimann
  - Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK; Molecular Immunity Unit, Department of Medicine, University of Cambridge, Cambridge, UK
57
Charoenkwan P, Nantasenamat C, Hasan MM, Manavalan B, Shoombuatong W. BERT4Bitter: a bidirectional encoder representations from transformers (BERT)-based model for improving the prediction of bitter peptides. Bioinformatics 2021; 37:2556-2562. [PMID: 33638635 DOI: 10.1093/bioinformatics/btab133] [Citation(s) in RCA: 102] [Impact Index Per Article: 25.5] [Received: 12/29/2020] [Revised: 02/08/2021] [Accepted: 02/24/2021] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION: The identification of bitter peptides through experimental approaches is an expensive and time-consuming endeavor. Due to the huge number of newly available peptide sequences in the post-genomic era, the development of automated computational models for the identification of novel bitter peptides is highly desirable. RESULTS: In this work, we present BERT4Bitter, a bidirectional encoder representation from transformers (BERT)-based model for predicting bitter peptides directly from their amino acid sequence without using any structural information. To the best of our knowledge, this is the first time a BERT-based model has been employed to identify bitter peptides. Compared to widely used machine learning models, BERT4Bitter achieved the best performance with accuracy of 0.861 and 0.922 for cross-validation and independent tests, respectively. Furthermore, extensive empirical benchmarking experiments on the independent dataset demonstrated that BERT4Bitter clearly outperformed the existing method with improvements of >8% in accuracy and >16% in Matthews correlation coefficient, highlighting the effectiveness and robustness of BERT4Bitter. We believe that the BERT4Bitter method proposed herein will be a useful tool for rapidly screening and identifying novel bitter peptides for drug development and nutritional research. AVAILABILITY: The user-friendly web server of the proposed BERT4Bitter is freely accessible at: http://pmlab.pythonanywhere.com/BERT4Bitter. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Phasit Charoenkwan
  - Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
- Chanin Nantasenamat
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Md Mehedi Hasan
  - Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
  - Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA
- Watshara Shoombuatong
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
58
Ali Shah SM, Taju SW, Ho QT, Nguyen TTD, Ou YY. GT-Finder: Classify the family of glucose transporters with pre-trained BERT language models. Comput Biol Med 2021; 131:104259. [PMID: 33581474 DOI: 10.1016/j.compbiomed.2021.104259] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/04/2020] [Revised: 02/04/2021] [Accepted: 02/04/2021] [Indexed: 12/14/2022]
Abstract
Recently, language representation models have drawn a lot of attention in the field of natural language processing (NLP) due to their remarkable results. Among them, BERT (Bidirectional Encoder Representations from Transformers) has proven to be a simple, yet powerful language model that has achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embeddings to capture the semantics and context in which words appear. We utilized pre-trained BERT models to extract features from protein sequences for discriminating three families of glucose transporters: the major facilitator superfamily of glucose transporters (GLUTs), the sodium-glucose linked transporters (SGLTs), and the sugars will eventually be exported transporters (SWEETs). We treated protein sequences as sentences and transformed them into fixed-length meaningful vectors where a 768- or 1024-dimensional vector represents each amino acid. We observed that BERT-Base and BERT-Large models improved the performance by more than 4% in terms of average sensitivity and Matthews correlation coefficient (MCC), indicating the efficiency of this approach. We also developed a bidirectional transformer-based protein model (TransportersBERT) for comparison with existing pre-trained BERT models.
Affiliation(s)
- Syed Muazzam Ali Shah
  - Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan
- Semmy Wellem Taju
  - Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan
- Quang-Thai Ho
  - Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan
- Yu-Yen Ou
  - Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan
59
Wahab A, Tayara H, Xuan Z, Chong KT. DNA sequences performs as natural language processing by exploiting deep learning algorithm for the identification of N4-methylcytosine. Sci Rep 2021; 11:212. [PMID: 33420191 PMCID: PMC7794489 DOI: 10.1038/s41598-020-80430-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 10/01/2020] [Accepted: 12/14/2020] [Indexed: 12/17/2022] Open
Abstract
N4-methylcytosine (4mC) is a biochemical modification of DNA that affects genetic operations such as gene expression, genomic imprinting, chromosome stability, and cell development without changing the DNA nucleotides themselves. In the proposed work, a computational model, 4mCNLP-Deep, uses a word embedding approach as the vector formulation and a deep learning based CNN algorithm to predict 4mC and non-4mC sites on the C. elegans genome dataset. A range of experimental settings, such as the k-mer corpus and the number of cross-validation folds, was explored to find the best-performing configuration. 4mCNLP-Deep outperforms the state-of-the-art predictor on five evaluation metrics: Accuracy (ACC) of 0.9354, Matthews correlation coefficient (MCC) of 0.8608, Specificity (Sp) of 0.89.96, Sensitivity (Sn) of 0.9563, and Area under curve (AUC) of 0.9731, obtained using a 3-mer corpus with word2vec and 3-fold cross-validation, with increments of 1.1%, 0.6%, 0.58%, 0.77%, and 4.89%, respectively. Finally, we developed an online web server, http://nsclbio.jbnu.ac.kr/tools/4mCNLP-Deep/, so that experimental researchers can obtain results easily.
Affiliation(s)
- Abdul Wahab
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju, 54896, South Korea
- Zhenyu Xuan
  - Department of Biological Sciences, The University of Texas at Dallas, Richardson, 75080, USA
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju, 54896, South Korea
60
Nguyen TTD, Le NQK, Tran TA, Pham DM, Ou YY. Incorporating a transfer learning technique with amino acid embeddings to efficiently predict N-linked glycosylation sites in ion channels. Comput Biol Med 2021; 130:104212. [PMID: 33454535 DOI: 10.1016/j.compbiomed.2021.104212] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/30/2020] [Revised: 12/21/2020] [Accepted: 01/04/2021] [Indexed: 11/27/2022]
Abstract
Glycosylation is a dynamic enzymatic process that attaches glycans to proteins or other organic molecules such as lipoproteins. Research has shown that such a process in ion channel proteins plays a fundamental role in modulating ion channel functions. This study used a computational method to predict N-linked glycosylation sites, the most common type, in ion channel proteins. From segments of ion channel proteins centered around N-linked glycosylation sites, the amino acid embedding vectors of each residue were concatenated to create features for prediction. We experimented with two different models for converting amino acids to their corresponding embeddings: one was fed with ion channel sequences and the other with a large dataset composed of more than one million protein sequences. The latter model stemmed from the idea of transfer learning and emerged as a more efficient feature extractor. Our best model was obtained from this transfer learning approach and a hyperparameter tuning process with a random search on 5-fold cross-validation data. It achieved an accuracy, specificity, sensitivity, and Matthews correlation coefficient of 93.4%, 92.8%, 98.6%, and 0.726, respectively. Corresponding scores on an independent test were 92.9%, 92.2%, 99%, and 0.717. These results outperform the position-specific scoring matrix features that are predominantly employed in post-translational modification site predictions. Furthermore, compared to N-GlyDE, GlycoEP, and SPRINT-Gly, the most recent N-linked glycosylation site predictors, our model yields higher scores on the above four metrics, thus further demonstrating the efficiency of our approach.
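For readers comparing the four metrics quoted in this and neighbouring abstracts, the sketch below shows how accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC) follow from a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical and are not taken from any of the cited papers.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute standard binary classification metrics from confusion counts.

    tp, tn, fp, fn: true positives, true negatives, false positives,
    false negatives. Ratios with a zero denominator are reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # recall on the positive class
        "specificity": ratio(tn, tn + fp),   # recall on the negative class
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts, chosen only to exercise the formulas.
scores = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(scores)
```

Because MCC uses all four cells of the confusion matrix, it can stay well below accuracy when the classes are imbalanced, which is one reason abstracts often report it alongside accuracy.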
Affiliation(s)
- Nguyen-Quoc-Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, 106, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, 106, Taiwan
- Dinh-Minh Pham
  - Institute of Biotechnology, Vietnam Academy of Science and Technology, Hanoi, Viet Nam
- Yu-Yen Ou
  - Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
61
Machine-learning Applications to Membrane Active Peptides. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11544-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022] Open
62
Zheng D, Pang G, Liu B, Chen L, Yang J. Learning transferable deep convolutional neural networks for the classification of bacterial virulence factors. Bioinformatics 2020; 36:3693-3702. [PMID: 32251507 DOI: 10.1093/bioinformatics/btaa230] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/02/2019] [Revised: 03/25/2020] [Accepted: 04/01/2020] [Indexed: 12/23/2022] Open
Abstract
MOTIVATION Identification of virulence factors (VFs) is critical to the elucidation of bacterial pathogenesis and prevention of related infectious diseases. Current computational methods for VF prediction focus on binary classification or involve only several class(es) of VFs with sufficient samples. However, thousands of VF classes are present in real-world scenarios, and many of them only have a very limited number of samples available. RESULTS We first construct a large VF dataset, covering 3446 VF classes with 160 495 sequences, and then propose deep convolutional neural network models for VF classification. We show that (i) for common VF classes with sufficient samples, our models can achieve state-of-the-art performance with an overall accuracy of 0.9831 and an F1-score of 0.9803; (ii) for uncommon VF classes with limited samples, our models can learn transferable features from auxiliary data and achieve good performance with accuracy ranging from 0.9277 to 0.9512 and F1-score ranging from 0.9168 to 0.9446 when combined with different predefined features, outperforming traditional classifiers by 1-13% in accuracy and by 1-16% in F1-score. AVAILABILITY AND IMPLEMENTATION All of our datasets are made publicly available at http://www.mgc.ac.cn/VFNet/, and the source code of our models is publicly available at https://github.com/zhengdd0422/VFNet. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dandan Zheng
  - NHC Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100176, China
- Guansong Pang
  - Australian Institute for Machine Learning, The University of Adelaide, Adelaide, SA 5005, Australia
- Bo Liu
  - NHC Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100176, China
- Lihong Chen
  - NHC Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100176, China
- Jian Yang
  - NHC Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100176, China
63
Puentes PR, Henao MC, Torres CE, Gómez SC, Gómez LA, Burgos JC, Arbeláez P, Osma JF, Muñoz-Camargo C, Reyes LH, Cruz JC. Design, Screening, and Testing of Non-Rational Peptide Libraries with Antimicrobial Activity: In Silico and Experimental Approaches. Antibiotics (Basel) 2020; 9:E854. [PMID: 33265897 PMCID: PMC7759991 DOI: 10.3390/antibiotics9120854] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 11/05/2020] [Revised: 11/20/2020] [Accepted: 11/23/2020] [Indexed: 12/13/2022] Open
Abstract
One of the challenges of modern biotechnology is to find new routes to mitigate resistance to conventional antibiotics. Antimicrobial peptides (AMPs) are an alternative type of biomolecule, naturally present in a wide variety of organisms, with the capacity to overcome the current microorganism resistance threat. Here, we reviewed our recent efforts to develop a new library of non-rationally produced AMPs that relies on the inherent diversity of bacterial genomes and compared it with rationally designed libraries. Our approach is based on a four-stage workflow that incorporates the interplay of recent developments in four major emerging technologies: artificial intelligence, molecular dynamics, surface display in microorganisms, and microfluidics. Implementing this framework is challenging because, to obtain reliable results, the in silico algorithms that search for candidate AMPs need to overcome issues of the state-of-the-art approaches that limit the possibilities for multi-space data distribution analyses in extremely large databases. We expect to tackle this challenge by using a recently developed classification algorithm based on deep learning models that rely on convolutional layers and gated recurrent units. This will be complemented by carefully tailored molecular dynamics simulations to elucidate specific interactions with lipid bilayers. Candidate AMPs will be recombinantly expressed on the surface of microorganisms for further screening via different droplet-based microfluidic strategies to identify AMPs with the desired lytic abilities. We believe that the proposed approach opens opportunities for searching and screening bioactive peptides for other applications.
Collapse
Affiliation(s)
- Paola Ruiz Puentes
- Center for Research and Formation in Artificial Intelligence, Universidad de los Andes, Bogota DC 111711, Colombia; (P.R.P.); (P.A.)
- Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia; (C.E.T.); (S.C.G.); (L.A.G.); (C.M.-C.)
- María C. Henao
- Grupo de Diseño de Productos y Procesos, Department of Chemical and Food Engineering, Universidad de los Andes, Bogota DC 111711, Colombia;
- Carlos E. Torres
- Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia; (C.E.T.); (S.C.G.); (L.A.G.); (C.M.-C.)
- Saúl C. Gómez
- Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia; (C.E.T.); (S.C.G.); (L.A.G.); (C.M.-C.)
- Laura A. Gómez
- Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia; (C.E.T.); (S.C.G.); (L.A.G.); (C.M.-C.)
- Juan C. Burgos
- Chemical Engineering Program, Universidad de Cartagena, Cartagena 130015, Colombia;
- Pablo Arbeláez
- Center for Research and Formation in Artificial Intelligence, Universidad de los Andes, Bogota DC 111711, Colombia; (P.R.P.); (P.A.)
- Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia; (C.E.T.); (S.C.G.); (L.A.G.); (C.M.-C.)
- Johann F. Osma
- Department of Electrical and Electronic Engineering, Universidad de los Andes, Bogota DC 111711, Colombia;
- Carolina Muñoz-Camargo
- Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia; (C.E.T.); (S.C.G.); (L.A.G.); (C.M.-C.)
- Luis H. Reyes
- Grupo de Diseño de Productos y Procesos, Department of Chemical and Food Engineering, Universidad de los Andes, Bogota DC 111711, Colombia;
- Juan C. Cruz
- Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia; (C.E.T.); (S.C.G.); (L.A.G.); (C.M.-C.)
- School of Chemical Engineering and Advanced Materials, The University of Adelaide, Adelaide 5005, Australia
|
64
|
Huan Y, Kong Q, Mou H, Yi H. Antimicrobial Peptides: Classification, Design, Application and Research Progress in Multiple Fields. Front Microbiol 2020; 11:582779. [PMID: 33178164 PMCID: PMC7596191 DOI: 10.3389/fmicb.2020.582779] [Citation(s) in RCA: 785] [Impact Index Per Article: 157.0] [Received: 07/13/2020] [Accepted: 09/23/2020] [Indexed: 12/12/2022] Open
Abstract
Antimicrobial peptides (AMPs) are a class of small peptides that exist widely in nature and are an important part of the innate immune system of different organisms. AMPs have a wide range of inhibitory effects against bacteria, fungi, parasites and viruses. The emergence of antibiotic-resistant microorganisms and increasing concerns about the use of antibiotics have driven the development of AMPs, which have good application prospects in medicine, food, animal husbandry, agriculture and aquaculture. This review introduces the progress of research on AMPs comprehensively and systematically, including their classification, mechanism of action, design methods, environmental factors affecting their activity, application status, prospects in various fields and problems to be solved. The research progress on antiviral peptides, especially anti-coronavirus (COVID-19) peptides, is also introduced, given the worldwide COVID-19 pandemic in 2020.
Affiliation(s)
- Qing Kong
- College of Food Science and Engineering, Ocean University of China, Qingdao, China
|
65
|
Wang H, Wang Z, Li Z, Lee TY. Incorporating Deep Learning With Word Embedding to Identify Plant Ubiquitylation Sites. Front Cell Dev Biol 2020; 8:572195. [PMID: 33102477 PMCID: PMC7554246 DOI: 10.3389/fcell.2020.572195] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 06/13/2020] [Accepted: 08/24/2020] [Indexed: 12/17/2022] Open
Abstract
Protein ubiquitylation is an important posttranslational modification (PTM) that is involved in diverse biological processes and plays an essential role in the regulation of physiological mechanisms and diseases. The Protein Lysine Modifications Database (PLMD) has accumulated abundant ubiquitylated proteins with their substrate sites for more than 20 species. Numerous works have consequently developed a variety of ubiquitylation site prediction tools across all species, mainly relying on predefined sequence features and machine learning algorithms. However, the difference in ubiquitylation patterns between these species remains unclear. In this work, the sequence-based characterization of ubiquitylated substrate sites has revealed remarkable differences among plants, animals, and fungi. An improved word-embedding scheme based on the transfer learning strategy was then combined with a multilayer convolutional neural network (CNN) for identifying protein ubiquitylation sites. For the prediction of plant ubiquitylation sites, the proposed deep learning scheme could outperform the machine learning-based methods, with an accuracy of 75.6%, precision of 73.3%, recall of 76.7%, F-score of 0.7493, and 0.82 AUC on the independent testing set. Although the ubiquitylation specificity of substrate sites is complicated, this work has demonstrated that the word-embedding method can enable the extraction of informative features and help the identification of ubiquitylated sites. To accelerate the investigation of protein ubiquitylation, the data sets and source code used in this study are freely available at https://github.com/wang-hong-fei/DL-plant-ubsites-prediction.
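The F-score quoted in this abstract can be related to the quoted precision and recall with a short calculation. The sketch below assumes the F-score is the standard F1 (harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values as quoted in the abstract (73.3% precision, 76.7% recall).
precision, recall = 0.733, 0.767
f1 = f_score(precision, recall)
print(f"F1 from the rounded inputs: {f1:.4f}")
```

With the rounded inputs this gives roughly 0.7496, close to the reported 0.7493; a gap of that size is what one would expect if the published figure was computed from unrounded precision and recall.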
Affiliation(s)
- Hongfei Wang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China
- Zhuo Wang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China; School of Life Sciences, University of Science and Technology of China, Hefei, China
- Zhongyan Li
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China; School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, China
- Tzong-Yi Lee
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China; School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, China
|
66
|
Carpenter K, Pilozzi A, Huang X. A Pilot Study of Multi-Input Recurrent Neural Networks for Drug-Kinase Binding Prediction. Molecules 2020; 25:E3372. [PMID: 32722290 PMCID: PMC7435591 DOI: 10.3390/molecules25153372] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/24/2020] [Revised: 07/20/2020] [Accepted: 07/24/2020] [Indexed: 01/01/2023] Open
Abstract
The use of virtual drug screening can be beneficial to research teams, enabling them to narrow down potentially useful compounds for further study. A variety of virtual screening methods have been developed, typically with machine learning classifiers at the center of their design. In the present study, we created a virtual screener for protein kinase inhibitors. Experimental compound-target interaction data were obtained from the IDG-DREAM Drug-Kinase Binding Prediction Challenge. These data were converted and fed as inputs into two multi-input recurrent neural networks (RNNs). The first network utilized data encoded in one-hot representation, while the other incorporated embedding layers. The models were developed in Python, and were designed to output the IC50 of the target compounds. The performance of the models was assessed primarily through analysis of the Q2 values produced from runs of differing sample and epoch size; recorded loss values were also reported and graphed. The performance of the models was limited, though multiple changes are proposed for potential improvement of a multi-input recurrent neural network-based screening tool.
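The difference between the two input encodings mentioned in this abstract can be illustrated with a small sketch. It assumes character-level string inputs (for example SMILES-like strings), which the abstract does not specify; the helper names and the toy strings are hypothetical and are not taken from the paper.

```python
def build_vocab(sequences):
    """Map every distinct character to an integer index (sorted for determinism)."""
    symbols = sorted({ch for seq in sequences for ch in seq})
    return {ch: i for i, ch in enumerate(symbols)}


def to_indices(seq, vocab):
    """Integer encoding: the form an embedding layer would take as input."""
    return [vocab[ch] for ch in seq]


def to_one_hot(seq, vocab):
    """One-hot encoding: one row per character, with a single 1 per row."""
    size = len(vocab)
    rows = []
    for idx in to_indices(seq, vocab):
        row = [0] * size
        row[idx] = 1
        rows.append(row)
    return rows


# Toy inputs, assumed for illustration only.
toy_sequences = ["CCO", "CCN"]
vocab = build_vocab(toy_sequences)
indices = to_indices("CCO", vocab)
one_hot = to_one_hot("CCO", vocab)
print(vocab, indices, one_hot)
```

A one-hot input grows with the vocabulary size and carries no notion of similarity between symbols, whereas integer indices passed through an embedding layer let the network learn a dense vector per symbol.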
Affiliation(s)
- Xudong Huang
- Neurochemistry Laboratory, Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Charlestown, MA 02129, USA; (K.C.); (A.P.)
|
67
|
Liu Z, Chen Q, Lan W, Liang J, Chen YPP, Chen B. A Survey of Network Embedding for Drug Analysis and Prediction. Curr Protein Pept Sci 2020; 22:CPPS-EPUB-107859. [PMID: 32614745 DOI: 10.2174/1389203721666200702145701] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/01/2020] [Revised: 04/05/2020] [Accepted: 05/21/2020] [Indexed: 11/22/2022]
Abstract
Traditional network-based computational methods have shown good results in drug analysis and prediction. However, these methods are time consuming and lack universality, and it is difficult for them to exploit the auxiliary information of nodes and edges. Network embedding provides a promising way to alleviate these problems by transforming a network into a low-dimensional space while preserving network structure and auxiliary information. This facilitates the application of machine learning algorithms for subsequent processing. Network embedding has been introduced into drug analysis and prediction in the last few years and has shown superior performance over traditional methods. However, there is no systematic review of this topic. This article offers a comprehensive survey of the primary network embedding methods and their applications in drug analysis and prediction. The network embedding technologies applied in homogeneous and heterogeneous networks are investigated and compared, including matrix decomposition, random walk, and deep learning. In particular, graph neural network (GNN) methods in deep learning are highlighted. Further, the applications of network embedding in drug similarity estimation, drug-target interaction prediction, adverse drug reaction prediction, and protein function and therapeutic peptide prediction are discussed. Several potential future research directions are also discussed.
Affiliation(s)
- Zhixian Liu
- School of Medical, Guangxi University, Nanning, China
- Qingfeng Chen
- School of Computer, Electronic and Information, Guangxi University, Nanning, China
- Wei Lan
- School of Computer, Electronic and Information, Guangxi University, Nanning, China
- Jiahai Liang
- School of Electronics and Information Engineering, Beibu Gulf University, Qinzhou, China
- Yi-Ping Phoebe Chen
- Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia
- Baoshan Chen
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, Guangxi University, Nanning, China
|
68
|
Xie R, Li J, Wang J, Dai W, Leier A, Marquez-Lago TT, Akutsu T, Lithgow T, Song J, Zhang Y. DeepVF: a deep learning-based hybrid framework for identifying virulence factors using the stacking strategy. Brief Bioinform 2020; 22:5864586. [PMID: 32599617 DOI: 10.1093/bib/bbaa125] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 10/18/2019] [Revised: 05/22/2020] [Accepted: 05/22/2020] [Indexed: 12/14/2022] Open
Abstract
Virulence factors (VFs) enable pathogens to infect their hosts. A wealth of individual, disease-focused studies has identified a wide variety of VFs, and the growing mass of bacterial genome sequence data provides an opportunity for computational methods aimed at predicting VFs. Despite their attractive advantages and performance improvements, the existing methods have some limitations and drawbacks. Firstly, as the characteristics and mechanisms of VFs are continually evolving with the emergence of antibiotic resistance, it is increasingly difficult to identify novel VFs using existing tools that were developed on outdated data sets; secondly, few systematic feature engineering efforts have been made to examine the utility of different types of features for model performance, as the majority of tools extract only very few types of features. By addressing these issues, the accuracy of VF predictors can likely be significantly improved. This, in turn, would be particularly useful in the context of genome-wide predictions of VFs. In this work, we present a deep learning (DL)-based hybrid framework (termed DeepVF) that utilizes the stacking strategy to achieve more accurate identification of VFs. Using an enlarged, up-to-date dataset, DeepVF comprehensively explores a wide range of heterogeneous features with popular machine learning algorithms. Specifically, four classical algorithms, including random forest, support vector machines, extreme gradient boosting and multilayer perceptron, and three DL algorithms, including convolutional neural networks, long short-term memory networks and deep neural networks, are employed to train 62 baseline models using these features. In order to integrate their individual strengths, DeepVF effectively combines these baseline models to construct the final meta model using the stacking strategy.
Extensive benchmarking experiments demonstrate the effectiveness of DeepVF: it achieves a more accurate and stable performance compared with baseline models on the benchmark dataset and clearly outperforms state-of-the-art VF predictors on the independent test. Using the proposed hybrid ensemble model, a user-friendly online predictor of DeepVF (http://deepvf.erc.monash.edu/) is implemented. Furthermore, its utility, from the user's viewpoint, is compared with that of existing toolkits. We believe that DeepVF will be exploited as a useful tool for screening and identifying potential VFs from protein-coding gene sequences in bacterial genomes.
Affiliation(s)
- Ruopeng Xie
- Bioinformatics Lab at Guilin University of Electronic Technology
- Jiahui Li
- Bioinformatics Lab at Guilin University of Electronic Technology
- Jiawei Wang
- Biomedicine Discovery Institute and the Department of Microbiology at Monash University, Australia
- Wei Dai
- School of Computer Science and Information Security, Guilin University of Electronic Technology, China
- André Leier
- Department of Genetics and the Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham (UAB) School of Medicine, USA
- Tatiana T Marquez-Lago
- Department of Genetics and the Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham (UAB) School of Medicine, USA
- Trevor Lithgow
- Biomedicine Discovery Institute and the Director of the Centre to Impact AMR at Monash University, Australia
- Jiangning Song
- Group Leader in the Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
- Yanju Zhang
- Leiden Institute of Advanced Computer Science, Leiden University
|
69
|
Xie W, Luo J, Pan C, Liu Y. SG-LSTM-FRAME: a computational frame using sequence and geometrical information via LSTM to predict miRNA-gene associations. Brief Bioinform 2020; 22:2032-2042. [PMID: 32181478 DOI: 10.1093/bib/bbaa022] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/29/2019] [Revised: 02/10/2020] [Accepted: 02/11/2020] [Indexed: 12/19/2022] Open
Abstract
MOTIVATION MicroRNAs (miRNAs) regulate target genes and are responsible for lethal diseases such as cancers. Accurately recognizing and identifying miRNA and gene pairs could be helpful in deciphering the mechanism by which miRNA affects and regulates the development of cancers. Embedding methods and deep learning methods have shown excellent performance in traditional classification tasks in many scenarios. However, few attempts have adapted and merged these two methods for miRNA-gene relationship prediction. Hence, we proposed a novel computational framework. We first generated representational features for miRNAs and genes using both sequence and geometrical information and then leveraged a deep learning method to predict the associations. RESULTS We used long short-term memory (LSTM) to predict potential relationships and showed that our method outperformed other state-of-the-art methods. Results showed that our framework SG-LSTM achieved an area under the curve of 0.94 and was superior to other methods. In the case study, we predicted the top 10 miRNA-gene relationships and recommended the top 10 potential genes for hsa-miR-335-5p for SG-LSTM-core. We also tested our model using a larger dataset, from which 14 668 698 miRNA-gene pairs were predicted. The top 10 unknown pairs were also listed. AVAILABILITY Our work can be downloaded from https://github.com/Xshelton/SG_LSTM. CONTACT luojiawei@hnu.edu.cn. SUPPLEMENTARY INFORMATION Supplementary data are available at Briefings in Bioinformatics online.
Affiliation(s)
- Weidun Xie
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, Hunan, China
- Jiawei Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, Hunan, China
- Chu Pan
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, Hunan, China
- Ying Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, Hunan, China
|
70
|
Grisoni F, Moret M, Lingwood R, Schneider G. Bidirectional Molecule Generation with Recurrent Neural Networks. J Chem Inf Model 2020; 60:1175-1183. [PMID: 31904964 DOI: 10.1021/acs.jcim.9b00943] [Citation(s) in RCA: 104] [Impact Index Per Article: 20.8] [Indexed: 02/08/2023]
Abstract
Recurrent neural networks (RNNs) are able to generate de novo molecular designs using simplified molecular input line entry system (SMILES) string representations of the chemical structure. RNN-based structure generation is usually performed unidirectionally, by growing SMILES strings from left to right. However, there is no natural start or end of a small molecule, and SMILES strings are intrinsically nonunivocal representations of molecular graphs. These properties motivate bidirectional structure generation. Here, bidirectional generative RNNs for SMILES-based molecule design are introduced. To this end, two established bidirectional methods were implemented, and a new method for SMILES string generation and data augmentation is introduced: bidirectional molecule design by alternate learning (BIMODAL). These three bidirectional strategies were compared to the unidirectional forward RNN approach for SMILES string generation in terms of the (i) novelty, (ii) scaffold diversity, and (iii) chemical-biological relevance of the computer-generated molecules. The results positively advocate bidirectional strategies for SMILES-based molecular de novo design, with BIMODAL showing superior results to the unidirectional forward RNN for most of the criteria in the tested conditions. The code of the methods and the pretrained models can be found at https://github.com/ETHmodlab/BIMODAL.
Affiliation(s)
- Francesca Grisoni
- Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland
- Michael Moret
- Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland
- Robin Lingwood
- Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland
- Gisbert Schneider
- Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland
|
71
|
Le NQK, Yapp EKY, Nagasundaram N, Yeh HY. Classifying Promoters by Interpreting the Hidden Information of DNA Sequences via Deep Learning and Combination of Continuous FastText N-Grams. Front Bioeng Biotechnol 2019; 7:305. [PMID: 31750297 PMCID: PMC6848157 DOI: 10.3389/fbioe.2019.00305] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Received: 08/13/2019] [Accepted: 10/17/2019] [Indexed: 01/16/2023] Open
Abstract
A promoter is a short region of DNA (100-1,000 bp) where transcription of a gene by RNA polymerase begins. It is typically located directly upstream or at the 5' end of the transcription initiation site. DNA promoter has been proven to be the primary cause of many human diseases, especially diabetes, cancer, or Huntington's disease. Therefore, classifying promoters has become an interesting problem that has attracted the attention of many researchers in the bioinformatics field. A variety of studies have been conducted to resolve this problem; however, their performance still requires further improvement. In this study, we present an innovative approach that interprets DNA sequences as a combination of continuous FastText N-grams, which are then fed into a deep neural network in order to classify them. Our approach is able to attain a cross-validation accuracy of 85.41% and 73.1% in the two layers, respectively. Our results outperformed the state-of-the-art methods on the same dataset, especially in the second layer (strength classification). Through this study, promoter regions can be identified with high accuracy, and the study provides analysis for further biological research as well as precision medicine. In addition, this study opens new paths for natural language processing applications in omics data in general and DNA sequences in particular.
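The idea of reading a DNA sequence as "words" with sub-word n-grams can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the word length `k`, the n-gram range, and the helper names are assumptions chosen for the example. The `<` and `>` boundary markers follow the way FastText delimits a word before extracting its character n-grams.

```python
def kmers(sequence, k, stride=1):
    """Split a sequence into overlapping k-mers, treated here as 'words'."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def char_ngrams(word, n_min, n_max):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


dna = "ATGCGT"                           # toy sequence, for illustration only
words = kmers(dna, k=3)                  # overlapping 3-mers
subwords = char_ngrams(words[0], 2, 3)   # sub-word units of the first 3-mer
print(words)
print(subwords)
```

In a FastText-style model, the vector of a word is built from the vectors of its n-grams, so k-mers that share sub-sequences end up with related representations even if one of them was rare in training.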
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
- N. Nagasundaram
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, Singapore, Singapore
- Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, Singapore, Singapore
|
72
|
Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci Rep 2019; 9:3577. [PMID: 30837494 PMCID: PMC6401088 DOI: 10.1038/s41598-019-38746-w] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 07/27/2018] [Accepted: 12/19/2018] [Indexed: 12/28/2022] Open
Abstract
In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning task in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. DiMotif, in general, obtained high recall scores, while having a comparable F1 score with other methods in the discovery of experimentally verified motifs. Having high recall suggests that DiMotif can be used for short-list creation for further experimental investigations on motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences.
We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acids k-mer features.
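The BPE-style merge learning that PPE builds on can be sketched in a few lines. This is a minimal, deterministic illustration of plain byte-pair encoding applied to amino-acid strings; it does not implement the sampling framework that the abstract says PPE adds, and the function names and toy sequences are assumptions made for the example.

```python
from collections import Counter


def most_frequent_pair(tokenized):
    """Return the most frequent adjacent token pair, or None if there is none."""
    counts = Counter()
    for tokens in tokenized:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of merge operations from training sequences."""
    tokenized = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokenized)
        if pair is None:
            break
        merges.append(pair)
        tokenized = [merge_pair(tokens, pair) for tokens in tokenized]
    return merges


def segment(sequence, merges):
    """Apply learned merges, in order, to a possibly unseen sequence."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Toy training set, assumed for illustration only.
merges = learn_merges(["MKVLMKV", "MKVA"], num_merges=2)
segments = segment("AMKV", merges)
print(segments)
```

Learning the merges on one corpus and applying them to unseen sequences mirrors the abstract's description of learning segmentation steps on Swiss-Prot or a domain-specific dataset; the resulting variable-length tokens are what an embedding such as ProtVecX would then be trained on.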
|
73
|
Le NQK, Yapp EKY, Ho QT, Nagasundaram N, Ou YY, Yeh HY. iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding. Anal Biochem 2019; 571:53-61. [PMID: 30822398 DOI: 10.1016/j.ab.2019.02.017] [Citation(s) in RCA: 88] [Impact Index Per Article: 14.7] [Received: 01/22/2019] [Revised: 02/17/2019] [Accepted: 02/19/2019] [Indexed: 12/22/2022]
Abstract
An enhancer is a short (50-1500 bp) region of DNA that plays an important role in gene expression and the production of RNA and proteins. Genetic variation in enhancers has been linked to many human diseases, such as cancer, disorder or inflammatory bowel disease. Due to the importance of enhancers in genomics, the classification of enhancers has become a popular area of research in computational biology. Although a few computational tools have been employed to address this problem, their performance still requires improvement. In this study, we represent enhancers by word embeddings that include sub-word information of their biological words, which then serve as features for a support vector machine algorithm that classifies them. We present iEnhancer-5Step, a web server containing two-layer classifiers to identify enhancers and their strength. We are able to attain an independent test accuracy of 79% and 63.5% in the two layers, respectively. On the same dataset, our proposed method yields superior performance compared with current predictors. Moreover, this study provides a basis for further research that can enrich the field of applying natural language processing techniques to biological sequences. iEnhancer-5Step is freely accessible via http://biologydeep.com/fastenc/.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore
- Edward Kien Yee Yapp
- Singapore Institute of Manufacturing Technology, 2 Fusionopolis Way, #08-04, Innovis, 138634, Singapore
- Quang-Thai Ho
- Department of Computer Science and Engineering, Yuan Ze University, 32003, Taiwan
- N Nagasundaram
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore
- Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, 32003, Taiwan
- Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore
|