Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Hamid MN, Friedberg I. Identifying antimicrobial peptides using word embedding with deep recurrent neural networks. Bioinformatics 2019;35:2009-2016. [PMID: 30418485 PMCID: PMC6581433 DOI: 10.1093/bioinformatics/bty937] [Citation(s) in RCA: 72] [Impact Index Per Article: 12.0] [Received: 06/18/2018] [Revised: 08/27/2018] [Accepted: 11/08/2018] [Indexed: 12/11/2022] Open

Cited by Other Article(s)

ActTRANS: Functional classification in active transport proteins based on transfer learning and contextual representations. Comput Biol Chem 2021;93:107537. [PMID: 34217007 DOI: 10.1016/j.compbiolchem.2021.107537] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/14/2020] [Revised: 05/09/2021] [Accepted: 06/26/2021] [Indexed: 01/08/2023]

Abstract

MOTIVATION

Primary and secondary active transport are two types of active transport that use energy to move substances. Active transport mechanisms use proteins to assist in transport and play essential roles in regulating the traffic of ions or small molecules across a cell membrane against the concentration gradient. In this study, the two main types of proteins involved in such transport are classified from transmembrane transport proteins. We propose a Support Vector Machine (SVM) with contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) to represent protein sequences. BERT, a deep learning language representation model developed by Google, is a powerful model for transfer learning and one of the highest-performing pre-trained models for Natural Language Processing (NLP) tasks. Transfer learning with a pre-trained BERT model is applied to extract fixed feature vectors from the hidden layers and to learn contextual relations between amino acids in the protein sequence. The contextualized word representations of proteins are thus introduced to effectively model complex structures of amino acids in the sequence and the variations of these amino acids in context. By generating context information, we capture multiple meanings for the same amino acid to reveal the importance of specific residues in the protein sequence.

RESULTS

The performance of the proposed method is evaluated using five-fold cross-validation and an independent test. The proposed method achieves accuracies of 85.44%, 88.74% and 92.84% for Class-1, Class-2 and Class-3, respectively. Experimental results show that this approach, by using context information, can outperform other feature extraction methods, effectively classify the two types of active transport and improve overall performance.
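The five-fold cross-validation protocol mentioned in this abstract can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes contiguous, unshuffled folds and precomputed fixed-length feature vectors, and `fit` and `predict` are hypothetical callables standing in for whatever classifier is trained (an SVM in the abstract).

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs so that every sample is tested exactly once.

    Assumes n_samples >= k; folds are contiguous and not shuffled (an assumption).
    """
    # Distribute the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(features, labels, fit, predict, k=5):
    """Return the mean accuracy over k folds.

    `fit(train_features, train_labels)` returns a model and
    `predict(model, feature)` returns a label; both are hypothetical placeholders.
    """
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(features), k):
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

In practice the samples would usually be shuffled or stratified by class before splitting; the abstract does not say which was done.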

Littmann M, Bordin N, Heinzinger M, Schütze K, Dallago C, Orengo C, Rost B. Clustering FunFams using sequence embeddings improves EC purity. Bioinformatics 2021;37:3449-3455. [PMID: 33978744 PMCID: PMC8545299 DOI: 10.1093/bioinformatics/btab371] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 01/25/2021] [Revised: 04/02/2021] [Accepted: 05/11/2021] [Indexed: 12/05/2022] Open

Abstract

Motivation

Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be ‘pure’, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations.

Results

We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.
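The clustering step described in this abstract (DBSCAN over distances between embeddings, with unassigned points treated as outliers) can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the Euclidean metric, the `eps` and `min_pts` values, and the toy 2-D vectors are all assumptions made for illustration, and real protein embeddings are far higher-dimensional.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for outliers."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual DBSCAN definition.
        return [j for j in range(len(points)) if euclidean(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels
```

For example, with two tight groups of toy vectors and one distant vector, the distant vector is labeled `-1`, which is how outliers within a family would be flagged under these assumptions.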

Availability and implementation

Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering.

Supplementary information

Supplementary data are available at Bioinformatics online.

Sharma R, Shrivastava S, Kumar Singh S, Kumar A, Saxena S, Kumar Singh R. Deep-ABPpred: identifying antibacterial peptides in protein sequences using bidirectional LSTM with word2vec. Brief Bioinform 2021;22:6204762. [PMID: 33784381 DOI: 10.1093/bib/bbab065] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Received: 11/23/2020] [Revised: 02/04/2021] [Indexed: 12/13/2022] Open

Khanal J, Tayara H, Zou Q, Chong KT. Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation. Comput Struct Biotechnol J 2021;19:1612-1619. [PMID: 33868598 PMCID: PMC8042287 DOI: 10.1016/j.csbj.2021.03.015] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 01/09/2021] [Revised: 03/12/2021] [Accepted: 03/13/2021] [Indexed: 12/11/2022] Open

Auslander N, Gussow AB, Koonin EV. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int J Mol Sci 2021;22:2903. [PMID: 33809353 PMCID: PMC8000113 DOI: 10.3390/ijms22062903] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 02/15/2021] [Revised: 03/08/2021] [Accepted: 03/10/2021] [Indexed: 12/23/2022] Open

Correia A, Weimann A. Protein antibiotics: mind your language. Nat Rev Microbiol 2021;19:7. [PMID: 33219332 DOI: 10.1038/s41579-020-00485-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/15/2022]

Charoenkwan P, Nantasenamat C, Hasan MM, Manavalan B, Shoombuatong W. BERT4Bitter: a bidirectional encoder representations from transformers (BERT)-based model for improving the prediction of bitter peptides. Bioinformatics 2021;37:2556-2562. [PMID: 33638635 DOI: 10.1093/bioinformatics/btab133] [Citation(s) in RCA: 102] [Impact Index Per Article: 25.5] [Received: 12/29/2020] [Revised: 02/08/2021] [Accepted: 02/24/2021] [Indexed: 12/11/2022] Open

Ali Shah SM, Taju SW, Ho QT, Nguyen TTD, Ou YY. GT-Finder: Classify the family of glucose transporters with pre-trained BERT language models. Comput Biol Med 2021;131:104259. [PMID: 33581474 DOI: 10.1016/j.compbiomed.2021.104259] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/04/2020] [Revised: 02/04/2021] [Accepted: 02/04/2021] [Indexed: 12/14/2022]

Wahab A, Tayara H, Xuan Z, Chong KT. DNA sequences performs as natural language processing by exploiting deep learning algorithm for the identification of N4-methylcytosine. Sci Rep 2021;11:212. [PMID: 33420191 PMCID: PMC7794489 DOI: 10.1038/s41598-020-80430-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 10/01/2020] [Accepted: 12/14/2020] [Indexed: 12/17/2022] Open

Nguyen TTD, Le NQK, Tran TA, Pham DM, Ou YY. Incorporating a transfer learning technique with amino acid embeddings to efficiently predict N-linked glycosylation sites in ion channels. Comput Biol Med 2021;130:104212. [PMID: 33454535 DOI: 10.1016/j.compbiomed.2021.104212] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/30/2020] [Revised: 12/21/2020] [Accepted: 01/04/2021] [Indexed: 11/27/2022]

Machine-learning Applications to Membrane Active Peptides. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11544-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022] Open

Zheng D, Pang G, Liu B, Chen L, Yang J. Learning transferable deep convolutional neural networks for the classification of bacterial virulence factors. Bioinformatics 2020;36:3693-3702. [PMID: 32251507 DOI: 10.1093/bioinformatics/btaa230] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/02/2019] [Revised: 03/25/2020] [Accepted: 04/01/2020] [Indexed: 12/23/2022] Open

Puentes PR, Henao MC, Torres CE, Gómez SC, Gómez LA, Burgos JC, Arbeláez P, Osma JF, Muñoz-Camargo C, Reyes LH, Cruz JC. Design, Screening, and Testing of Non-Rational Peptide Libraries with Antimicrobial Activity: In Silico and Experimental Approaches. Antibiotics (Basel) 2020;9:E854. [PMID: 33265897 PMCID: PMC7759991 DOI: 10.3390/antibiotics9120854] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 11/05/2020] [Revised: 11/20/2020] [Accepted: 11/23/2020] [Indexed: 12/13/2022] Open

Affiliation(s)

Paola Ruiz Puentes: Center for Research and Formation in Artificial Intelligence, Universidad de los Andes, Bogota DC 111711, Colombia; Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia
María C. Henao: Grupo de Diseño de Productos y Procesos, Department of Chemical and Food Engineering, Universidad de los Andes, Bogota DC 111711, Colombia
Carlos E. Torres: Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia
Saúl C. Gómez: Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia
Laura A. Gómez: Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia
Juan C. Burgos: Chemical Engineering Program, Universidad de Cartagena, Cartagena 130015, Colombia
Pablo Arbeláez: Center for Research and Formation in Artificial Intelligence, Universidad de los Andes, Bogota DC 111711, Colombia; Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia
Johann F. Osma: Department of Electrical and Electronic Engineering, Universidad de los Andes, Bogota DC 111711, Colombia
Carolina Muñoz-Camargo: Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia
Luis H. Reyes: Grupo de Diseño de Productos y Procesos, Department of Chemical and Food Engineering, Universidad de los Andes, Bogota DC 111711, Colombia
Juan C. Cruz: Department of Biomedical Engineering, Universidad de los Andes, Bogota DC 111711, Colombia; School of Chemical Engineering and Advanced Materials, The University of Adelaide, Adelaide 5005, Australia

Huan Y, Kong Q, Mou H, Yi H. Antimicrobial Peptides: Classification, Design, Application and Research Progress in Multiple Fields. Front Microbiol 2020;11:582779. [PMID: 33178164 PMCID: PMC7596191 DOI: 10.3389/fmicb.2020.582779] [Citation(s) in RCA: 785] [Impact Index Per Article: 157.0] [Received: 07/13/2020] [Accepted: 09/23/2020] [Indexed: 12/12/2022] Open

Wang H, Wang Z, Li Z, Lee TY. Incorporating Deep Learning With Word Embedding to Identify Plant Ubiquitylation Sites. Front Cell Dev Biol 2020;8:572195. [PMID: 33102477 PMCID: PMC7554246 DOI: 10.3389/fcell.2020.572195] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 06/13/2020] [Accepted: 08/24/2020] [Indexed: 12/17/2022] Open

Carpenter K, Pilozzi A, Huang X. A Pilot Study of Multi-Input Recurrent Neural Networks for Drug-Kinase Binding Prediction. Molecules 2020;25:E3372. [PMID: 32722290 PMCID: PMC7435591 DOI: 10.3390/molecules25153372] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/24/2020] [Revised: 07/20/2020] [Accepted: 07/24/2020] [Indexed: 01/01/2023] Open

Liu Z, Chen Q, Lan W, Liang J, Chen YPP, Chen B. A Survey of Network Embedding for Drug Analysis and Prediction. Curr Protein Pept Sci 2020;22:CPPS-EPUB-107859. [PMID: 32614745 DOI: 10.2174/1389203721666200702145701] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/01/2020] [Revised: 04/05/2020] [Accepted: 05/21/2020] [Indexed: 11/22/2022]

Xie R, Li J, Wang J, Dai W, Leier A, Marquez-Lago TT, Akutsu T, Lithgow T, Song J, Zhang Y. DeepVF: a deep learning-based hybrid framework for identifying virulence factors using the stacking strategy. Brief Bioinform 2020;22:5864586. [PMID: 32599617 DOI: 10.1093/bib/bbaa125] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 10/18/2019] [Revised: 05/22/2020] [Accepted: 05/22/2020] [Indexed: 12/14/2022] Open

Abstract

Virulence factors (VFs) enable pathogens to infect their hosts. A wealth of individual, disease-focused studies has identified a wide variety of VFs, and the growing mass of bacterial genome sequence data provides an opportunity for computational methods aimed at predicting VFs. Despite their attractive advantages and performance improvements, the existing methods have some limitations and drawbacks. Firstly, as the characteristics and mechanisms of VFs are continually evolving with the emergence of antibiotic resistance, it is more and more difficult to identify novel VFs using existing tools that were developed on outdated datasets; secondly, few systematic feature engineering efforts have been made to examine the utility of different types of features for model performance, as the majority of tools focused on extracting only a few types of features. By addressing the aforementioned issues, the accuracy of VF predictors can likely be significantly improved. This, in turn, would be particularly useful in the context of genome-wide predictions of VFs. In this work, we present a deep learning (DL)-based hybrid framework (termed DeepVF) that utilizes the stacking strategy to achieve more accurate identification of VFs. Using an enlarged, up-to-date dataset, DeepVF comprehensively explores a wide range of heterogeneous features with popular machine learning algorithms. Specifically, four classical algorithms, including random forest, support vector machines, extreme gradient boosting and multilayer perceptron, and three DL algorithms, including convolutional neural networks, long short-term memory networks and deep neural networks, are employed to train 62 baseline models using these features. In order to integrate their individual strengths, DeepVF effectively combines these baseline models to construct the final meta model using the stacking strategy. Extensive benchmarking experiments demonstrate the effectiveness of DeepVF: it achieves a more accurate and stable performance compared with baseline models on the benchmark dataset and clearly outperforms state-of-the-art VF predictors on the independent test. Using the proposed hybrid ensemble model, a user-friendly online predictor of DeepVF (http://deepvf.erc.monash.edu/) is implemented. Furthermore, its utility, from the user's viewpoint, is compared with that of existing toolkits. We believe that DeepVF will be exploited as a useful tool for screening and identifying potential VFs from protein-coding gene sequences in bacterial genomes.

Xie W, Luo J, Pan C, Liu Y. SG-LSTM-FRAME: a computational frame using sequence and geometrical information via LSTM to predict miRNA-gene associations. Brief Bioinform 2020;22:2032-2042. [PMID: 32181478 DOI: 10.1093/bib/bbaa022] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/29/2019] [Revised: 02/10/2020] [Accepted: 02/11/2020] [Indexed: 12/19/2022] Open

Grisoni F, Moret M, Lingwood R, Schneider G. Bidirectional Molecule Generation with Recurrent Neural Networks. J Chem Inf Model 2020;60:1175-1183. [PMID: 31904964 DOI: 10.1021/acs.jcim.9b00943] [Citation(s) in RCA: 104] [Impact Index Per Article: 20.8] [Indexed: 02/08/2023]

Le NQK, Yapp EKY, Nagasundaram N, Yeh HY. Classifying Promoters by Interpreting the Hidden Information of DNA Sequences via Deep Learning and Combination of Continuous FastText N-Grams. Front Bioeng Biotechnol 2019;7:305. [PMID: 31750297 PMCID: PMC6848157 DOI: 10.3389/fbioe.2019.00305] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Received: 08/13/2019] [Accepted: 10/17/2019] [Indexed: 01/16/2023] Open

Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci Rep 2019;9:3577. [PMID: 30837494 PMCID: PMC6401088 DOI: 10.1038/s41598-019-38746-w] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 07/27/2018] [Accepted: 12/19/2018] [Indexed: 12/28/2022] Open

Abstract

In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning task in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. DiMotif, in general, obtained high recall scores, while having a comparable F1 score with other methods in the discovery of experimentally verified motifs. Having high recall suggests that DiMotif can be used for short-list creation for further experimental investigations on motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acid k-mer features.
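The byte-pair encoding idea that PPE builds on can be sketched on amino-acid strings as follows. This is a minimal, deterministic illustration only: it learns merge operations from a toy set of sequences and applies them to an unseen sequence, and it deliberately omits the sampling framework that the abstract describes as the paper's modification. The function names, the tie-breaking rule, and the example sequences are assumptions, not taken from the paper.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of BPE-style merges from amino-acid sequences."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties are broken lexicographically
        # (an assumption made here only to keep the sketch deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens
```

Under these assumptions, merges learned from `["MKVLA", "MKVLG", "AMKV"]` with two merge steps first join `K` and `V`, then `M` and `KV`, so the unseen sequence `"MKVLW"` is segmented into `MKV`, `L`, `W`.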

Le NQK, Yapp EKY, Ho QT, Nagasundaram N, Ou YY, Yeh HY. iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding. Anal Biochem 2019;571:53-61. [PMID: 30822398 DOI: 10.1016/j.ab.2019.02.017] [Citation(s) in RCA: 88] [Impact Index Per Article: 14.7] [Received: 01/22/2019] [Revised: 02/17/2019] [Accepted: 02/19/2019] [Indexed: 12/22/2022]