1. Zhang Y, Guan J, Li C, Wang Z, Deng Z, Gasser RB, Song J, Ou HY. DeepSecE: A Deep-Learning-Based Framework for Multiclass Prediction of Secreted Proteins in Gram-Negative Bacteria. RESEARCH (WASHINGTON, D.C.) 2023; 6:0258. [PMID: 37886621 PMCID: PMC10599158 DOI: 10.34133/research.0258] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/03/2023] [Accepted: 10/08/2023] [Indexed: 10/28/2023]
Abstract
Proteins secreted by Gram-negative bacteria are tightly linked to the virulence and adaptability of these microbes to environmental changes. Accurate identification of such secreted proteins can facilitate the investigation of infections and diseases caused by these bacterial pathogens. However, current bioinformatic methods for predicting bacterial secreted substrate proteins have limited computational efficiency and application scope on a genome-wide scale. Here, we propose a novel deep-learning-based framework, DeepSecE, for the simultaneous inference of multiple distinct groups of secreted proteins produced by Gram-negative bacteria. DeepSecE remarkably improves their classification from nonsecreted proteins using a pretrained protein language model and transformer, achieving a macro-average accuracy of 0.883 on 5-fold cross-validation. Performance benchmarking suggests that DeepSecE achieves competitive performance with the state-of-the-art binary predictors specialized for individual types of secreted substrates. The attention mechanism corroborates salient patterns and motifs at the N or C termini of the protein sequences. Using this pipeline, we further investigate the genome-wide prediction of novel secreted proteins and their taxonomic distribution across ~1,000 Gram-negative bacterial genomes. The present analysis demonstrates that DeepSecE has major potential for the discovery of disease-associated secreted proteins in a diverse range of Gram-negative bacteria. An online web server of DeepSecE is also publicly available to predict and explore various secreted substrate proteins via the input of bacterial genome sequences.
Affiliation(s)
- Yumeng Zhang
- State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology,
Shanghai Jiao Tong University, Shanghai 200240, China
- Shanghai Key Laboratory of Veterinary Biotechnology,
Shanghai Jiao Tong University, Shanghai 200240, China
- Jiahao Guan
- State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology,
Shanghai Jiao Tong University, Shanghai 200240, China
- Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology,
Monash University, Melbourne, VIC 3800, Australia
- Zhikang Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology,
Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute,
Monash University, Melbourne, VIC 3800, Australia
- Zixin Deng
- State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology,
Shanghai Jiao Tong University, Shanghai 200240, China
- Robin B. Gasser
- Melbourne Veterinary School, Faculty of Science,
The University of Melbourne, Parkville, VIC 3010, Australia
- Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology,
Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute,
Monash University, Melbourne, VIC 3800, Australia
- Melbourne Veterinary School, Faculty of Science,
The University of Melbourne, Parkville, VIC 3010, Australia
- Hong-Yu Ou
- State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology,
Shanghai Jiao Tong University, Shanghai 200240, China
- Shanghai Key Laboratory of Veterinary Biotechnology,
Shanghai Jiao Tong University, Shanghai 200240, China
2. Todd JNA, Carreón-Anguiano KG, Islas-Flores I, Canto-Canché B. Microbial Effectors: Key Determinants in Plant Health and Disease. Microorganisms 2022; 10:1980. [PMID: 36296254 PMCID: PMC9610748 DOI: 10.3390/microorganisms10101980] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2022] [Revised: 09/27/2022] [Accepted: 09/29/2022] [Indexed: 11/17/2022]
Abstract
Effectors are small, secreted molecules that alter host cell structure and function, thereby facilitating infection or triggering a defense response. Effectoromics studies have focused on effectors in plant-pathogen interactions, where their contributions to virulence are determined in the plant host, i.e., whether the effector induces resistance or susceptibility to plant disease. Effector molecules from plant pathogenic microorganisms such as fungi, oomycetes and bacteria are major disease determinants. Interestingly, the effectors of non-pathogenic plant organisms such as endophytes display similar functions but have different outcomes for plant health. Endophyte effectors commonly aid in the establishment of mutualistic interactions with the plant and contribute to plant health through the induction of systemic resistance against pathogens, while pathogenic effectors mainly debilitate the plant's immune response, resulting in the establishment of disease. Effectors of plant pathogens as well as plant endophytes are tools to be considered in effectoromics for the development of novel strategies for disease management. This review aims to present effectors in their roles as promotors of health or disease for the plant host.
Affiliation(s)
- Jewel Nicole Anna Todd
- Unidad de Biotecnología, Centro de Investigación Científica de Yucatán, A.C., Calle 43 No. 130 x 32 y 34, Colonia Chuburná de Hidalgo, Mérida C.P. 97205, Yucatán, Mexico
- Karla Gisel Carreón-Anguiano
- Unidad de Biotecnología, Centro de Investigación Científica de Yucatán, A.C., Calle 43 No. 130 x 32 y 34, Colonia Chuburná de Hidalgo, Mérida C.P. 97205, Yucatán, Mexico
- Ignacio Islas-Flores
- Unidad de Bioquímica y Biología Molecular de Plantas, Centro de Investigación Científica de Yucatán, A.C., Calle 43 No. 130 x 32 y 34, Colonia Chuburná de Hidalgo, Mérida C.P. 97205, Yucatán, Mexico
- Blondy Canto-Canché
- Unidad de Biotecnología, Centro de Investigación Científica de Yucatán, A.C., Calle 43 No. 130 x 32 y 34, Colonia Chuburná de Hidalgo, Mérida C.P. 97205, Yucatán, Mexico
3. Protein-Specific Prediction of RNA-Binding Sites Based on Information Entropy. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:8626628. [PMID: 36225547 PMCID: PMC9550406 DOI: 10.1155/2022/8626628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2022] [Revised: 09/15/2022] [Accepted: 09/20/2022] [Indexed: 11/25/2022]
Abstract
Understanding the protein-RNA interaction mechanism can help us to further explore various biological processes. Experimental techniques still have some limitations, such as their high cost in money and time. Predicting protein-RNA-binding sites by using computational methods is an excellent research tool. Here, we developed a universal method for predicting protein-specific RNA-binding sites, so one general model for a given protein was constructed on a fixed dataset by fusing the data of different experimental techniques. At the same time, information theory was employed to characterize the sequence conservation of RNA-binding segments. Conservation difference profiles between binding and nonbinding segments were constructed by information entropy (IE), and these indicate a significant difference. Finally, 19 protein-specific models based on random forest (RF) were built on IE encoding. The performance on the independent datasets demonstrates that our method can obtain competitive results when compared with the current best prediction model.
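To illustrate the idea of an entropy-based conservation profile mentioned in this abstract, here is a minimal sketch. It is not the authors' encoding: it assumes equal-length segments, uses plain per-position Shannon entropy in bits, and the function names `column_entropy` and `entropy_difference` are hypothetical.

```python
import math
from collections import Counter


def column_entropy(segments):
    """Shannon entropy (in bits) at each position of equal-length segments.

    Low entropy means the position is conserved across the segments.
    """
    if not segments:
        return []
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("all segments must have the same length")
    entropies = []
    for i in range(length):
        counts = Counter(s[i] for s in segments)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


def entropy_difference(binding, nonbinding):
    """Per-position entropy of nonbinding minus binding segments.

    Assumption for illustration: both groups share the same segment length.
    Positive values mark positions that are more conserved in binding segments.
    """
    hb = column_entropy(binding)
    hn = column_entropy(nonbinding)
    return [n - b for b, n in zip(hb, hn)]
```

For example, `entropy_difference(["AC", "AG"], ["AC", "UG"])` compares two toy groups of two-nucleotide segments; the first position is fully conserved only in the binding group.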
4. Zhang Y, Zhang Y, Xiong Y, Wang H, Deng Z, Song J, Ou HY. T4SEfinder: a bioinformatics tool for genome-scale prediction of bacterial type IV secreted effectors using pre-trained protein language model. Brief Bioinform 2021; 23:6397152. [PMID: 34657153 DOI: 10.1093/bib/bbab420] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 07/14/2021] [Revised: 08/31/2021] [Accepted: 09/13/2021] [Indexed: 11/12/2022]
Abstract
Bacterial type IV secretion systems (T4SSs) are versatile and membrane-spanning apparatuses, which mediate both genetic exchange and delivery of effector proteins to target eukaryotic cells. The secreted effectors (T4SEs) can affect gene expression and signal transduction of the host cells. As such, they often function as virulence factors and play an important role in bacterial pathogenesis. Nowadays, T4SE prediction tools have utilized various machine learning algorithms, but the accuracy and speed of these tools remain to be improved. In this study, we apply a sequence embedding strategy from a pre-trained language model of protein sequences (TAPE) to the classification task of T4SEs. The training dataset is mainly derived from our updated type IV secretion system database SecReT4 with newly experimentally verified T4SEs. An online web server termed T4SEfinder is developed using TAPE and a multi-layer perceptron (MLP) for T4SE prediction after a comprehensive performance comparison with several candidate models, which achieves a slightly higher level of accuracy than the existing prediction tools. It only takes about 3 minutes to make a classification for 5000 protein sequences by T4SEfinder so that the computational speed is qualified for whole genome-scale T4SEs detection in pathogenic bacteria. T4SEfinder might contribute to meet the increasing demands of re-annotating secretion systems and effector proteins in sequenced bacterial genomes. T4SEfinder is freely accessible at https://tool2-mml.sjtu.edu.cn/T4SEfinder_TAPE/.
Affiliation(s)
- Yumeng Zhang
- State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences & Biotechnology, Shanghai Jiao Tong University, Shanghai 200030, China
- Yangming Zhang
- State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences & Biotechnology, Shanghai Jiao Tong University, Shanghai 200030, China
- Yi Xiong
- State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences & Biotechnology, Shanghai Jiao Tong University, Shanghai 200030, China
- Hui Wang
- State Key Laboratory of Pathogens and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing 100071, China
- Zixin Deng
- State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences & Biotechnology, Shanghai Jiao Tong University, Shanghai 200030, China
- Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Hong-Yu Ou
- State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences & Biotechnology, Shanghai Jiao Tong University, Shanghai 200030, China; Shanghai Key Laboratory of Veterinary Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
5. Han H, Ding C, Cheng X, Sang X, Liu T. iT4SE-EP: Accurate Identification of Bacterial Type IV Secreted Effectors by Exploring Evolutionary Features from Two PSI-BLAST Profiles. Molecules 2021; 26:2487. [PMID: 33923273 PMCID: PMC8123216 DOI: 10.3390/molecules26092487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2021] [Revised: 04/16/2021] [Accepted: 04/20/2021] [Indexed: 11/16/2022]
Abstract
Many gram-negative bacteria use type IV secretion systems to deliver effector molecules to a wide range of target cells. These substrate proteins, which are called type IV secreted effectors (T4SEs), manipulate host cell processes during infection, often resulting in severe diseases or even death of the host. Therefore, identification of putative T4SEs has become a very active research topic in bioinformatics due to its vital role in understanding host-pathogen interactions. PSI-BLAST profiles have been experimentally validated to provide important and discriminatory evolutionary information for various protein classification tasks. In the present study, an accurate computational predictor termed iT4SE-EP was developed for identifying T4SEs by extracting evolutionary features from the position-specific scoring matrix and the position-specific frequency matrix profiles. First, four types of encoding strategies were designed to transform protein sequences into fixed-length feature vectors based on the two profiles. Then, the feature selection technique based on the random forest algorithm was utilized to reduce redundant or irrelevant features without much loss of information. Finally, the optimal features were input into a support vector machine classifier to carry out the prediction of T4SEs. Our experimental results demonstrated that iT4SE-EP outperformed most existing methods based on the independent dataset test.
6. Computational prediction of secreted proteins in gram-negative bacteria. Comput Struct Biotechnol J 2021; 19:1806-1828. [PMID: 33897982 PMCID: PMC8047123 DOI: 10.1016/j.csbj.2021.03.019] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/19/2020] [Revised: 03/18/2021] [Accepted: 03/18/2021] [Indexed: 12/29/2022]
Abstract
Gram-negative bacteria harness multiple protein secretion systems and secrete a large proportion of the proteome. Proteins can be exported to the periplasmic space, integrated into the membrane, transported into the extracellular milieu, or translocated into the cytoplasm of contacting cells. Accurate, genome-wide annotation of the secreted proteins and their secretion pathways is therefore important. In this review, we systematically classified the secreted proteins according to the types of secretion systems in Gram-negative bacteria, summarized the known features of these proteins, and reviewed the algorithms and tools for their prediction.
7. Yu L, Liu F, Li Y, Luo J, Jing R. DeepT3_4: A Hybrid Deep Neural Network Model for the Distinction Between Bacterial Type III and IV Secreted Effectors. Front Microbiol 2021; 12:605782. [PMID: 33552038 PMCID: PMC7858263 DOI: 10.3389/fmicb.2021.605782] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/13/2020] [Accepted: 01/04/2021] [Indexed: 01/17/2023]
Abstract
Gram-negative bacteria can deliver secreted proteins (also known as secreted effectors) directly into host cells through type III secretion system (T3SS), type IV secretion system (T4SS), and type VI secretion system (T6SS) and cause various diseases. These secreted effectors are heavily involved in the interactions between bacteria and host cells, so their identification is crucial for the discovery and development of novel anti-bacterial drugs. It is currently challenging to accurately distinguish type III secreted effectors (T3SEs) and type IV secreted effectors (T4SEs) because neither T3SEs nor T4SEs contain N-terminal signal peptides, and some of these effectors have similar evolutionary conserved profiles and sequence motifs. To address this challenge, we develop a deep learning (DL) approach called DeepT3_4 to correctly classify T3SEs and T4SEs. We generate amino-acid character dictionary and sequence-based features extracted from effector proteins and subsequently implement these features into a hybrid model that integrates recurrent neural networks (RNNs) and deep neural networks (DNNs). After training the model, the hybrid neural network classifies secreted effectors into two different classes with an accuracy, F-value, and recall of over 80.0%. Our approach stands for the first DL approach for the classification of T3SEs and T4SEs, providing a promising supplementary tool for further secretome studies.
Affiliation(s)
- Lezheng Yu
- School of Chemistry and Materials Science, Guizhou Education University, Guiyang, China
- Fengjuan Liu
- School of Geography and Resources, Guizhou Education University, Guiyang, China
- Yizhou Li
- College of Cybersecurity, Sichuan University, Chengdu, China
- Jiesi Luo
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China
- Runyu Jing
- College of Cybersecurity, Sichuan University, Chengdu, China
8. Zhang J, Lv L, Lu D, Kong D, Al-Alashaari MAA, Zhao X. Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors. BMC Bioinformatics 2020; 21:480. [PMID: 33109082 PMCID: PMC7590791 DOI: 10.1186/s12859-020-03826-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/21/2020] [Accepted: 10/19/2020] [Indexed: 12/13/2022]
Abstract
Background: Classification of certain proteins with specific functions is momentous for biological research. Encoding approaches of protein sequences for feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used for classification on protein sequences according to various encoding approaches. Commonly, protein sequences keep certain labels corresponding to different categories of biological functions (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. As to protein prediction, a kernel set of protein sequences keeping certain labels certified by biological experiments should be existent in advance. However, it has been hardly ever seen in prevailing researches. Therefore, unsupervised learning rather than supervised learning (e.g. classification) should be considered. As to protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. Besides, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered. Results: Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset containing 1947 protein sequences as a case, experiments are made to identify bacterial type IV secreted effectors (T4SE) from protein sequences, which are composed of 399 T4SE and 1548 non-T4SE. Comparable and quantified results are obtained only using certain components of the encoded feature, i.e., position-specific scoring matrix, and that indicates the effectiveness of our method. Conclusions: Certain variables other than an encoded feature they belong to do work for discrimination between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.
Affiliation(s)
- Jian Zhang
- College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Lixin Lv
- College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Donglei Lu
- College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Denan Kong
- College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China
- Xudong Zhao
- College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China.
9. Chen T, Wang X, Chu Y, Wang Y, Jiang M, Wei DQ, Xiong Y. T4SE-XGB: Interpretable Sequence-Based Prediction of Type IV Secreted Effectors Using eXtreme Gradient Boosting Algorithm. Front Microbiol 2020; 11:580382. [PMID: 33072049 PMCID: PMC7541839 DOI: 10.3389/fmicb.2020.580382] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 07/06/2020] [Accepted: 08/21/2020] [Indexed: 12/19/2022]
Abstract
Type IV secreted effectors (T4SEs) can be translocated into the cytosol of host cells via type IV secretion system (T4SS) and cause diseases. However, experimental approaches to identify T4SEs are time- and resource-consuming, and the existing computational tools based on machine learning techniques have some obvious limitations such as the lack of interpretability in the prediction models. In this study, we proposed a new model, T4SE-XGB, which uses the eXtreme gradient boosting (XGBoost) algorithm for accurate identification of type IV effectors based on optimal features based on protein sequences. After trying 20 different types of features, the best performance was achieved when all features were fed into XGBoost by the 5-fold cross validation in comparison with other machine learning methods. Then, the ReliefF algorithm was adopted to get the optimal feature set on our dataset, which further improved the model performance. T4SE-XGB exhibited highest predictive performance on the independent test set and outperformed other published prediction tools. Furthermore, the SHAP method was used to interpret the contribution of features to model predictions. The identification of key features can contribute to improved understanding of multifactorial contributors to host-pathogen interactions and bacterial pathogenesis. In addition to type IV effector prediction, we believe that the proposed framework can provide instructive guidance for similar studies to construct prediction methods on related biological problems. The data and source code of this study can be freely accessed at https://github.com/CT001002/T4SE-XGB.
Affiliation(s)
- Tianhang Chen
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Xiangeng Wang
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China; Department of Biomedical Sciences, City University of Hong Kong, Hong Kong, China
- Yanyi Chu
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China; Peng Cheng Laboratory, Shenzhen, China
- Yanjing Wang
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Mingming Jiang
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Dong-Qing Wei
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China; Peng Cheng Laboratory, Shenzhen, China
- Yi Xiong
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
10. Hong J, Luo Y, Mou M, Fu J, Zhang Y, Xue W, Xie T, Tao L, Lou Y, Zhu F. Convolutional neural network-based annotation of bacterial type IV secretion system effectors with enhanced accuracy and reduced false discovery. Brief Bioinform 2019; 21:1825-1836. [PMID: 31860715 DOI: 10.1093/bib/bbz120] [Citation(s) in RCA: 93] [Impact Index Per Article: 15.5] [Received: 06/23/2019] [Revised: 08/12/2019] [Accepted: 08/21/2019] [Indexed: 12/20/2022]
Abstract
The type IV bacterial secretion system (SS) is reported to be one of the most ubiquitous SSs in nature and can induce serious conditions by secreting type IV SS effectors (T4SEs) into the host cells. Recent studies mainly focus on annotating new T4SEs from the huge amount of sequencing data, and various computational tools have therefore been developed to accelerate T4SE annotation. However, these tools are reported to be heavily dependent on the selected methods, and their annotation performance needs to be further enhanced. Herein, a convolutional neural network (CNN) technique was used to annotate T4SEs by integrating multiple protein encoding strategies. First, the annotation accuracies of nine encoding strategies integrated with CNN were assessed and compared with those of the popular T4SE annotation tools based on an independent benchmark. Second, false discovery rates of various models were systematically evaluated by (1) scanning the genome of Legionella pneumophila subsp. ATCC 33152 and (2) predicting the real-world non-T4SEs validated using published experiments. Based on the above analyses, the encoding strategies, (a) position-specific scoring matrix (PSSM), (b) protein secondary structure & solvent accessibility (PSSSA) and (c) one-hot encoding scheme (Onehot), were identified as well-performing when integrated with CNN. Finally, a novel strategy that collectively considers the three well-performing models (CNN-PSSM, CNN-PSSSA and CNN-Onehot) was proposed, and a new tool (CNN-T4SE, https://idrblab.org/cnnt4se/) was constructed to facilitate T4SE annotation. All in all, this study conducted a comprehensive analysis on the performance of a collection of encoding strategies when integrated with CNN, which could facilitate the suppression of T4SS in infection and limit the spread of antimicrobial resistance.
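As a rough illustration of the one-hot encoding scheme named in this abstract, the sketch below turns a protein sequence into a fixed-size binary matrix. It is an assumption-laden example, not the CNN-T4SE implementation: the 20-letter alphabet order, the zero padding, the truncation to `max_len`, and the name `one_hot_encode` are all choices made here for illustration.

```python
# The 20 standard amino acids; the column order is an arbitrary choice here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix of 0/1 values for a protein sequence.

    Sequences longer than max_len are truncated; shorter ones are padded
    with all-zero rows. Residues outside the 20-letter alphabet (such as X)
    are also left as all-zero rows.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(sequence[:max_len].upper()):
        col = INDEX.get(aa)
        if col is not None:
            matrix[pos][col] = 1
    return matrix
```

A matrix of this shape is the kind of input a convolutional layer can scan along the sequence axis; how the published tool pads or truncates sequences is not stated in the abstract.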
Affiliation(s)
- Jiajun Hong
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yongchao Luo
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Minjie Mou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Jianbo Fu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yang Zhang
- School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Weiwei Xue
- School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Tian Xie
- Key Laboratory of Elemene Class Anti-Cancer Chinese Medicine of Zhejiang Province, School of Medicine, Hangzhou Normal University, Hangzhou 310036, China
- Lin Tao
- Key Laboratory of Elemene Class Anti-Cancer Chinese Medicine of Zhejiang Province, School of Medicine, Hangzhou Normal University, Hangzhou 310036, China
- Yan Lou
- Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, Hangzhou 310000, Zhejiang, China
- Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
11. Liu Y, Guo Y, Wu W, Xiong Y, Sun C, Yuan L, Li M. A Machine Learning-Based QSAR Model for Benzimidazole Derivatives as Corrosion Inhibitors by Incorporating Comprehensive Feature Selection. Interdiscip Sci 2019; 11:738-747. [PMID: 31486019 DOI: 10.1007/s12539-019-00346-7] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 02/16/2019] [Revised: 07/23/2019] [Accepted: 07/25/2019] [Indexed: 01/28/2023]
Abstract
BACKGROUND Computational prediction of inhibition efficiency (IE) for inhibitor molecules is a crucial supplementary way to design novel molecules that can efficiently inhibit corrosion onto metallic surfaces. PURPOSE Here we are dedicated to developing a new machine learning-based predictor for the inhibition efficiency (IE) of benzimidazole derivatives. METHODS First, a comprehensively numerical representation was given on inhibitor molecules from all aspects of energy, electronic, topological, physicochemical and spatial properties based on 3-D structures and 150 valid structural descriptors were obtained. Then, a thorough investigation of these structural descriptors was implemented. The multicollinearity-based clustering analysis was performed to remove the linear correlated feature variables, so 47 feature clusters were produced. Meanwhile, Gini importance by random forest (RF) was used to further measure the contributions of the descriptors in each cluster and 47 non-linear descriptors were selected with the highest Gini importance score in the corresponding cluster. Further, considering the limited number of available inhibitors, different feature subsets were constructed according to the Gini importance score ranking list of 47 descriptors. RESULTS Finally, support vector machine (SVM) models based on different feature subsets were tested by leave-one-out cross validation. Through comparisons, the optimal SVM model with the top 11 descriptors was achieved based on Poly kernel. This model yields a promising performance with the correlation coefficient (R) and root-mean-square error (RMSE) of 0.9589 and 4.45, respectively, which indicates that the method proposed by us gives the best performance for the current data. CONCLUSION Based on our model, 6 new benzimidazole molecules were designed and their IE values predicted by this model indicate that two of them have high potential as outstanding corrosion inhibitors.
Affiliation(s)
- Youquan Liu
- Research Institute of Natural Gas Technology, Petro China Southwest Oil and Gas Field Company, Chengdu, 610213, China.
- Yanzhi Guo
- College of Chemistry, Sichuan University, Chengdu, Sichuan, 610064, People's Republic of China.
- Wengang Wu
- Research Institute of Natural Gas Technology, Petro China Southwest Oil and Gas Field Company, Chengdu, 610213, China
- Ying Xiong
- Research Institute of Natural Gas Technology, Petro China Southwest Oil and Gas Field Company, Chengdu, 610213, China
- Chuan Sun
- Research Institute of Natural Gas Technology, Petro China Southwest Oil and Gas Field Company, Chengdu, 610213, China
- Li Yuan
- Research Institute of Natural Gas Technology, Petro China Southwest Oil and Gas Field Company, Chengdu, 610213, China
- Menglong Li
- College of Chemistry, Sichuan University, Chengdu, Sichuan, 610064, People's Republic of China
12. Hu W, Qin L, Li M, Pu X, Guo Y. Individually double minimum-distance definition of protein-RNA binding residues and application to structure-based prediction. J Comput Aided Mol Des 2018; 32:1363-1373. [PMID: 30478757 DOI: 10.1007/s10822-018-0177-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 05/29/2018] [Accepted: 11/14/2018] [Indexed: 01/01/2023]
Abstract
Identifying protein-RNA binding residues is essential for understanding the mechanism of protein-RNA interactions. So far, rigid distance thresholds have commonly been used to define protein-RNA binding residues. However, after investigating 182 non-redundant protein-RNA complexes, we find that a rigid threshold is unsuitable for a certain number of complexes because the distances between proteins and RNAs vary widely. In this work, a novel definition method based on a flexible distance cutoff is proposed. This method accounts for the individual differences among complexes by setting a variable tolerance limit for protein-RNA interactions, i.e., the double minimum-distance, by which different distance thresholds are obtained for different complexes. To validate our method, a comprehensive comparison between our flexible method and traditional rigid methods was carried out in terms of interface structure, amino acid composition, interface area, interaction force, etc. The results indicate that this method is more reasonable because it incorporates the specificity of different complexes, recovering important residues missed by rigid distance methods and discarding some redundant residues. Finally, to further test our double minimum-distance definition strategy, we developed a classifier that predicts the binding sites derived from our new method using structural features and a random forest machine learning algorithm. The model achieved satisfactory prediction performance, with an accuracy of 85.0% on independent data sets. To the best of our knowledge, it is the first prediction model to define positive and negative samples using a flexible cutoff. The comparative analysis and modeling results thus demonstrate that our method is a very promising strategy for defining protein-RNA binding sites more precisely.
Affiliation(s)
- Wen Hu
  - College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China
- Liu Qin
  - College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China
- Menglong Li
  - College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China
- Xuemei Pu
  - College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China
- Yanzhi Guo
  - College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China
|
13
|
Xiong Y, Wang Q, Yang J, Zhu X, Wei DQ. PredT4SE-Stack: Prediction of Bacterial Type IV Secreted Effectors From Protein Sequences Using a Stacked Ensemble Method. Front Microbiol 2018; 9:2571. [PMID: 30416498 PMCID: PMC6212463 DOI: 10.3389/fmicb.2018.02571] [Citation(s) in RCA: 74] [Impact Index Per Article: 10.6] [Received: 09/01/2018] [Accepted: 10/09/2018] [Indexed: 11/13/2022] Open
Abstract
Gram-negative bacteria use various secretion systems to deliver their secreted effectors. Among them, the type IV secretion system is widespread across bacterial species and secretes type IV secreted effectors (T4SEs), which play vital roles in host-pathogen interactions. However, experimental approaches to identifying T4SEs are time- and resource-consuming. In the present study, we aim to develop an in silico stacked ensemble method that predicts, from sequence information alone, whether a protein is an effector of the type IV secretion system. Protein sequences were encoded with the position-specific scoring matrix (PSSM)-composition feature, obtained by summing the rows that correspond to the same amino acid residue in the PSSM profile. Based on the PSSM-composition features, we developed a stacked ensemble model, PredT4SE-Stack, to predict T4SEs. It uses an ensemble of base classifiers implemented with various machine learning algorithms, such as support vector machine, gradient boosting machine, and extremely randomized trees, to generate outputs for the meta-classifier in the classification system. Our results demonstrate that the PredT4SE-Stack framework is a feasible and effective way to accurately identify T4SEs from protein sequence information. The datasets and source code of PredT4SE-Stack are freely available at http://xbioinfo.sjtu.edu.cn/PredT4SE_Stack/index.php.
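The PSSM-composition encoding mentioned in this abstract (summing the PSSM rows that belong to the same amino acid) can be sketched as follows. This is an illustrative reading of that one sentence, not the authors' code: the function name, the choice to skip non-standard residues, and the absence of any length normalization are assumptions. The column order is the one used in PSI-BLAST ASCII PSSM output.

```python
# Amino acid order of the 20 score columns in a PSI-BLAST ASCII PSSM.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_composition(sequence, pssm):
    """Return a 400-dimensional feature vector from an L x 20 PSSM.

    sequence: protein sequence of length L (one-letter codes).
    pssm: list of L rows, each holding 20 scores for that position.
    Entry 20*i + j is the sum of column j over all positions whose
    residue is amino acid i.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    sums = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        if len(row) != 20:
            raise ValueError("each PSSM row must contain 20 scores")
        i = index.get(residue)
        if i is None:
            continue  # assumption: ignore non-standard residues such as X
        for j, score in enumerate(row):
            sums[i][j] += score
    # Flatten the 20 x 20 matrix row by row into 400 features.
    return [value for row in sums for value in row]


# Toy example: two alanines and one arginine with made-up scores.
toy_sequence = "AAR"
toy_pssm = [[1] * 20, [2] * 20, [5] * 20]
features = pssm_composition(toy_sequence, toy_pssm)
```

The resulting vector has a fixed length of 400 regardless of protein length, which is what allows sequences of different lengths to be fed to the same base classifiers.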
Affiliation(s)
- Yi Xiong
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Qiankun Wang
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Junchen Yang
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Xiaolei Zhu
  - School of Sciences, Anhui Agricultural University, Hefei, China
- Dong-Qing Wei
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
|
14
|
Liu B, Weng F, Huang DS, Chou KC. iRO-3wPseKNC: identify DNA replication origins by three-window-based PseKNC. Bioinformatics 2018; 34:3086-3093. [DOI: 10.1093/bioinformatics/bty312] [Citation(s) in RCA: 96] [Impact Index Per Article: 13.7] [Received: 03/14/2018] [Accepted: 04/18/2018] [Indexed: 12/16/2022] Open
Affiliation(s)
- Bin Liu
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
  - Gordon Life Science Institute, Belmont, MA, USA
- Fan Weng
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- De-Shuang Huang
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
- Kuo-Chen Chou
  - Gordon Life Science Institute, Belmont, MA, USA
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|