1
Hu Y, Wang Y, Hu X, Chao H, Li S, Ni Q, Zhu Y, Hu Y, Zhao Z, Chen M. T4SEpp: A pipeline integrating protein language models to predict bacterial type IV secreted effectors. Comput Struct Biotechnol J 2024; 23:801-812. [PMID: 38328004 PMCID: PMC10847861 DOI: 10.1016/j.csbj.2024.01.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 01/20/2024] [Accepted: 01/20/2024] [Indexed: 02/09/2024] Open
Abstract
Many pathogenic bacteria use type IV secretion systems (T4SSs) to deliver effectors (T4SEs) into the cytoplasm of eukaryotic cells, causing diseases. The identification of effectors is a crucial step in understanding the mechanisms of bacterial pathogenicity, but this remains a major challenge. In this study, we used the full-length embedding features generated by six pre-trained protein language models to train classifiers predicting T4SEs and compared their performance. We integrated three modules into a model called T4SEpp. The first module searched for full-length homologs of known T4SEs, signal sequences, and effector domains; the second module fine-tuned a machine learning model using data for a signal sequence feature; and the third module used the three best-performing pre-trained protein language models. T4SEpp outperformed other state-of-the-art (SOTA) software tools, achieving ∼0.98 accuracy at a high specificity of ∼0.99, based on the assessment of an independent validation dataset. T4SEpp predicted 13 T4SEs from Helicobacter pylori, including the well-known CagA and 12 other potential ones, among which eleven could potentially interact with human proteins. This suggests that these potential T4SEs may be associated with the pathogenicity of H. pylori. Overall, T4SEpp provides a better solution to assist in the identification of bacterial T4SEs and facilitates studies of bacterial pathogenicity. T4SEpp is freely accessible at https://bis.zju.edu.cn/T4SEpp.
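The accuracy and specificity figures quoted in the abstract follow the standard confusion-matrix definitions. The sketch below only illustrates those definitions; the counts are hypothetical and are not taken from the T4SEpp validation dataset.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all proteins that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true non-effectors that were correctly rejected."""
    return tn / (tn + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true effectors that were correctly recovered."""
    return tp / (tp + fn)


# Hypothetical counts for illustration only (20 effectors, 100 non-effectors).
tp, fn, tn, fp = 19, 1, 99, 1
print(round(accuracy(tp, tn, fp, fn), 3), specificity(tn, fp), sensitivity(tp, fn))
```

Reporting accuracy at a fixed high specificity, as the abstract does, keeps the false-positive rate comparable between tools when non-effectors greatly outnumber effectors.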
Affiliation(s)
- Yueming Hu
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
- Yejun Wang
  - Youth Innovation Team of Medical Bioinformatics, Shenzhen University Medical School, Shenzhen, China
  - Department of Cell Biology and Genetics, College of Basic Medicine, Shenzhen University Medical School, Shenzhen, China
- Xiaotian Hu
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
- Haoyu Chao
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
- Sida Li
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
- Qinyang Ni
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
- Yanyan Zhu
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
- Yixue Hu
  - Youth Innovation Team of Medical Bioinformatics, Shenzhen University Medical School, Shenzhen, China
- Ziyi Zhao
  - Youth Innovation Team of Medical Bioinformatics, Shenzhen University Medical School, Shenzhen, China
- Ming Chen
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
  - Institute of Hematology, Zhejiang University School of Medicine, The First Affiliated Hospital, Zhejiang University, Hangzhou 310058, China
2
Li Y, Wu X, Yang P, Jiang G, Luo Y. Machine Learning for Lung Cancer Diagnosis, Treatment, and Prognosis. Genomics, Proteomics & Bioinformatics 2022; 20:850-866. [PMID: 36462630 PMCID: PMC10025752 DOI: 10.1016/j.gpb.2022.11.003] [Citation(s) in RCA: 72] [Impact Index Per Article: 24.0] [Received: 03/04/2022] [Revised: 10/03/2022] [Accepted: 11/17/2022] [Indexed: 12/03/2022]
Abstract
The recent development of imaging and sequencing technologies enables systematic advances in the clinical study of lung cancer. Meanwhile, the human mind is limited in effectively handling and fully utilizing the accumulation of such enormous amounts of data. Machine learning-based approaches play a critical role in integrating and analyzing these large and complex datasets, which have extensively characterized lung cancer through the use of different perspectives from these accrued data. In this review, we provide an overview of machine learning-based approaches that strengthen the varying aspects of lung cancer diagnosis and therapy, including early detection, auxiliary diagnosis, prognosis prediction, and immunotherapy practice. Moreover, we highlight the challenges and opportunities for future applications of machine learning in lung cancer.
Affiliation(s)
- Yawei Li
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
- Xin Wu
  - Department of Medicine, University of Illinois at Chicago, Chicago, IL 60612, USA
- Ping Yang
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905 / Scottsdale, AZ 85259, USA
- Guoqian Jiang
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55905, USA
- Yuan Luo
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
3
Chen Z, Zhao Z, Hui X, Zhang J, Hu Y, Chen R, Cai X, Hu Y, Wang Y. T1SEstacker: A Tri-Layer Stacking Model Effectively Predicts Bacterial Type 1 Secreted Proteins Based on C-Terminal Non-repeats-in-Toxin-Motif Sequence Features. Front Microbiol 2022; 12:813094. [PMID: 35211101 PMCID: PMC8861453 DOI: 10.3389/fmicb.2021.813094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Accepted: 12/20/2021] [Indexed: 11/21/2022] Open
Abstract
Type 1 secretion systems play important roles in the pathogenicity of Gram-negative bacteria. However, the substrate secretion mechanism remains largely unknown. In this research, we examined the sequence features of repeats-in-toxin (RTX) proteins, a major class of type 1 secreted effectors (T1SEs). We found striking non-RTX-motif amino acid composition patterns at the C termini, most typically exemplified by the enriched “[FLI][VAI]” at the two most C-terminal positions. Machine-learning models, including deep-learning ones, were trained using these sequence-based non-RTX-motif features and further combined into a tri-layer stacking model, T1SEstacker, which predicted the RTX proteins accurately, with a fivefold cross-validated sensitivity of ∼0.89 at a specificity of ∼0.94. Besides substrates with RTX motifs, T1SEstacker can also distinguish non-RTX-motif T1SEs well, further suggesting the potential existence of common secretion signals. T1SEstacker was applied to predict T1SEs from the genomes of representative Salmonella strains, and we found that both the number and composition of T1SEs varied among strains. The number of T1SEs is estimated to reach 100 or more in each strain, much larger than we expected. In summary, we performed a comprehensive sequence analysis of the type 1 secreted RTX proteins, identified common sequence-based features at the C termini, and developed a stacking model that can predict type 1 secreted proteins accurately.
Affiliation(s)
- Zewei Chen
  - Youth Innovation Team of Medical Bioinformatics, Shenzhen University Health Science Center, Shenzhen, China
- Ziyi Zhao
  - Youth Innovation Team of Medical Bioinformatics, Shenzhen University Health Science Center, Shenzhen, China
- Xinjie Hui
  - Department of Respiratory Medicine, Xuanwu Hospital, Capital Medical University, Beijing, China
- Junya Zhang
  - Youth Innovation Team of Medical Bioinformatics, Shenzhen University Health Science Center, Shenzhen, China
- Yixue Hu
  - Youth Innovation Team of Medical Bioinformatics, Shenzhen University Health Science Center, Shenzhen, China
- Runhong Chen
  - Youth Innovation Team of Medical Bioinformatics, Shenzhen University Health Science Center, Shenzhen, China
- Xuxia Cai
  - Youth Innovation Team of Medical Bioinformatics, Shenzhen University Health Science Center, Shenzhen, China
- Yueming Hu
  - Youth Innovation Team of Medical Bioinformatics, Shenzhen University Health Science Center, Shenzhen, China
- Yejun Wang
  - Youth Innovation Team of Medical Bioinformatics, Shenzhen University Health Science Center, Shenzhen, China
4
Liu X, Hui X, Kang H, Fang Q, Chen A, Hu Y, Lu D, Chen X, Wang Y. A Multi-Gene Model Effectively Predicts the Overall Prognosis of Stomach Adenocarcinomas With Large Genetic Heterogeneity Using Somatic Mutation Features. Front Genet 2020; 11:940. [PMID: 33005171 PMCID: PMC7479248 DOI: 10.3389/fgene.2020.00940] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/31/2020] [Accepted: 07/28/2020] [Indexed: 12/24/2022] Open
Abstract
Background: Stomach adenocarcinoma (STAD) is one of the most common malignancies worldwide and has a poor prognosis. It remains unclear whether the prognosis is associated with somatic gene mutations. Methods: In this research, we collected two independent STAD cohorts with both genetic profiling and clinical follow-up data, systematically investigated the association between the prognosis and somatic mutations, and analyzed the influence of heterogeneity on the prognosis-genetics association. Results: A typical association was identified between somatic mutations and overall prognosis for individual cohorts. In The Cancer Genome Atlas (TCGA) cohort, a list of 24 genes was also identified that tended to mutate within cases of the poorest prognosis. The association showed apparent heterogeneity between different cohorts, although common signatures could be identified. A machine-learning model was trained with 20 common genes that showed a similar mutation rate difference between prognostic groups in the two cohorts, and it classified the cases in each cohort into two groups with significantly different prognosis. The model significantly outperformed both single-gene models and the TNM-based staging system. Conclusion: The study systematically analyzed the association between STAD prognosis and somatic mutations, identified signature genes that showed mutation preference in different prognostic groups, and developed a multi-gene model that can effectively predict the overall prognosis of STAD in different cohorts.
Affiliation(s)
- Xianming Liu
  - Department of Gastrointestinal Surgery, Shenzhen People's Hospital, The Second Clinical Medical College of Jinan University, Shenzhen, China
- Xinjie Hui
  - School of Basic Medicine, Shenzhen University Health Science Center, Shenzhen, China
- Huayu Kang
  - School of Basic Medicine, Shenzhen University Health Science Center, Shenzhen, China
- Qiongfang Fang
  - School of Basic Medicine, Shenzhen University Health Science Center, Shenzhen, China
- Aiyue Chen
  - School of Basic Medicine, Shenzhen University Health Science Center, Shenzhen, China
- Yueming Hu
  - School of Basic Medicine, Shenzhen University Health Science Center, Shenzhen, China
- Desheng Lu
  - School of Basic Medicine, Shenzhen University Health Science Center, Shenzhen, China
- Xianxiong Chen
  - School of Basic Medicine, Shenzhen University Health Science Center, Shenzhen, China
- Yejun Wang
  - School of Basic Medicine, Shenzhen University Health Science Center, Shenzhen, China
5
Abstract
Many Gram-negative bacteria infect hosts and cause diseases by translocating a variety of type III secreted effectors (T3SEs) into the host cell cytoplasm. However, despite a dramatic increase in the number of available whole-genome sequences, accurate prediction of T3SEs remains challenging. Traditional prediction models have focused on atypical sequence features buried in the N-terminal peptides of T3SEs, but unfortunately, these models have had high false-positive rates. In this research, we integrated promoter information along with characteristic protein features for signal regions, chaperone-binding domains, and effector domains for T3SE prediction. Machine learning algorithms, including deep learning, were adopted to predict the atypical features mainly buried in signal sequences of T3SEs, followed by development of a voting-based ensemble model integrating the individual prediction results. We assembled this into a unified T3SE prediction pipeline, T3SEpp, which integrated the results of individual modules, resulting in high accuracy (i.e., ∼0.94) and >1-fold reduction in the false-positive rate compared to that of state-of-the-art software tools. The T3SEpp pipeline and sequence features observed here will facilitate the accurate identification of new T3SEs, with numerous benefits for future studies on host-pathogen interactions.
IMPORTANCE Type III secreted effector (T3SE) prediction remains a big computational challenge. In practical applications, current software tools often suffer from high false-positive rates. One of the causal factors could be the relatively unitary type of biological features used for the design and training of the models. In this research, we conducted a comprehensive survey of the sequence-based features of T3SEs, including signal sequences, chaperone-binding domains, effector domains, and transcription factor binding promoter sites, and assembled a unified prediction pipeline integrating multi-aspect biological features within homology-based and multiple machine learning models. To our knowledge, we have compiled the most comprehensive biological sequence feature analysis for T3SEs in this research. The T3SEpp pipeline integrating the variety of features and assembling different models showed high accuracy, which should facilitate more accurate identification of T3SEs in new and existing bacterial whole-genome sequences.
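The idea of a voting-based ensemble over individual module results can be pictured with a minimal sketch. This is not the T3SEpp implementation: the module names, the equal weighting, and the 0.5 threshold are assumptions chosen purely for illustration.

```python
from typing import Dict


def majority_vote(module_calls: Dict[str, bool], threshold: float = 0.5) -> bool:
    """Return True when more than `threshold` of the modules call an effector.

    `module_calls` maps a (hypothetical) module name to its binary prediction.
    """
    if not module_calls:
        raise ValueError("at least one module prediction is required")
    positive = sum(1 for call in module_calls.values() if call)
    return positive / len(module_calls) > threshold


# Hypothetical per-module calls for one candidate protein.
calls = {"signal_model": True, "chaperone_domain": False, "effector_domain": True}
print(majority_vote(calls))
```

Requiring agreement between modules built on different feature types is one plausible way to lower false positives, since a spurious hit from a single module is outvoted by the others.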
6
Yu J, Hu Y, Xu Y, Wang J, Kuang J, Zhang W, Shao J, Guo D, Wang Y. LUADpp: an effective prediction model on prognosis of lung adenocarcinomas based on somatic mutational features. BMC Cancer 2019; 19:263. [PMID: 30902072 PMCID: PMC6431052 DOI: 10.1186/s12885-019-5433-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 10/16/2018] [Accepted: 03/03/2019] [Indexed: 02/08/2023] Open
Abstract
Background: Lung adenocarcinoma is the most common type of lung cancer. Whole-genome sequencing studies have disclosed the genomic landscape of lung adenocarcinomas. However, it remains unclear whether the genetic alterations could guide prognosis prediction. Effective genetic markers and prediction models based on them are also lacking for prognosis evaluation. Methods: We obtained the somatic mutation data and clinical data for 371 lung adenocarcinoma cases from The Cancer Genome Atlas. The cases were classified into two prognostic groups (3-year survival), and a comparison was performed between the groups for the somatic mutation frequencies of genes, followed by development of computational models to discriminate between the prognostic groups. Results: Genes were found with higher mutation rates in the good (≥ 3-year survival) than in the poor (< 3-year survival) prognosis group of lung adenocarcinoma patients. Genes participating in cell-cell adhesion and motility were significantly enriched in the top gene list with mutation rate difference between the good and poor prognosis groups. Support Vector Machine models with the gene somatic mutation features could predict prognosis well, and the performance improved as feature size increased. An 85-gene model reached an average cross-validated accuracy of 81% and an Area Under the Curve (AUC) of 0.896 for the Receiver Operating Characteristic (ROC) curves. The model also exhibited good inter-stage prognosis prediction performance, with an average AUC of 0.846 for the ROC curves. Conclusion: The prognosis of lung adenocarcinomas is related to somatic gene mutations. The genetic markers could be used for prognosis prediction and could furthermore provide guidance for personalized medicine.
Affiliation(s)
- Jiaxian Yu
  - Department of Cell Biology and Genetics, Shenzhen University Health Science Center, Shenzhen, 518060, China
- Yueming Hu
  - Department of Cell Biology and Genetics, Shenzhen University Health Science Center, Shenzhen, 518060, China
- Yafei Xu
  - Department of Cell Biology and Genetics, Shenzhen University Health Science Center, Shenzhen, 518060, China
- Jue Wang
  - State Key Laboratory of Agrobiotechnology and School of Life Science, The Chinese University of Hong Kong, Shatin, Hong Kong, China
- Jiajie Kuang
  - Department of Cell Biology and Genetics, Shenzhen University Health Science Center, Shenzhen, 518060, China
- Wei Zhang
  - Shenzhen GenRead Technology Co., Ltd., Shenzhen, 518000, China
- Jianlin Shao
  - Zhejiang Hospital, 12 Lingyin Rd, Hangzhou, 310003, China
- Dianjing Guo
  - State Key Laboratory of Agrobiotechnology and School of Life Science, The Chinese University of Hong Kong, Shatin, Hong Kong, China
- Yejun Wang
  - Department of Cell Biology and Genetics, Shenzhen University Health Science Center, Shenzhen, 518060, China
7
Association of specific gene mutations derived from machine learning with survival in lung adenocarcinoma. PLoS One 2018; 13:e0207204. [PMID: 30419062 PMCID: PMC6231670 DOI: 10.1371/journal.pone.0207204] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 08/16/2018] [Accepted: 10/27/2018] [Indexed: 12/20/2022] Open
Abstract
Lung cancer is the second most common cancer in the United States and the leading cause of mortality in cancer patients. Biomarkers predicting survival of patients with lung cancer have a profound effect on patient prognosis and treatment. However, predictive biomarkers for survival and their relevance for lung cancer are not yet well known. The objective of this study was to perform machine learning with data from The Cancer Genome Atlas of patients with lung adenocarcinoma (LUAD) to find survival-specific gene mutations that could be used as survival-predicting biomarkers. To identify survival-specific mutations according to various clinical factors, four feature selection methods (information gain, chi-squared test, minimum redundancy maximum relevance, and correlation) were used. Extracted survival-specific mutations of LUAD were applied individually or as a group for Kaplan-Meier survival analysis. Mutations in MMRN2 and GMPPA were significantly associated with patient mortality while those in ZNF560 and SETX were associated with patient survival. Mutations in DNAJC2 and MMRN2 showed significant negative association with overall survival while mutations in ZNF560 showed significant positive association with overall survival. Mutations in MMRN2 showed significant negative association with disease-free survival while mutations in DRD3 and ZNF560 showed positive association with disease-free survival. Mutations in DRD3, SETX, and ZNF560 showed significant positive association with survival in patients with LUAD while the opposite was true for mutations in DNAJC2, GMPPA, and MMRN2. These gene mutations were also found in other cohorts of LUAD, lung squamous cell carcinoma, and small cell lung cancer. In LUAD of the Pan-Lung Cancer cohort, mutations in GMPPA, DNAJC2, and MMRN2 showed significant negative associations with survival of patients while mutations in DRD3 and SETX showed significant positive association with survival.
In this study, machine learning was conducted to obtain information necessary to discover specific gene mutations associated with the survival of patients with LUAD. Mutations in the above six genes could predict survival rate and disease-free survival rate in patients with LUAD. Thus, they are important biomarker candidates for prognosis.
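The Kaplan-Meier analysis mentioned in the abstract estimates a survival curve from follow-up times with censoring. The sketch below is a generic product-limit estimator, not the authors' code; the follow-up times and event flags are hypothetical.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct death time.

    `events[i]` is 1 if the patient died at `times[i]` and 0 if censored.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    survival = 1.0
    at_risk = len(times)
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up (months) for a small group of mutation carriers.
carrier_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(carrier_curve)
```

Computing one such curve for carriers and one for non-carriers of a mutation, then comparing them with a log-rank test, is the usual way to judge whether the mutation is associated with survival.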
8
Combination of Genetic Markers and Age Effectively Facilitates the Identification of People with High Risk of Preeclampsia in the Han Chinese Population. BioMed Research International 2018; 2018:4808046. [PMID: 30112393 PMCID: PMC6077688 DOI: 10.1155/2018/4808046] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 03/14/2018] [Revised: 05/15/2018] [Accepted: 06/11/2018] [Indexed: 01/03/2023]
Abstract
Objective: This study aimed to analyze the possible association between known genetic risks and preeclampsia in a Han Chinese population. Methods: A total of 156 patients with preeclampsia and 286 healthy Han Chinese women were enrolled and genotyped for 27 genetic alleles associated with preeclampsia in different populations. The association between the genotypes of the individual alleles and preeclampsia and the possible interaction among the alleles were analyzed. Finally, logistic models were trained with the genotypes of possible alleles contributing to preeclampsia. Results: Seven alleles were significantly or marginally significantly associated with preeclampsia, which involved six genes (rs4762 in AGT, rs1800896 in IL-10, rs1800629 and rs1799724 in TNFα, rs2070744 in NOS3, rs7412 in APOE, and rs2549782 in ERAP2). A multilocus interaction analysis further disclosed an interaction among seven alleles. A logistic model showing individual or synergetic contribution to preeclampsia could reach ~0.67 preeclampsia prediction accuracy in the Han Chinese population, while integration of age information could improve the performance to ~0.75 accuracy using a fivefold training-testing evaluation strategy. Conclusions: The genetic factors were closely associated with preeclampsia in the Han Chinese population despite large ethnicity heterogeneity. The genotypes of different alleles also had synergetic interactions.
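A fivefold training-testing evaluation of the kind mentioned in the abstract can be sketched as follows. This is a generic illustration under stated assumptions: the folds are contiguous and unshuffled, and `train_and_predict` is a hypothetical caller-supplied function standing in for fitting the logistic model; none of this is taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k contiguous folds; return (train, test) pairs."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits


def cross_validated_accuracy(
    labels: Sequence[int],
    train_and_predict: Callable[[List[int], List[int]], List[int]],
    k: int = 5,
) -> float:
    """Average test accuracy over k folds.

    `train_and_predict(train_idx, test_idx)` must fit on the training indices
    and return one predicted label per test index (hypothetical interface).
    """
    accuracies = []
    for train, test in k_fold_indices(len(labels), k):
        predicted = train_and_predict(train, test)
        correct = sum(1 for i, p in zip(test, predicted) if labels[i] == p)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Hypothetical labels (1 = preeclampsia) and a trivial always-positive baseline.
labels = [0, 1] * 5
print(cross_validated_accuracy(labels, lambda train, test: [1] * len(test)))
```

In practice the samples would be shuffled, and ideally stratified by case status, before splitting so that each fold reflects the case-control ratio of the whole cohort.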
9
Zhang S, Xu Y, Hui X, Yang F, Hu Y, Shao J, Liang H, Wang Y. Improvement in prediction of prostate cancer prognosis with somatic mutational signatures. J Cancer 2017; 8:3261-3267. [PMID: 29158798 PMCID: PMC5665042 DOI: 10.7150/jca.21261] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 05/31/2017] [Accepted: 08/29/2017] [Indexed: 01/15/2023] Open
Abstract
Prostate cancer is a leading male malignancy worldwide, yet prognosis prediction remains quite inaccurate. The study aimed to observe whether there was an association between the prognosis of prostate cancer and genetic mutation profile, and to build an accurate prognostic predictor based on the genetic signatures. Patients diagnosed with prostate cancer from The Cancer Genome Atlas were used for prognostic stratification, and the somatic gene mutation profiles were compared between different prognostic groups. The genetic features were further used for training machine-learning models to predict prostate cancer prognosis. No gene with a significant somatic mutation rate difference was found between prognostic groups of prostate cancer. A total of 43 atypical genes were screened for building a support vector machine model to predict prostate cancer prognosis, with an average accuracy of 66% and 64% for 5-fold cross-validation and training-testing evaluation, respectively. When combined with the National Institute for Health and Care Excellence (NICE) features, the model could be further improved, with a 5-fold cross-validation accuracy of ~71%, much better than NICE itself (62%). To our knowledge, this research studied the relationship of genome-wide somatic mutations with prostate cancer prognosis for the first time, and developed an effective prognostic prediction model with the atypical genetic signatures.
Affiliation(s)
- Shengping Zhang
  - Dept. Surgical Urology, The Affiliated Longhua District People's Hospital of Southern Medical University, Shenzhen 518109, China
- Yafei Xu
  - Dept. Cell Biology and Genetics, Shenzhen University Health Science Center, Shenzhen 518060, China
- Xinjie Hui
  - Dept. Cell Biology and Genetics, Shenzhen University Health Science Center, Shenzhen 518060, China
- Fei Yang
  - Dept. Surgical Urology, The third affiliated hospital, Sun Yat-Sen University, Guangzhou 510630, China
- Yueming Hu
  - Dept. Cell Biology and Genetics, Shenzhen University Health Science Center, Shenzhen 518060, China
- Jianlin Shao
  - First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou 310003, China
- Hui Liang
  - Dept. Surgical Urology, The Affiliated Longhua District People's Hospital of Southern Medical University, Shenzhen 518109, China
- Yejun Wang
  - Dept. Cell Biology and Genetics, Shenzhen University Health Science Center, Shenzhen 518060, China