1
Cai J, Zhao J, Bin Y, Xia J, Zheng C. iAmyP: A Multi-view Learning for Amyloidogenic Hexapeptides Identification Based on Sequence Least Squares Programming. Interdiscip Sci 2025; 17:277-292. [PMID: 39546159 DOI: 10.1007/s12539-024-00666-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Revised: 10/07/2024] [Accepted: 10/09/2024] [Indexed: 11/17/2024]
Abstract
The development of peptide drugs is hindered by the risk of amyloidogenic aggregation; peptides that tend to aggregate in this manner may be unsuitable for drug design. Computational methods for predicting amyloidogenic sequences often struggle to extract high-quality features, and their predictive performance can be enhanced. To surmount these challenges, iAmyP was introduced as a specialized computational tool for predicting amyloidogenic hexapeptides. Using multi-view learning, iAmyP incorporated sequence, structural, and evolutionary features, and performed feature selection and feature fusion through recursive feature elimination and attention mechanisms. This combination of features, together with the subsequent feature selection and fusion, led to optimal performance, facilitated by an optimization algorithm based on sequence least squares programming. Notably, iAmyP exhibited robust generalization for peptides with lengths of 7-10 amino acids. Hydrophobic amino acids play a critical role in the aggregation process, and a thorough analysis has significantly enhanced our insight into their significance in amyloidogenic hexapeptides. This tool represents an advance in the development of peptide therapeutics by providing an understanding of amyloidogenic aggregation, establishing itself as a valuable framework for assessing amyloidogenic sequences. The data and code can be freely accessed at https://github.com/xialab-ahu/iAmyP .
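The abstract does not say what the optimizer tunes. One plausible reading is that it fits the weights used to combine scores from the different views. The sketch below illustrates that idea only and is not the authors' implementation: it assumes three hypothetical per-view scores per peptide and swaps in a plain grid search over non-negative weights that sum to 1 in place of the least squares programming optimizer named in the abstract. The function names and toy numbers are invented for illustration.

```python
from itertools import product


def squared_error(weights, view_scores, labels):
    """Sum of squared differences between the weighted score and the 0/1 label."""
    total = 0.0
    for scores, label in zip(view_scores, labels):
        fused = sum(w * s for w, s in zip(weights, scores))
        total += (fused - label) ** 2
    return total


def fit_view_weights(view_scores, labels, steps=20):
    """Grid search over three non-negative weights that sum to 1.

    Stand-in for a constrained optimizer; assumes exactly three views.
    """
    best_weights, best_loss = None, float("inf")
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        weights = (i / steps, j / steps, (steps - i - j) / steps)
        loss = squared_error(weights, view_scores, labels)
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights, best_loss


# Toy scores from three hypothetical views (sequence, structure, evolution).
view_scores = [(0.9, 0.6, 0.7), (0.2, 0.4, 0.1), (0.8, 0.9, 0.6), (0.3, 0.1, 0.4)]
labels = [1, 0, 1, 0]  # 1 = amyloidogenic, 0 = not (toy labels)

weights, loss = fit_view_weights(view_scores, labels)
print("weights:", weights, "loss:", round(loss, 4))
```

Because the grid includes the corner cases where a single view gets all the weight, the fused loss can never be worse than the best single view on the data used for fitting.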
Affiliation(s)
- Jinling Cai
- College of Mathematics and System Science, Xinjiang University, Urumqi, 830046, China
- Jianping Zhao
- College of Mathematics and System Science, Xinjiang University, Urumqi, 830046, China.
- Yannan Bin
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Information Materials and Intelligent Sensing Laboratory of Anhui Province, and School of Artificial Intelligence, Anhui University, Hefei, 230601, China.
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, China.
- Junfeng Xia
- College of Mathematics and System Science, Xinjiang University, Urumqi, 830046, China.
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, China.
- Chunhou Zheng
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Information Materials and Intelligent Sensing Laboratory of Anhui Province, and School of Artificial Intelligence, Anhui University, Hefei, 230601, China.
2
Tahmid MT, Hasan AKMM, Bayzid MS. TransBind allows precise detection of DNA-binding proteins and residues using language models and deep learning. Commun Biol 2025; 8:568. [PMID: 40185915 PMCID: PMC11971327 DOI: 10.1038/s42003-025-07534-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Accepted: 01/13/2025] [Indexed: 04/07/2025] Open
Abstract
Identifying DNA-binding proteins and their binding residues is critical for understanding diverse biological processes, but conventional experimental approaches are slow and costly. Existing machine learning methods, while faster, often lack accuracy and struggle with data imbalance, relying heavily on evolutionary profiles like PSSMs and HMMs derived from multiple sequence alignments (MSAs). These dependencies make them unsuitable for orphan proteins or those that evolve rapidly. To address these challenges, we introduce TransBind, an alignment-free deep learning framework that predicts DNA-binding proteins and residues directly from a single primary sequence, eliminating the need for MSAs. By leveraging features from pre-trained protein language models, TransBind effectively handles the issue of data imbalance and achieves superior performance. Extensive evaluations using diverse experimental datasets and case studies demonstrate that TransBind significantly outperforms state-of-the-art methods in terms of both accuracy and computational efficiency. TransBind is available as a web server at https://trans-bind-web-server-frontend.vercel.app/ .
Affiliation(s)
- Md Toki Tahmid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
- A K M Mehedi Hasan
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
- Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh.
3
Chen R, You Y, Liu Y, Sun X, Ma T, Lao X, Zheng H. Deep-Learning-Based Approaches for Rational Design of Stapled Peptides With High Antimicrobial Activity and Stability. Microb Biotechnol 2025; 18:e70121. [PMID: 40042163 PMCID: PMC11881016 DOI: 10.1111/1751-7915.70121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2024] [Revised: 02/09/2025] [Accepted: 02/15/2025] [Indexed: 05/12/2025] Open
Abstract
Antimicrobial peptides (AMPs) face stability and toxicity challenges in clinical use. Stapled modification enhances their stability and effectiveness, but its application in peptide design is rarely reported. This study built ten prediction models for stapled AMPs using deep and machine learning, tested their accuracy with an independent data set and wet lab experiments, and characterised stapled loop structures using structural, sequence and amino acid descriptors. AlphaFold improved stapled peptide structure prediction. The support vector machine model performed best, while two deep learning models achieved the highest accuracy of 1.0 on an external test set. Designed cysteine- and lysine-stapled peptides inhibited various bacteria at low concentrations and showed good serum stability and low haemolytic activity. This study highlights the potential of deep learning methods in peptide modification and design.
Affiliation(s)
- Ruole Chen
- School of Life Science and Technology, China Pharmaceutical University, Nanjing, Jiangsu, China
- Yuhao You
- School of Life Science and Technology, China Pharmaceutical University, Nanjing, Jiangsu, China
- Yanchao Liu
- School of Life Science and Technology, China Pharmaceutical University, Nanjing, Jiangsu, China
- Xin Sun
- School of Life Science and Technology, China Pharmaceutical University, Nanjing, Jiangsu, China
- Tianyue Ma
- School of Life Science and Technology, China Pharmaceutical University, Nanjing, Jiangsu, China
- Xingzhen Lao
- School of Life Science and Technology, China Pharmaceutical University, Nanjing, Jiangsu, China
- Heng Zheng
- School of Life Science and Technology, China Pharmaceutical University, Nanjing, Jiangsu, China
4
Basu S, Yu J, Kihara D, Kurgan L. Twenty years of advances in prediction of nucleic acid-binding residues in protein sequences. Brief Bioinform 2024; 26:bbaf016. [PMID: 39833102 PMCID: PMC11745544 DOI: 10.1093/bib/bbaf016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2024] [Revised: 12/24/2024] [Accepted: 01/06/2025] [Indexed: 01/22/2025] Open
Abstract
Computational prediction of nucleic acid-binding residues in protein sequences is an active field of research, with over 80 methods that were released in the past 2 decades. We identify and discuss 87 sequence-based predictors that include dozens of recently published methods that are surveyed for the first time. We overview historical progress and examine multiple practical issues that include availability and impact of predictors, key features of their predictive models, and important aspects related to their training and assessment. We observe that the past decade has brought increased use of deep neural networks and protein language models, which contributed to substantial gains in the predictive performance. We also highlight advancements in vital and challenging issues that include cross-predictions between deoxyribonucleic acid (DNA)-binding and ribonucleic acid (RNA)-binding residues and targeting the two distinct sources of binding annotations, structure-based versus intrinsic disorder-based. The methods trained on the structure-annotated interactions tend to perform poorly on the disorder-annotated binding and vice versa, with only a few methods that target and perform well across both annotation types. The cross-predictions are a significant problem, with some predictors of DNA-binding or RNA-binding residues indiscriminately predicting interactions with both nucleic acid types. Moreover, we show that methods with web servers are cited substantially more than tools without implementation or with no longer working implementations, motivating the development and long-term maintenance of the web servers. We close by discussing future research directions that aim to drive further progress in this area.
Affiliation(s)
- Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
- Jing Yu
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
- Daisuke Kihara
- Department of Biological Sciences, Purdue University, 915 Mitch Daniels Boulevard, West Lafayette, IN 47907, United States
- Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907, United States
- Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
5
Kawai M, Fukuda A, Otomo R, Obata S, Minaga K, Asada M, Umemura A, Uenoyama Y, Hieda N, Morita T, Minami R, Marui S, Yamauchi Y, Nakai Y, Takada Y, Ikuta K, Yoshioka T, Mizukoshi K, Iwane K, Yamakawa G, Namikawa M, Sono M, Nagao M, Maruno T, Nakanishi Y, Hirai M, Kanda N, Shio S, Itani T, Fujii S, Kimura T, Matsumura K, Ohana M, Yazumi S, Kawanami C, Yamashita Y, Marusawa H, Watanabe T, Ito Y, Kudo M, Seno H. Early detection of pancreatic cancer by comprehensive serum miRNA sequencing with automated machine learning. Br J Cancer 2024; 131:1158-1168. [PMID: 39198617 PMCID: PMC11442445 DOI: 10.1038/s41416-024-02794-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 06/26/2024] [Accepted: 07/03/2024] [Indexed: 09/01/2024] Open
Abstract
BACKGROUND: Pancreatic cancer is often diagnosed at advanced stages, and early-stage diagnosis of pancreatic cancer is difficult because of nonspecific symptoms and a lack of available biomarkers. METHODS: We performed comprehensive serum miRNA sequencing of 212 pancreatic cancer patient samples from 14 hospitals and 213 non-cancerous healthy control samples. We randomly classified the pancreatic cancer and control samples into two cohorts: a training cohort (N = 185) and a validation cohort (N = 240). We created ensemble models that combined automated machine learning with 100 highly expressed miRNAs and their combination with CA19-9 and validated the performance of the models in the independent validation cohort. RESULTS: The diagnostic model with the combination of the 100 highly expressed miRNAs and CA19-9 could discriminate pancreatic cancer from non-cancerous healthy controls with high accuracy (area under the curve (AUC), 0.99; sensitivity, 90%; specificity, 98%). We validated high diagnostic accuracy in an independent asymptomatic early-stage (stage 0-I) pancreatic cancer cohort (AUC, 0.97; sensitivity, 67%; specificity, 98%). CONCLUSIONS: We demonstrate that the 100 highly expressed miRNAs and their combination with CA19-9 could be biomarkers for the specific and early detection of pancreatic cancer.
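The reported metrics follow the standard definitions of sensitivity and specificity, which can be sketched as below. The sample totals come from the abstract. The confusion counts are hypothetical, chosen only so that the formulas reproduce 90% and 98%; they are not the study's data.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of cancer samples that the model flags as cancer."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of control samples that the model flags as non-cancer."""
    return true_neg / (true_neg + false_pos)


# Sample counts stated in the abstract.
cancer_samples, control_samples = 212, 213
training_cohort, validation_cohort = 185, 240
total_samples = cancer_samples + control_samples

# Hypothetical confusion counts (assumption, for illustration only).
tp, fn, tn, fp = 90, 10, 98, 2
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)

print(total_samples, training_cohort + validation_cohort)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

The first printed line checks that the two cohorts account for every sequenced sample (212 + 213 = 185 + 240).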
Affiliation(s)
- Munenori Kawai
- Department of Gastroenterology and Hepatology, Kyoto University Graduate School of Medicine, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto, Japan
- Akihisa Fukuda
- Department of Gastroenterology and Hepatology, Kyoto University Graduate School of Medicine, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto, Japan.
- Ryo Otomo
- Research and Development Division, ARKRAY, Inc., Yousuien-nai, 59 Gansuin-cho, Kamigyo-ku, Kyoto, Japan
- Shunsuke Obata
- Research and Development Division, ARKRAY, Inc., Yousuien-nai, 59 Gansuin-cho, Kamigyo-ku, Kyoto, Japan
- Kosuke Minaga
- Department of Gastroenterology and Hepatology, Kindai University Faculty of Medicine, Osaka, Japan
- Masanori Asada
- Department of Gastroenterology and Hepatology, Osaka Red Cross Hospital, Osaka, Japan
- Atsushi Umemura
- Department of Pharmacology, Kyoto Prefectural University of Medicine, Kyoto, Japan
- Yoshito Uenoyama
- Department of Gastroenterology and Hepatology, Japanese Red Cross Wakayama Medical Center, Wakayama, Japan
- Nobuhiro Hieda
- Department of Gastroenterology, Otsu Red Cross Hospital, Shiga, Japan
- Toshihiro Morita
- Department of Gastroenterology and Hepatology, Kitano Hospital, Tazuke Kofukai Medical Research Institute, Osaka, Japan
- Ryuki Minami
- Department of Gastroenterology, Tenri Hospital, Nara, Japan
- Saiko Marui
- Department of Gastroenterology and Hepatology, Shiga General Hospital, Shiga, Japan
- Yuki Yamauchi
- Department of Gastroenterology, Hyogo Prefectural Amagasaki General Medical Center, Amagasaki, Japan
- Yoshitaka Nakai
- Department of Gastroenterology and Hepatology, Kyoto Katsura Hospital, Kyoto, Japan
- Yutaka Takada
- Department of Gastroenterology and Hepatology, Kobe City Nishi-Kobe Medical Center, Kobe, Japan
- Kozo Ikuta
- Division of Gastroenterology, Shinko Hospital, Kobe, Japan
- Takuto Yoshioka
- Department of Gastroenterology and Hepatology, Takatsuki Red Cross Hospital, Takatsuki, Japan
- Kenta Mizukoshi
- Department of Gastroenterology and Hepatology, Kyoto University Graduate School of Medicine, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto, Japan
- Kosuke Iwane
- Department of Gastroenterology and Hepatology, Kyoto University Graduate School of Medicine, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto, Japan
- Go Yamakawa
- Department of Gastroenterology and Hepatology, Kyoto University Graduate School of Medicine, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto, Japan
- Mio Namikawa
- Department of Gastroenterology and Hepatology, Kyoto University Graduate School of Medicine, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto, Japan
- Makoto Sono
- Department of Gastroenterology and Hepatology, Kyoto University Graduate School of Medicine, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto, Japan
- Munemasa Nagao
- Department of Gastroenterology and Hepatology, Kyoto University Graduate School of Medicine, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto, Japan
- Takahisa Maruno
- Department of Gastroenterology and Hepatology, Kyoto University Graduate School of Medicine, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto, Japan
- Yuki Nakanishi
- Department of Gastroenterology and Hepatology, Kyoto University Graduate School of Medicine, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto, Japan
- Mitsuharu Hirai
- Research and Development Division, ARKRAY, Inc., Yousuien-nai, 59 Gansuin-cho, Kamigyo-ku, Kyoto, Japan
- Naoki Kanda
- Department of Gastroenterology and Hepatology, Takatsuki Red Cross Hospital, Takatsuki, Japan
- Seiji Shio
- Division of Gastroenterology, Shinko Hospital, Kobe, Japan
- Toshinao Itani
- Department of Gastroenterology and Hepatology, Kobe City Nishi-Kobe Medical Center, Kobe, Japan
- Shigehiko Fujii
- Department of Gastroenterology and Hepatology, Kyoto Katsura Hospital, Kyoto, Japan
- Toshiyuki Kimura
- Department of Gastroenterology, Hyogo Prefectural Amagasaki General Medical Center, Amagasaki, Japan
- Kazuyoshi Matsumura
- Department of Gastroenterology and Hepatology, Shiga General Hospital, Shiga, Japan
- Masaya Ohana
- Department of Gastroenterology, Tenri Hospital, Nara, Japan
- Shujiro Yazumi
- Department of Gastroenterology and Hepatology, Kitano Hospital, Tazuke Kofukai Medical Research Institute, Osaka, Japan
- Chiharu Kawanami
- Department of Gastroenterology, Otsu Red Cross Hospital, Shiga, Japan
- Yukitaka Yamashita
- Department of Gastroenterology and Hepatology, Japanese Red Cross Wakayama Medical Center, Wakayama, Japan
- Hiroyuki Marusawa
- Department of Gastroenterology and Hepatology, Osaka Red Cross Hospital, Osaka, Japan
- Tomohiro Watanabe
- Department of Gastroenterology and Hepatology, Kindai University Faculty of Medicine, Osaka, Japan
- Yoshito Ito
- Department of Gastroenterology and Hepatology, Kyoto Prefectural University of Medicine, Kyoto, Japan
- Masatoshi Kudo
- Department of Gastroenterology and Hepatology, Kindai University Faculty of Medicine, Osaka, Japan
- Hiroshi Seno
- Department of Gastroenterology and Hepatology, Kyoto University Graduate School of Medicine, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto, Japan
6
Liu K, Geng S, Shen P, Zhao L, Zhou P, Liu W. Development and application of a machine learning-based predictive model for obstructive sleep apnea screening. Front Big Data 2024; 7:1353469. [PMID: 38817683 PMCID: PMC11137315 DOI: 10.3389/fdata.2024.1353469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Accepted: 04/29/2024] [Indexed: 06/01/2024] Open
Abstract
Objective: To develop a robust machine learning prediction model for the automatic screening and diagnosis of obstructive sleep apnea (OSA) using five advanced algorithms, namely Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Support Vector Machine (SVM), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF), to provide substantial support for early clinical diagnosis and intervention. Methods: We conducted a retrospective analysis of clinical data from 439 patients who underwent polysomnography at the Affiliated Hospital of Xuzhou Medical University between October 2019 and October 2022. Predictor variables such as demographic information [age, sex, height, weight, body mass index (BMI)], medical history, and Epworth Sleepiness Scale (ESS) were used. Univariate analysis was used to identify variables with significant differences, and the dataset was then divided into training and validation sets in a 4:1 ratio. The training set was used to build a model for predicting OSA severity grading. The validation set was used to assess model performance using the area under the curve (AUC). Additionally, a separate analysis was conducted, categorizing the normal population as one group and patients with moderate-to-severe OSA as another. The same univariate analysis was applied, and the dataset was divided into training and validation sets in a 4:1 ratio. The training set was used to build a prediction model for screening moderate-to-severe OSA, while the validation set was used to verify the model's performance. Results: In the four-group model, LightGBM outperformed the others, with the top five feature importance rankings of ESS total score, BMI, sex, hypertension, and gastroesophageal reflux (GERD), where age, ESS total score and BMI played the most significant roles. In the dichotomous model, RF was the best performer of the five models. The top five features by importance in the best-performing RF model were ESS total score, BMI, GERD, age and dry mouth, with ESS total score and BMI being particularly pivotal. Conclusion: Machine learning-based prediction models for OSA disease grading and screening prove instrumental in the early identification of patients with moderate-to-severe OSA, revealing pertinent risk factors and facilitating timely interventions to counter pathological changes induced by OSA. Notably, ESS total score and BMI emerge as the most critical features for predicting OSA, emphasizing their significance in clinical assessments. The dataset will be publicly available on my GitHub.
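The 4:1 division into training and validation sets can be sketched as below. The abstract does not say whether the split was random or stratified by OSA severity, so this minimal version assumes a plain shuffled split. The integer IDs are placeholders for the 439 patient records, and the function name and seed are invented for illustration.

```python
import random


def split_four_to_one(records, seed=0):
    """Shuffle records and split them 4:1 into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


patient_ids = list(range(439))  # placeholders for the 439 patient records
train_ids, valid_ids = split_four_to_one(patient_ids)
print(len(train_ids), len(valid_ids))
```

With 439 records the ratio cannot be exact, so this version rounds the training share down and leaves the remainder for validation.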
Affiliation(s)
- Kang Liu
- Department of Otolaryngology, Head and Neck Surgery, Affiliated Hospital of Xuzhou Medical University, Xuzhou, China
- Shi Geng
- Artificial Intelligence Unit, Department of Medical Equipment Management, Affiliated Hospital of Xuzhou Medical University, Xuzhou, China
- Ping Shen
- Department of Otolaryngology, Head and Neck Surgery, Affiliated Hospital of Xuzhou Medical University, Xuzhou, China
- Lei Zhao
- Artificial Intelligence Unit, Department of Medical Equipment Management, Affiliated Hospital of Xuzhou Medical University, Xuzhou, China
- Peng Zhou
- Department of Otolaryngology, Head and Neck Surgery, Affiliated Hospital of Xuzhou Medical University, Xuzhou, China
- Wen Liu
- Department of Otolaryngology, Head and Neck Surgery, Affiliated Hospital of Xuzhou Medical University, Xuzhou, China
7
Tao L, Zhou T, Wu Z, Hu F, Yang S, Kong X, Li C. ESPDHot: An Effective Machine Learning-Based Approach for Predicting Protein-DNA Interaction Hotspots. J Chem Inf Model 2024; 64:3548-3557. [PMID: 38587997 DOI: 10.1021/acs.jcim.3c02011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/10/2024]
Abstract
Protein-DNA interactions are pivotal to various cellular processes. Precise identification of the hotspot residues for protein-DNA interactions holds great significance for revealing the intricate mechanisms in protein-DNA recognition and for providing essential guidance for protein engineering. Aiming at protein-DNA interaction hotspots, this work introduces an effective prediction method, ESPDHot, based on a stacked ensemble machine learning framework. Here, an interface residue whose mutation leads to a binding free energy change (ΔΔG) exceeding 2 kcal/mol is defined as a hotspot. To tackle the imbalanced data set issue, adaptive synthetic sampling (ADASYN), an oversampling technique, is adopted to synthetically generate new minority samples, thereby rectifying the data imbalance. As for molecular characteristics, besides traditional features, we introduce three new characteristic types: the residue interface preference proposed by us, residue fluctuation dynamics characteristics, and coevolutionary features. Combining the Boruta method with our previously developed Random Grouping strategy, we obtained an optimal set of features. Finally, a stacking classifier is constructed to output prediction results, which integrates three classical predictors, Support Vector Machine (SVM), XGBoost, and Artificial Neural Network (ANN), as the first layer, and the Logistic Regression (LR) algorithm as the second one. Notably, ESPDHot outperforms the current state-of-the-art predictors, achieving superior performance on the independent test data set, with F1, MCC, and AUC reaching 0.571, 0.516, and 0.870, respectively.
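The hotspot definition, and the class imbalance it tends to produce, can be sketched as below. The 2 kcal/mol threshold is taken from the abstract and applied as a strict inequality, following the word "exceeding". The residue names and ΔΔG values are hypothetical. The code only measures the imbalance that an oversampler such as ADASYN would then address; it does not implement ADASYN.

```python
HOTSPOT_THRESHOLD = 2.0  # kcal/mol, as stated in the abstract


def label_hotspots(ddg_by_residue, threshold=HOTSPOT_THRESHOLD):
    """Map each residue to 1 (hotspot) if its ddG exceeds the threshold, else 0."""
    return {res: int(ddg > threshold) for res, ddg in ddg_by_residue.items()}


def imbalance_ratio(labels):
    """Number of non-hotspots per hotspot (assumes at least one hotspot)."""
    positives = sum(labels.values())
    negatives = len(labels) - positives
    return negatives / positives


# Hypothetical interface residues with made-up ddG values (kcal/mol).
ddg_values = {"R12": 2.6, "K45": 0.4, "S46": 1.1, "N80": 2.0, "Y81": 0.9, "T99": 0.2}

labels = label_hotspots(ddg_values)
ratio = imbalance_ratio(labels)
print(labels, ratio)
```

In this toy set only one of six residues is a hotspot, which is the kind of skew that motivates oversampling the minority class before training.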
Affiliation(s)
- Lianci Tao
- College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Tong Zhou
- College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Zhixiang Wu
- College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Fangrui Hu
- College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Shuang Yang
- College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Xiaotian Kong
- College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Chunhua Li
- College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
8
Kabir MWU, Alawad DM, Pokhrel P, Hoque MT. DRBpred: A sequence-based machine learning method to effectively predict DNA- and RNA-binding residues. Comput Biol Med 2024; 170:108081. [PMID: 38295475 PMCID: PMC10922697 DOI: 10.1016/j.compbiomed.2024.108081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Revised: 01/12/2024] [Accepted: 01/27/2024] [Indexed: 02/02/2024]
Abstract
DNA-binding and RNA-binding proteins are essential to an organism's normal life cycle. These proteins have diverse functions in various biological processes. DNA-binding proteins are crucial for DNA replication, transcription, repair, packaging, and gene expression. Likewise, RNA-binding proteins are essential for the post-transcriptional control of RNAs and RNA metabolism. Identifying DNA- and RNA-binding residues is essential for biological research and for understanding the pathogenesis of many diseases. However, most DNA-binding and RNA-binding proteins still need to be discovered. This research explored various properties of the protein sequences, such as amino acid composition type, Position-Specific Scoring Matrix (PSSM) values of amino acids, Hidden Markov Model (HMM) profiles, physicochemical properties, structural properties, torsion angles, and disorder regions. We utilized a sliding window technique to extract more information from a target residue's neighbors. We proposed an optimized Light Gradient Boosting Machine (LightGBM) method, named DRBpred, to predict DNA-binding and RNA-binding residues from the protein sequence. Compared to the state-of-the-art method, DRBpred improves sensitivity, Matthews Correlation Coefficient (MCC), and AUC by 112.00%, 33.33%, and 6.49%, respectively, on the DNA-binding test set, and by 112.50%, 16.67%, and 7.46%, respectively, on the RNA-binding test set.
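The sliding window technique mentioned above can be sketched as follows. The abstract does not give the window size or how sequence ends are handled, so the half-width of 2 and the `X` padding character are assumptions, and the function name is invented for illustration.

```python
def sliding_windows(sequence, half_width=2, pad="X"):
    """Return one window of length 2*half_width + 1 centred on each residue.

    Sequence ends are padded so that every residue gets a full-length window.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


sequence = "MKVLA"  # toy sequence
windows = sliding_windows(sequence, half_width=2)
for residue, window in zip(sequence, windows):
    print(residue, window)
```

Each window would then be turned into a feature vector (for example by concatenating the per-residue features of every position in the window), so the classifier sees the target residue together with its neighbors.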
Affiliation(s)
- Md Wasi Ul Kabir
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
- Duaa Mohammad Alawad
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
- Pujan Pokhrel
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
- Md Tamjidul Hoque
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
9
Zhu L, Wang L, Yang Z, Xu P, Yang S. PPSNO: A Feature-Rich SNO Sites Predictor by Stacking Ensemble Strategy from Protein Sequence-Derived Information. Interdiscip Sci 2024; 16:192-217. [PMID: 38206557 DOI: 10.1007/s12539-023-00595-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/01/2023] [Revised: 11/20/2023] [Accepted: 11/21/2023] [Indexed: 01/12/2024]
Abstract
Protein S-nitrosylation (SNO) is a significant post-translational modification that affects the stability, activity, cellular localization, and function of proteins. Therefore, highly accurate prediction of SNO sites aids in understanding the mechanisms of biological function. In this work, we constructed a predictor, named PPSNO, that predicts protein SNO sites using stacked ensemble learning. PPSNO integrates multiple machine learning techniques into an ensemble model, enhancing its predictive accuracy. First, we established benchmark datasets by collecting SNO sites from various sources, including literature, databases, and other predictors. Second, various feature extraction techniques are applied to derive characteristics from protein sequences, which are subsequently combined and used to train the PPSNO predictor. Five-fold cross-validation experiments show that PPSNO outperformed existing predictors, such as PSNO, PreSNO, pCysMod, DeepNitro, RecSNO, and Mul-SNO. The PPSNO predictor achieved an accuracy of 92.8%, an area under the curve (AUC) of 96.1%, a Matthews correlation coefficient (MCC) of 81.3%, an F1-score of 85.6%, a sensitivity (SN) of 79.3%, a specificity (SP) of 97.7%, and an average precision (AP) of 92.2%. We also employed ROC curves, PR curves, and radar plots to show the superior performance of PPSNO. Our study shows that fused protein sequence features and two-layer stacked ensemble models can improve the accuracy of predicting SNO sites, which can aid in comprehending cellular processes and disease mechanisms. The codes and data are available at https://github.com/serendipity-wly/PPSNO .
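Most of the metrics listed above derive from the four cells of a binary confusion matrix. The sketch below uses the standard definitions. The counts are hypothetical and are not taken from the paper. AUC and AP are omitted because they need ranked scores rather than counts.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, SN, SP, F1 and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)  # sensitivity (recall)
    sp = tn / (tn + fp)  # specificity
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sn / (precision + sn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "SN": sn, "SP": sp, "F1": f1, "MCC": mcc}


# Hypothetical counts for an imbalanced site-prediction task.
metrics = classification_metrics(tp=80, fp=5, tn=195, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced data such as modification-site prediction, accuracy and specificity can look high while sensitivity lags, which is why MCC and F1 are usually reported alongside them.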
Affiliation(s)
- Lun Zhu
- School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou, 213164, China
- Liuyang Wang
- School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou, 213164, China
- Zexi Yang
- School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou, 213164, China
- Piao Xu
- College of Economics and Management, Nanjing Forestry University, Nanjing, 210037, China
- Sen Yang
- School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou, 213164, China.
- The Affiliated Changzhou No. 2 People's Hospital of Nanjing Medical University, Changzhou, 213164, China.
10
Zhang J, Basu S, Kurgan L. HybridDBRpred: improved sequence-based prediction of DNA-binding amino acids using annotations from structured complexes and disordered proteins. Nucleic Acids Res 2024; 52:e10. [PMID: 38048333 PMCID: PMC10810184 DOI: 10.1093/nar/gkad1131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Accepted: 11/10/2023] [Indexed: 12/06/2023] Open
Abstract
Current predictors of DNA-binding residues (DBRs) from protein sequences belong to two distinct groups: those trained on binding annotations extracted from structured protein-DNA complexes (structure-trained) and those trained on intrinsically disordered proteins (disorder-trained). We complete the first empirical analysis of predictive performance across the structure- and disorder-annotated proteins for a representative collection of ten predictors. The majority of the structure-trained tools perform well on the structure-annotated proteins while doing relatively poorly on the disorder-annotated proteins, and vice versa. Several methods make accurate predictions for the structure-annotated proteins or the disorder-annotated proteins, but none performs highly accurately for both annotation types. Moreover, most predictors make excessive cross-predictions for the disorder-annotated proteins, where residues that interact with non-DNA ligand types are predicted as DBRs. Motivated by these results, we design, validate and deploy an innovative meta-model, hybridDBRpred, that uses a deep transformer network to combine predictions generated by the three best current predictors. HybridDBRpred provides accurate predictions and low levels of cross-predictions across the two annotation types, and is statistically more accurate than each of the ten tools and baseline meta-predictors that rely on averaging and logistic regression. We deploy hybridDBRpred as a convenient web server at http://biomine.cs.vcu.edu/servers/hybridDBRpred/ and provide the corresponding source code at https://github.com/jianzhang-xynu/hybridDBRpred.
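The averaging baseline that the abstract compares against can be sketched as below. This illustrates only the simple meta-predictor baseline, not the transformer-based hybridDBRpred model. The three score lists, the 0.5 threshold, and the function names are assumptions made for the example.

```python
def average_meta_predictor(per_predictor_scores):
    """Average residue-level scores across predictors (one list per predictor)."""
    n_predictors = len(per_predictor_scores)
    return [sum(column) / n_predictors for column in zip(*per_predictor_scores)]


def call_binding(scores, threshold=0.5):
    """Turn averaged scores into binary DNA-binding calls."""
    return [int(score >= threshold) for score in scores]


# Hypothetical per-residue propensities from three input predictors.
scores_a = [0.9, 0.2, 0.6, 0.1]
scores_b = [0.7, 0.4, 0.4, 0.1]
scores_c = [0.8, 0.3, 0.8, 0.4]

averaged = average_meta_predictor([scores_a, scores_b, scores_c])
calls = call_binding(averaged)
print(averaged, calls)
```

Averaging treats every input predictor as equally reliable for every residue. A learned meta-model can instead weight the inputs depending on context, which is the motivation the abstract gives for going beyond this baseline.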
Affiliation(s)
- Jian Zhang
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, PR China
| | - Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
|
11
|
Pradhan UK, Meher PK, Naha S, Pal S, Gupta S, Gupta A, Parsad R. RBPLight: a computational tool for discovery of plant-specific RNA-binding proteins using light gradient boosting machine and ensemble of evolutionary features. Brief Funct Genomics 2023; 22:401-410. [PMID: 37158175 DOI: 10.1093/bfgp/elad016] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/05/2022] [Revised: 04/12/2023] [Accepted: 04/21/2023] [Indexed: 05/10/2023] Open
Abstract
RNA-binding proteins (RBPs) are essential for post-transcriptional gene regulation in eukaryotes, including splicing control, mRNA transport and decay. Thus, accurate identification of RBPs is important to understand gene expression and regulation of cell state. In order to detect RBPs, a number of computational models have been developed. These methods made use of datasets from several eukaryotic species, specifically from mice and humans. Although some models have been tested on Arabidopsis, these techniques fall short of correctly identifying RBPs for other plant species. Therefore, the development of a powerful computational model for identifying plant-specific RBPs is needed. In this study, we present a novel computational model for locating RBPs in plants. Five deep learning models and ten shallow learning algorithms were utilized for prediction with 20 sequence-derived and 20 evolutionary feature sets. The highest repeated five-fold cross-validation performance, 91.24% AU-ROC and 91.91% AU-PRC, was achieved by the light gradient boosting machine. When evaluated on an independent dataset, the developed approach achieved 94.00% AU-ROC and 94.50% AU-PRC. The proposed model achieved significantly higher accuracy for predicting plant-specific RBPs than the currently available state-of-the-art RBP prediction models. Although certain models have already been trained and assessed on the model organism Arabidopsis, this is the first comprehensive computational model for the discovery of plant-specific RBPs. The web server RBPLight was also developed and is publicly accessible at https://iasri-sg.icar.gov.in/rbplight/ for the convenience of researchers identifying RBPs in plants.
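The repeated five-fold cross-validation named in this abstract can be sketched as index generation using only the standard library. The number of repeats, the seed and the sample count below are assumptions; the abstract does not state them, and the classifier itself is omitted.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=5, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repeat
        # Deal the shuffled indices round-robin into n_folds nearly equal folds
        folds = [order[i::n_folds] for i in range(n_folds)]
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield repeat, fold, train_idx, test_idx


# Hypothetical dataset of 20 sequences: 3 repeats x 5 folds = 15 splits
splits = list(repeated_kfold_indices(n_samples=20))
```

In each repeat every sample is held out exactly once, and metrics such as AU-ROC would be averaged over all splits.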
Affiliation(s)
- Upendra K Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Soumen Pal
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sagar Gupta
- CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur (HP) 176061, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
|
12
|
Jain A, Begum T, Ahmad S. Analysis and Prediction of Pathogen Nucleic Acid Specificity for Toll-like Receptors in Vertebrates. J Mol Biol 2023; 435:168208. [PMID: 37479078 DOI: 10.1016/j.jmb.2023.168208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 06/20/2023] [Accepted: 07/13/2023] [Indexed: 07/23/2023]
Abstract
Identification of key sequence, expression and function related features of nucleic acid-sensing host proteins is of fundamental importance to understand the dynamics of pathogen-specific host responses. To meet this objective, we considered toll-like receptors (TLRs), a representative class of membrane-bound sensor proteins, from 17 vertebrate species covering mammals, birds, reptiles, amphibians, and fishes in this comparative study. We identified the molecular signatures of host TLRs that are responsible for sensing pathogen nucleic acids or other pathogen-associated molecular patterns (PAMPs), and potentially play important roles in host defence mechanism. Interestingly, our findings reveal that such host-specific features are directly related to the strand (single or double) specificity of nucleic acid from pathogens. However, during host-pathogen interactions, such features were unable to explain the pathogenic PAMP (i.e., DNA, RNA or other) selectivity, suggesting a more complex mechanism. Using these features, we developed a number of machine learning models, of which Random Forest achieved a high performance (94.57% accuracy) to predict strand specificity of TLRs from protein-derived features. We applied the trained model to propose strand specificity of some previously uncharacterized distinct fish-specific novel TLRs (TLR18, TLR23, TLR24, TLR25, TLR27).
Affiliation(s)
- Anuja Jain
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India. https://twitter.com/@Anuja334
| | - Tina Begum
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
| | - Shandar Ahmad
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
|
13
|
Rai BK, Apgar JR, Bennett EM. Low-data interpretable deep learning prediction of antibody viscosity using a biophysically meaningful representation. Sci Rep 2023; 13:2917. [PMID: 36806303 PMCID: PMC9941094 DOI: 10.1038/s41598-023-28841-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/21/2022] [Accepted: 01/25/2023] [Indexed: 02/22/2023] Open
Abstract
Deep learning, aided by the availability of big data sets, has led to substantial advances across many disciplines. However, many scientific problems of practical interest lack sufficiently large datasets amenable to deep learning. Prediction of antibody viscosity is one such problem, where deep learning methods have not yet been explored due to the relative scarcity of relevant training data. In this work, we overcome this limitation using a biophysically meaningful representation that enables us to develop generalizable models even with limited training data. We present PfAbNet-viscosity, a 3D convolutional neural network architecture, to predict the high-concentration viscosity of therapeutic antibodies. We show that, with the electrostatic potential surface of the antibody variable region as the only input to the network, models trained on as few as a couple dozen data points can generalize with high accuracy. Our feature attribution analysis shows that PfAbNet-viscosity has learned key biophysical drivers of viscosity. The applicability of our approach to other biological systems is discussed.
Affiliation(s)
- Brajesh K Rai
- Pfizer Worldwide Research Development and Medical, Machine Learning and Computational Sciences, 610 Main Street, Cambridge, MA, 02139, USA.
| | - James R Apgar
- Pfizer Worldwide Research Development and Medical, Biomedicine Design, 610 Main Street, Cambridge, MA, 02139, USA
| | - Eric M Bennett
- Pfizer Worldwide Research Development and Medical, Biomedicine Design, 610 Main Street, Cambridge, MA, 02139, USA
|
14
|
Zhu Y, Liu Y, Chen Y, Li L. ResSUMO: A Deep Learning Architecture Based on Residual Structure for Prediction of Lysine SUMOylation Sites. Cells 2022; 11:2646. [PMID: 36078053 PMCID: PMC9454673 DOI: 10.3390/cells11172646] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/25/2022] [Revised: 08/18/2022] [Accepted: 08/22/2022] [Indexed: 12/26/2022] Open
Abstract
Lysine SUMOylation plays an essential role in various biological functions. Several approaches integrating various algorithms have been developed for predicting SUMOylation sites based on a limited dataset. Recently, the number of identified SUMOylation sites has significantly increased due to investigation at the proteomics scale. We collected modification data and found that the reported approaches performed poorly on our collected data. Therefore, it is essential to explore the characteristics of this modification and construct prediction models with improved performance based on an enlarged dataset. In this study, we constructed and compared 16 classifiers by integrating four different algorithms and four encoding features selected from 11 sequence-based or physicochemical features. We found that the convolutional neural network (CNN) model integrated with a residual structure, dubbed ResSUMO, performed favorably when compared with the traditional machine learning and CNN models in both cross-validation and independent tests. The area under the receiver operating characteristic (ROC) curve for ResSUMO was around 0.80, superior to that of the reported predictors. We also found that increasing the depth of neural networks in the CNN models did not improve prediction performance due to the degradation problem, but the residual structure could be included to optimize the neural networks and improve performance. This indicates that residual neural networks have the potential to be broadly applied in the prediction of other types of modification sites with great effectiveness and robustness. Furthermore, the online ResSUMO service is freely accessible.
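The residual structure referred to in this abstract adds a layer's input back to its output. The sketch below uses a single dense layer in plain Python as a stand-in for the convolutional layers of ResSUMO. The layer type, sizes and weights are assumptions made purely for illustration.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def dense(vector, weights, bias):
    """Fully connected layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def residual_block(vector, weights, bias):
    """Return relu(F(x) + x), where F is a single dense layer."""
    transformed = dense(vector, weights, bias)
    # The shortcut connection adds the unchanged input to the layer output
    return relu([t + x for t, x in zip(transformed, vector)])


x = [1.0, -2.0, 0.5]
zero_weights = [[0.0] * 3 for _ in range(3)]
zero_bias = [0.0] * 3
# With F(x) = 0 the block reduces to relu(x), so it cannot lose the input signal
out = residual_block(x, zero_weights, zero_bias)
```

Because a block with zero weights simply passes its input through the activation, stacking more such blocks need not degrade performance, which is the property the abstract appeals to.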
Affiliation(s)
- Yafei Zhu
- College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
| | - Yuhai Liu
- Dawning International Information Industry, Co., Ltd., Qingdao 266101, China
| | - Yu Chen
- College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
| | - Lei Li
- College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
- Faculty of Biomedical and Rehabilitation Engineering, University of Health and Rehabilitation Sciences, Qingdao 266001, China
|
15
|
Li J, Zhu W, Zhou J, Yun W, Li X, Guan Q, Lv W, Cheng Y, Ni H, Xie Z, Li M, Zhang L, Xu Y, Zhang Q. A Presurgical Unfavorable Prediction Scale of Endovascular Treatment for Acute Ischemic Stroke. Front Aging Neurosci 2022; 14:942285. [PMID: 35847671 PMCID: PMC9284674 DOI: 10.3389/fnagi.2022.942285] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/12/2022] [Accepted: 06/02/2022] [Indexed: 11/13/2022] Open
Abstract
Objective: To develop a prognostic prediction model of endovascular treatment (EVT) for acute ischemic stroke (AIS) induced by large-vessel occlusion (LVO), this study applied the light gradient boosting machine (LightGBM), a machine learning classification model, to construct a unique prediction model. Methods: A total of 973 patients were enrolled. The primary outcome was assessed with the modified Rankin scale (mRS) at 90 days, and a favorable outcome was defined as an mRS score of 0–2. The LightGBM algorithm and logistic regression (LR) were used to construct a prediction model. A prediction scale was then established and verified using both internal data and external data. Results: A total of 20 presurgical variables were analyzed using LR and LightGBM. The results of the LightGBM algorithm indicated that the accuracy and precision of the prediction model were 73.77% and 73.16%, respectively. The area under the curve (AUC) was 0.824. Furthermore, the top 5 variables suggesting unfavorable outcomes were blood glucose level at admission, age, onset to EVT time, onset to hospital time, and National Institutes of Health Stroke Scale (NIHSS) score (importance = 130.9, 102.6, 96.5, 89.5 and 84.4, respectively). According to the AUC, we established the key cutoff points and constructed a prediction scale based on their respective weightings. The established prediction scale was then verified in raw and external data, with sensitivities of 80.4% and 83.5%, respectively. Finally, scores >3 demonstrated better accuracy in predicting unfavorable outcomes. Conclusion: The presurgical prediction scale is feasible and accurate in identifying unfavorable outcomes of AIS after EVT.
Affiliation(s)
- Jingwei Li
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Institute of Brain Sciences, Nanjing University, Nanjing, China
- Jiangsu Key Laboratory for Molecular Medicine, Medical School of Nanjing University, Nanjing, China
- Jiangsu Province Stroke Center for Diagnosis and Therapy, Nanjing, China
- Nanjing Neurology Clinic Medical Center, Nanjing, China
| | - Wencheng Zhu
- The Institute of Software, Chinese Academy of Sciences, Beijing, China
| | - Junshan Zhou
- Department of Neurology, Nanjing First Hospital, Nanjing Medical University, Nanjing, China
| | - Wenwei Yun
- Department of Neurology, Changzhou No.2 People's Hospital Affiliated to Nanjing Medical University, Changzhou, China
| | - Xiaobo Li
- Department of Neurology, Northern Jiangsu People's Hospital, Clinical Medical School of Yangzhou University, Yangzhou, China
| | - Qiaochu Guan
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
| | - Weiping Lv
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
| | - Yue Cheng
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
| | - Huanyu Ni
- Department of Pharmacy of Drum Tower Hospital, Medical School, Nanjing University, Nanjing, China
| | - Ziyi Xie
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
| | - Mengyun Li
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
| | - Lu Zhang
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
| | - Yun Xu
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Institute of Brain Sciences, Nanjing University, Nanjing, China
- Jiangsu Key Laboratory for Molecular Medicine, Medical School of Nanjing University, Nanjing, China
- Jiangsu Province Stroke Center for Diagnosis and Therapy, Nanjing, China
- Nanjing Neurology Clinic Medical Center, Nanjing, China
| | - Qingxiu Zhang
- Department of Neurology of Drum Tower Hospital, Medical School and the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing, China
- Institute of Brain Sciences, Nanjing University, Nanjing, China
- Jiangsu Key Laboratory for Molecular Medicine, Medical School of Nanjing University, Nanjing, China
- Jiangsu Province Stroke Center for Diagnosis and Therapy, Nanjing, China
- Nanjing Neurology Clinic Medical Center, Nanjing, China
- *Correspondence: Qingxiu Zhang
|
16
|
Arya A, Mary Varghese D, Kumar Verma A, Ahmad S. Inadequacy of evolutionary profiles vis-a-vis single sequences in predicting transient DNA-binding sites in proteins. J Mol Biol 2022; 434:167640. [PMID: 35597551 DOI: 10.1016/j.jmb.2022.167640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2022] [Revised: 05/01/2022] [Accepted: 05/16/2022] [Indexed: 10/18/2022]
Abstract
Sequence-based prediction of DNA-binding residues in a protein is a widely studied problem for which machine learning methods with continuously improving predictive power have been developed. Concatenated rows within a sliding window of a Position Specific Substitution Matrix (PSSM) of the protein concerned are currently used as the primary feature set in almost all methods for predicting DNA-binding residues. Here we report that these evolutionary profiles are powerful only for identifying conserved binding sites and fall short for residue positions that undergo binding to non-binding transitions in closely related proteins. We created a database of highly similar protein pairs with known protein-DNA complexes and investigated the differential predictability of conserved and transient binding within each pair. Retraining machine learning models uniformly, we compared the predictive power of models trained on PSSMs against similarly trained models on sparse-encoded single sequences. We found that transient binding site predictions from evolutionary profiles are outperformed by single-sequence-based models under controlled training and test experiments by as much as 8 percentage points. Thus, we conclude that PSSM-based models are inadequate for predicting high-specificity DNA-binding residues. These findings are of critical significance for the design of mutant- and species-specific DNA ligands and for homology-based modeling of protein-DNA complexes.
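The two feature sets compared in this abstract share one construction: per-residue rows are concatenated over a sliding window. The sketch below builds the sparse (one-hot) single-sequence variant. The same `window_features` helper would accept PSSM rows of 20 scores each. The toy sequence, the window half-width and the zero padding at the termini are assumptions, not details taken from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_rows(sequence):
    """Sparse (one-hot) encoding: one 20-dimensional row per residue."""
    return [[1.0 if aa == letter else 0.0 for letter in AMINO_ACIDS]
            for aa in sequence]


def window_features(rows, center, half_width):
    """Concatenate the rows in a window centred on `center`.

    Positions beyond either terminus are zero-padded so that every
    residue gets a feature vector of the same length.
    """
    width = len(rows[0])
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(rows):
            features.extend(rows[pos])
        else:
            features.extend([0.0] * width)
    return features


sequence = "MKRAL"  # hypothetical toy sequence
encoded = one_hot_rows(sequence)
# Window of 5 residues around the first residue: 5 x 20 = 100 features
vec = window_features(encoded, center=0, half_width=2)
```

Swapping `encoded` for a PSSM of the same shape gives the evolutionary-profile features, so the two models in such a comparison differ only in the rows fed to the window.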
Affiliation(s)
- Ajay Arya
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, INDIA
| | - Dana Mary Varghese
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, INDIA
| | - Ajay Kumar Verma
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, INDIA
| | - Shandar Ahmad
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, INDIA.
|
17
|
A Comprehensive Review of Computation-Based Metal-Binding Prediction Approaches at the Residue Level. BioMed Research International 2022; 2022:8965712. [PMID: 35402609 PMCID: PMC8989566 DOI: 10.1155/2022/8965712] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/02/2022] [Accepted: 03/04/2022] [Indexed: 12/29/2022]
Abstract
Clear evidence has shown that metal ions strongly connect and delicately tune the dynamic homeostasis in living bodies. They have been proved to be associated with protein structure, stability, regulation, and function. Even small changes in the concentration of metal ions can shift their effects from natural beneficial functions to harmful. This leads to degenerative diseases, malignant tumors, and cancers. Accurate characterizations and predictions of metalloproteins at the residue level promise informative clues to the investigation of intrinsic mechanisms of protein-metal ion interactions. Compared to biophysical or biochemical wet-lab technologies, computational methods provide open web interfaces of high-resolution databases and high-throughput predictors for efficient investigation of metal-binding residues. This review surveys and details 18 public databases of metal-protein binding. We collect a comprehensive set of 44 computation-based methods and classify them into four categories, namely, learning-, docking-, template-, and meta-based methods. We analyze the benchmark datasets, assessment criteria, feature construction, and algorithms. We also compare several methods on two benchmark testing datasets and include a discussion about currently publicly available predictive tools. Finally, we summarize the challenges and underlying limitations of the current studies and propose several prospective directions concerning the future development of the related databases and methods.
|
18
|
Arif M, Ahmed S, Ge F, Kabir M, Khan YD, Yu DJ, Thafar M. StackACPred: Prediction of anticancer peptides by integrating optimized multiple feature descriptors with stacked ensemble approach. Chemometrics and Intelligent Laboratory Systems 2022; 220:104458. [DOI: 10.1016/j.chemolab.2021.104458] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Indexed: 11/17/2024]
|
19
|
Herrera-Bravo J, Farías JG, Contreras FP, Herrera-Belén L, Norambuena JA, Beltrán JF. VirVACPRED: A Web Server for Prediction of Protective Viral Antigens. Int J Pept Res Ther 2021; 28:35. [PMID: 34934411 PMCID: PMC8679566 DOI: 10.1007/s10989-021-10345-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Accepted: 12/07/2021] [Indexed: 11/25/2022]
Abstract
Viral antigens are key in the development of vaccines that prevent or eradicate infections caused by these pathogens. Bioinformatics tools are modern alternatives that facilitate the discovery of viral antigens, reducing the costs of experimental assays. We developed a bioinformatics tool called VirVACPRED, which is highly efficient in predicting viral antigens. In this study, we obtained a model based on the gradient boosting classifier, which showed high performance during the training, leave-one-out cross-validation (accuracy = 0.7402, sensitivity = 0.7319, precision = 0.7503, F1 = 0.7251, kappa = 0.4774, Matthews correlation coefficient = 0.4981) and testing (accuracy = 0.8889, sensitivity = 1.0, precision = 0.8276, F1 = 0.9057, kappa = 0.7734, Matthews correlation coefficient = 0.7941). VirVACPRED is a robust tool that can be of great help in the search and proposal of new viral antigens, which can be considered in the development of future vaccines against infections caused by viruses.
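The six metrics listed in this abstract all follow from a binary confusion matrix. The sketch below computes them with the standard definitions. The counts passed in (TP = 24, FP = 5, TN = 16, FN = 0) are not given in the abstract. They are an inferred example of one confusion matrix that reproduces the reported testing values when rounded to four decimals.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # also called recall
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Inferred counts (an assumption): consistent with the reported test metrics
reported = binary_metrics(tp=24, fp=5, tn=16, fn=0)
```

Kappa and the Matthews correlation coefficient both account for class imbalance, which is why they sit well below the accuracy here.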
Affiliation(s)
- Jesús Herrera-Bravo
- Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Santiago, Chile
- Center of Molecular Biology and Pharmacogenetics, Scientific and Technological Bioresource Nucleus, Universidad de La Frontera, Temuco, Chile
| | - Jorge G. Farías
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
| | - Fernanda Parraguez Contreras
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
| | - Lisandra Herrera-Belén
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
| | - Juan-Alejandro Norambuena
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Program on Natural Resources Sciences, Universidad de La Frontera, Avenida Francisco Salazar, 01145, P.O. Box 54-D, 4780000 Temuco, Chile
| | - Jorge F. Beltrán
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
|
20
|
Arslan E, Schulz J, Rai K. Machine Learning in Epigenomics: Insights into Cancer Biology and Medicine. Biochim Biophys Acta Rev Cancer 2021; 1876:188588. [PMID: 34245839 PMCID: PMC8595561 DOI: 10.1016/j.bbcan.2021.188588] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/09/2021] [Revised: 05/29/2021] [Accepted: 07/02/2021] [Indexed: 02/01/2023]
Abstract
The recent deluge of genome-wide technologies for the mapping of the epigenome and resulting data in cancer samples has provided the opportunity for gaining insights into and understanding the roles of epigenetic processes in cancer. However, the complexity, high-dimensionality, sparsity, and noise associated with these data pose challenges for extensive integrative analyses. Machine Learning (ML) algorithms are particularly suited for epigenomic data analyses due to their flexibility and ability to learn underlying hidden structures. We will discuss four overlapping but distinct major categories under ML: dimensionality reduction, unsupervised methods, supervised methods, and deep learning (DL). We review the preferred use cases of these algorithms in analyses of cancer epigenomics data with the hope to provide an overview of how ML approaches can be used to explore fundamental questions on the roles of epigenome in cancer biology and medicine.
Affiliation(s)
- Emre Arslan
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America
| | - Jonathan Schulz
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America
| | - Kunal Rai
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America.
|
21
|
Rauer C, Sen N, Waman VP, Abbasian M, Orengo CA. Computational approaches to predict protein functional families and functional sites. Curr Opin Struct Biol 2021; 70:108-122. [PMID: 34225010 DOI: 10.1016/j.sbi.2021.05.012] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 03/29/2021] [Revised: 05/13/2021] [Accepted: 05/25/2021] [Indexed: 01/06/2023]
Abstract
Understanding the mechanisms of protein function is indispensable for many biological applications, such as protein engineering and drug design. However, experimental annotations are sparse, and therefore, theoretical strategies are needed to fill the gap. Here, we present the latest developments in building functional subclassifications of protein superfamilies and using evolutionary conservation to detect functional determinants, for example, catalytic-, binding- and specificity-determining residues important for delineating the functional families. We also briefly review other features exploited for functional site detection and new machine learning strategies for combining multiple features.
Affiliation(s)
- Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Neeladri Sen
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Mahnaz Abbasian
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK.
|
22
|
Hendrix SG, Chang KY, Ryu Z, Xie ZR. DeepDISE: DNA Binding Site Prediction Using a Deep Learning Method. Int J Mol Sci 2021; 22:ijms22115510. [PMID: 34073705 PMCID: PMC8197219 DOI: 10.3390/ijms22115510] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/01/2021] [Revised: 04/30/2021] [Accepted: 05/19/2021] [Indexed: 11/18/2022] Open
Abstract
A new, reliable method for predicting DNA binding sites is essential for future research, because DNA binding sites on DNA-binding proteins provide critical clues for understanding protein function and for drug discovery. However, the current prediction methods for DNA binding sites have relatively poor accuracy. Using the 3D coordinates and atom types of surface protein atoms as input, we trained and tested a deep learning model to predict how likely a voxel on the protein surface is to be a DNA-binding site. Based on three different evaluation datasets, the results show that our model not only outperforms several previous methods on two commonly used datasets, but also performs consistently across the three datasets. The visualized prediction outcomes show that the predicted binding sites are mostly located in the correct regions. We successfully built a deep learning model to predict the DNA binding sites on target proteins. It demonstrates that 3D protein structures plus atom-type information on protein surfaces can be used to predict the potential binding sites on a protein. This approach should be further extended to predict the binding sites of other important biological molecules.
Affiliation(s)
- Samuel Godfrey Hendrix
- Computational Drug Discovery Laboratory, School of Electrical and Computer Engineering, College of Engineering, University of Georgia, Athens, GA 30602, USA; (S.G.H.); (Z.R.)
| | - Kuan Y. Chang
- Department of Computer Science and Engineering, National Taiwan Ocean University, Keelung 202, Taiwan;
| | - Zeezoo Ryu
- Computational Drug Discovery Laboratory, School of Electrical and Computer Engineering, College of Engineering, University of Georgia, Athens, GA 30602, USA; (S.G.H.); (Z.R.)
- Department of Computer Science, Franklin College of Arts and Sciences, University of Georgia, Athens, GA 30602, USA
| | - Zhong-Ru Xie
- Computational Drug Discovery Laboratory, School of Electrical and Computer Engineering, College of Engineering, University of Georgia, Athens, GA 30602, USA; (S.G.H.); (Z.R.)
- Correspondence:
|
23
|
Chen YZ, Wang ZZ, Wang Y, Ying G, Chen Z, Song J. nhKcr: a new bioinformatics tool for predicting crotonylation sites on human nonhistone proteins based on deep learning. Brief Bioinform 2021; 22:6277413. [PMID: 34002774 DOI: 10.1093/bib/bbab146] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 02/22/2021] [Revised: 03/18/2021] [Accepted: 03/25/2021] [Indexed: 12/20/2022] Open
Abstract
Lysine crotonylation (Kcr) is a newly discovered type of protein post-translational modification and has been reported to be involved in various pathophysiological processes. High-resolution mass spectrometry is the primary approach for identification of Kcr sites. However, experimental approaches for identifying Kcr sites are often time-consuming and expensive when compared with computational approaches. To date, several predictors for Kcr site prediction have been developed, most of which are capable of predicting crotonylation sites on either histones alone or mixed histone and nonhistone proteins together. These methods exhibit high diversity in their algorithms, encoding schemes, feature selection techniques and performance assessment strategies. However, none of them was designed for predicting Kcr sites on nonhistone proteins. Therefore, it is desirable to develop an effective predictor for identifying Kcr sites from the large amount of nonhistone sequence data. For this purpose, we first provide a comprehensive review of six methods for predicting crotonylation sites. Second, we develop a novel deep learning-based computational framework termed CNNrgb for Kcr site prediction on nonhistone proteins by integrating different types of features. We benchmark its performance against multiple commonly used machine learning classifiers (including random forest, logitboost, naïve Bayes and logistic regression) by performing both 10-fold cross-validation and an independent test. The results show that the proposed CNNrgb framework achieves the best performance with high computational efficiency on large datasets. Moreover, to facilitate users' efforts to investigate Kcr sites on human nonhistone proteins, we implement an online server called nhKcr and compare it with other existing tools to illustrate the utility and robustness of our method. The nhKcr web server and all the datasets utilized in this study are freely accessible at http://nhKcr.erc.monash.edu/.
Affiliation(s)
- Yong-Zi Chen
- Laboratory of Tumor Cell Biology, Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
| | | | | | - Guoguang Ying
- Laboratory of Tumor Cell Biology in Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
| | - Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Australia
| |
|
24
|
Prasasty VD, Hutagalung RA, Gunadi R, Sofia DY, Rosmalena R, Yazid F, Sinaga E. Prediction of human-Streptococcus pneumoniae protein-protein interactions using logistic regression. Comput Biol Chem 2021; 92:107492. [PMID: 33964803 DOI: 10.1016/j.compbiolchem.2021.107492] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2021] [Accepted: 04/21/2021] [Indexed: 02/07/2023]
Abstract
Streptococcus pneumoniae is a major cause of mortality in children under five years old. In recent years, the emergence of antibiotic-resistant strains of S. pneumoniae increases the threat level of this pathogen. For that reason, the exploration of S. pneumoniae protein virulence factors should be considered in developing new drugs or vaccines, for instance by the analysis of host-pathogen protein-protein interactions (HP-PPIs). In this research, prediction of protein-protein interactions was performed with a logistic regression model with the number of protein domain occurrences as features. By utilizing HP-PPIs of three different pathogens as training data, the model achieved 57-77 % precision, 64-75 % recall, and 96-98 % specificity. Prediction of human-S. pneumoniae protein-protein interactions using the model yielded 5823 interactions involving thirty S. pneumoniae proteins and 324 human proteins. Pathway enrichment analysis showed that most of the pathways involved in the predicted interactions are immune system pathways. Network topology analysis revealed β-galactosidase (BgaA) as the most central among the S. pneumoniae proteins in the predicted HP-PPI networks, with a degree centrality of 1.0 and a betweenness centrality of 0.451853. Further experimental studies are required to validate the predicted interactions and examine their roles in S. pneumoniae infection.
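The three metrics reported in this abstract (precision, recall, specificity) all derive from the same confusion-matrix counts. The sketch below is illustrative only: the function name `binary_metrics` and the counts are hypothetical and are not taken from the study. It shows how specificity can sit well above precision when non-interacting pairs greatly outnumber interacting ones, which is an assumption about class balance that the abstract does not state.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return precision, recall and specificity from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # predicted positives that are real
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # real positives that were found
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # real negatives that were rejected
    return {"precision": precision, "recall": recall, "specificity": specificity}


# Hypothetical counts for an imbalanced set: 100 true interactions, 1000 non-interactions.
metrics = binary_metrics(tp=70, fp=30, tn=970, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these invented counts the same 30 false positives cost 30 percentage points of precision but only 3 points of specificity, because specificity is normalized by the much larger negative class.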
Affiliation(s)
- Vivitri Dewi Prasasty
- Faculty of Biotechnology, Atma Jaya Catholic University of Indonesia, Jakarta, 12930, Indonesia.
| | - Rory Anthony Hutagalung
- Faculty of Biotechnology, Atma Jaya Catholic University of Indonesia, Jakarta, 12930, Indonesia
| | - Reinhart Gunadi
- Department of Biology, Faculty of Life Sciences, Universitas Surya, Tangerang, Banten, 15143, Indonesia
| | - Dewi Yustika Sofia
- Department of Biology, Faculty of Life Sciences, Universitas Surya, Tangerang, Banten, 15143, Indonesia
| | - Rosmalena Rosmalena
- Department of Medical Chemistry, Faculty of Medicine, Universitas Indonesia, Jakarta, 10430, Indonesia
| | - Fatmawaty Yazid
- Department of Medical Chemistry, Faculty of Medicine, Universitas Indonesia, Jakarta, 10430, Indonesia
| | - Ernawati Sinaga
- Faculty of Biology, Universitas Nasional, Jakarta, 12520, Indonesia.
| |
|
25
|
Development of machine learning model for diagnostic disease prediction based on laboratory tests. Sci Rep 2021; 11:7567. [PMID: 33828178 PMCID: PMC8026627 DOI: 10.1038/s41598-021-87171-5] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2020] [Accepted: 03/19/2021] [Indexed: 01/16/2023] Open
Abstract
The use of deep learning and machine learning (ML) in medical science is increasing, particularly in the visual, audio, and language data fields. We aimed to build a new optimized ensemble model by blending a DNN (deep neural network) model with two ML models for disease prediction using laboratory test results. 86 attributes (laboratory tests) were selected from datasets based on value counts, clinical importance-related features, and missing values. We collected sample datasets on 5145 cases, including 326,686 laboratory test results. We investigated a total of 39 specific diseases based on the International Classification of Diseases, 10th revision (ICD-10) codes. These datasets were used to construct light gradient boosting machine (LightGBM) and extreme gradient boosting (XGBoost) ML models and a DNN model using TensorFlow. The optimized ensemble model achieved an F1-score of 81% and prediction accuracy of 92% for the five most common diseases. The deep learning and ML models showed differences in predictive power and disease classification patterns. We used a confusion matrix and analyzed feature importance using the SHAP value method. Our new ML model achieved high efficiency of disease prediction through classification of diseases. This study will be useful in the prediction and diagnosis of diseases.
|
26
|
Zhang Y, Chen P, Gao Y, Ni J, Wang X. DBP-PSSM: Combination of Evolutionary Profiles with the XGBoost Algorithm to Improve the Identification of DNA-binding Proteins. Comb Chem High Throughput Screen 2020; 25:3-12. [PMID: 33238837 DOI: 10.2174/1386207323999201124203531] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2020] [Revised: 10/16/2020] [Accepted: 10/29/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND AND OBJECTIVE: DNA-binding proteins play important roles in a variety of biological processes, such as gene transcription and regulation, DNA replication and repair, DNA recombination and packaging, and the formation of chromatin and ribosomes. Therefore, it is urgent to develop a computational method to improve the recognition efficiency of DNA-binding proteins. METHODS: We proposed a novel method, DBP-PSSM, which constructed the features from amino acid composition and evolutionary information of protein sequences. The maximum relevance, minimum redundancy (mRMR) method was employed to select the optimal features for the XGBoost classifier, and the resulting model for predicting DNA-binding proteins, DBP-PSSM, was established with 5-fold cross-validation on the training dataset. RESULTS: DBP-PSSM achieved an accuracy of 81.18% and an MCC of 0.657 on a test dataset, outperforming many existing methods. These results demonstrated that our method can effectively predict DNA-binding proteins. CONCLUSION: The data and source code are provided at https://github.com/784221489/DNA-binding.
Affiliation(s)
- Yanping Zhang
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038,China
| | - Pengcheng Chen
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038,China
| | - Ya Gao
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038,China
| | - Jianwei Ni
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038,China
| | - Xiaosheng Wang
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038,China
| |
|
27
|
Bianchi J, de Oliveira Ruellas AC, Gonçalves JR, Paniagua B, Prieto JC, Styner M, Li T, Zhu H, Sugai J, Giannobile W, Benavides E, Soki F, Yatabe M, Ashman L, Walker D, Soroushmehr R, Najarian K, Cevidanes LHS. Osteoarthritis of the Temporomandibular Joint can be diagnosed earlier using biomarkers and machine learning. Sci Rep 2020; 10:8012. [PMID: 32415284 PMCID: PMC7228972 DOI: 10.1038/s41598-020-64942-0] [Citation(s) in RCA: 66] [Impact Index Per Article: 13.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2019] [Accepted: 04/21/2020] [Indexed: 12/26/2022] Open
Abstract
After chronic low back pain, Temporomandibular Joint (TMJ) disorders are the second most common musculoskeletal condition, affecting 5 to 12% of the population, with an annual health cost estimated at $4 billion. Chronic disability in TMJ osteoarthritis (OA) increases with aging, and the main goal is to diagnose the condition before morphological degeneration occurs. Here, we address this challenge using advanced data science to capture, process and analyze 52 clinical, biological and high-resolution CBCT (radiomics) markers from TMJ OA patients and controls. We tested the diagnostic performance of four machine learning models: Logistic Regression, Random Forest, LightGBM, XGBoost. Headaches, Range of mouth opening without pain, Energy, Haralick Correlation, Entropy and interactions of TGF-β1 in Saliva and Headaches, VE-cadherin in Serum and Angiogenin in Saliva, VE-cadherin in Saliva and Headaches, PA1 in Saliva and Headaches, PA1 in Saliva and Range of mouth opening without pain; Gender and Muscle Soreness; Short Run Low Grey Level Emphasis and Headaches, Inverse Difference Moment and Trabecular Separation accurately diagnose early stages of this clinical condition. Our results show that the XGBoost + LightGBM model with these features and interactions achieves an accuracy of 0.823, an AUC of 0.870, and an F1-score of 0.823 in diagnosing TMJ OA status. Thus, we expect to boost future studies into osteoarthritis patient-specific therapeutic interventions, and thereby improve the health of articular joints.
Affiliation(s)
- Jonas Bianchi
- University of Michigan, Department of Orthodontics and Pediatric Dentistry, School of Dentistry, Ann Arbor, MI, 48109, USA.
- São Paulo State University (UNESP), Department of Pediatric Dentistry, School of Dentistry, Araraquara, SP, 14801-385, Brazil.
| | | | - João Roberto Gonçalves
- São Paulo State University (UNESP), Department of Pediatric Dentistry, School of Dentistry, Araraquara, SP, 14801-385, Brazil
| | | | - Juan Carlos Prieto
- University of North Carolina, Department of Psychiatry and Computer Science, Chapel Hill, NC, 27516, USA
| | - Martin Styner
- University of North Carolina, Department of Psychiatry and Computer Science, Chapel Hill, NC, 27516, USA
| | - Tengfei Li
- University of North Carolina, Department of Biostatistics, Chapel Hill, NC, 27516, USA
| | - Hongtu Zhu
- University of North Carolina, Department of Biostatistics, Chapel Hill, NC, 27516, USA
| | - James Sugai
- University of Michigan, Department of Periodontics and Oral Medicine, School of Dentistry, Ann Arbor, MI, 48109, USA
| | - William Giannobile
- University of Michigan, Department of Periodontics and Oral Medicine, School of Dentistry, Ann Arbor, MI, 48109, USA
| | - Erika Benavides
- University of Michigan, Department of Periodontics and Oral Medicine, School of Dentistry, Ann Arbor, MI, 48109, USA
| | - Fabiana Soki
- University of Michigan, Department of Periodontics and Oral Medicine, School of Dentistry, Ann Arbor, MI, 48109, USA
| | - Marilia Yatabe
- University of Michigan, Department of Orthodontics and Pediatric Dentistry, School of Dentistry, Ann Arbor, MI, 48109, USA
| | - Lawrence Ashman
- University of Michigan, Department of Oral and Maxillofacial Surgery and Hospital Dentistry, School of Dentistry, Ann Arbor, MI, 48109, USA
| | - David Walker
- University of North Carolina, Department of Orthodontics, Chapel Hill, NC, 27516, USA
| | - Reza Soroushmehr
- University of Michigan, Center for Integrative Research in Critical Care and Michigan Institute for Data Science, Department of Computational Medicine and Bioinformatics, Ann Arbor, MI, 48109, USA
| | - Kayvan Najarian
- University of Michigan, Center for Integrative Research in Critical Care and Michigan Institute for Data Science, Department of Computational Medicine and Bioinformatics, Ann Arbor, MI, 48109, USA
| | - Lucia Helena Soares Cevidanes
- University of Michigan, Department of Orthodontics and Pediatric Dentistry, School of Dentistry, Ann Arbor, MI, 48109, USA
| |
|
28
|
Li F, Chen J, Ge Z, Wen Y, Yue Y, Hayashida M, Baggag A, Bensmail H, Song J. Computational prediction and interpretation of both general and specific types of promoters in Escherichia coli by exploiting a stacked ensemble-learning framework. Brief Bioinform 2020; 22:2126-2140. [PMID: 32363397 DOI: 10.1093/bib/bbaa049] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2019] [Revised: 02/25/2020] [Accepted: 03/11/2020] [Indexed: 12/12/2022] Open
Abstract
Promoters are short consensus sequences of DNA, which are responsible for transcription activation or the repression of all genes. There are many types of promoters in bacteria with important roles in initiating gene transcription. Therefore, solving promoter-identification problems has important implications for improving the understanding of their functions. To this end, computational methods targeting promoter classification have been established; however, their performance remains unsatisfactory. In this study, we present a novel stacked-ensemble approach (termed SELECTOR) for identifying both promoters and their respective classification. SELECTOR combines the composition of k-spaced nucleic acid pairs, parallel correlation pseudo-dinucleotide composition, position-specific trinucleotide propensity based on single-strand, and DNA strand features, and uses five popular tree-based ensemble learning algorithms to build a stacked model. Both 5-fold cross-validation tests using benchmark datasets and independent tests using the newly collected independent test dataset showed that SELECTOR outperformed state-of-the-art methods in both general and specific types of promoter prediction in Escherichia coli. Furthermore, this novel framework provides essential interpretations that aid understanding of model success by leveraging the powerful Shapley Additive exPlanation algorithm, thereby highlighting the most important features relevant for predicting both general and specific types of promoters and overcoming the limitations of existing 'Black-box' approaches that are unable to reveal causal relationships from large amounts of initially encoded features.
Affiliation(s)
- Fuyi Li
- Northwest A&F University, China; Department of Biochemistry and Molecular Biology and the Infection and Immunity Program, Biomedicine Discovery Institute, Monash University, Australia
| | - Jinxiang Chen
- Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University from the College of Information Engineering, Northwest A&F University, China
| | - Zongyuan Ge
- Monash University and also serves as a Deep Learning Specialist at NVIDIA AI Technology Centre. Before joining Monash, he was a research scientist at IBM Research Australia doing research in medical AI during 2016-2018. His research interests are AI, computer vision, medical image, robotics and deep learning
| | - Ya Wen
- computer technology from Ningxia University, China
| | - Yanwei Yue
- medical science from Southern Medical University, China
| | - Morihiro Hayashida
- informatics from Kyoto University, Japan, in 2005. He is an Assistant Professor in the Department of Electrical Engineering and Computer Science, National Institute of Technology, Matsue College, Japan
| | - Abdelkader Baggag
- computer science from the University of Minnesota. He is a Senior Scientist at the Qatar Computing Research Institute (QCRI) and has a joint appointment as an Associate Professor at Hamad Bin Khalifa University (HBKU) in the Division of Information and Computing Technology. His research interests include data mining, linear algebra and machine learning
| | - Halima Bensmail
- University of Pierre & Marie Currie (Paris 6) in France. She is currently a Principal Scientist at QCRI-HBKU and a joint Associate Professor at the College of Computer and Science Engineering, HBKU
| | - Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Australia. He is also affiliated with the Monash Centre for Data Science, Faculty of Information Technology, Monash University. His research interests include bioinformatics, computational biology, machine learning, data mining, and pattern recognition
| |
|
29
|
Qiu H, Luo L, Su Z, Zhou L, Wang L, Chen Y. Machine learning approaches to predict peak demand days of cardiovascular admissions considering environmental exposure. BMC Med Inform Decis Mak 2020; 20:83. [PMID: 32357880 PMCID: PMC7195717 DOI: 10.1186/s12911-020-1101-8] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2019] [Accepted: 04/23/2020] [Indexed: 02/05/2023] Open
Abstract
Background: Accumulating evidence has linked environmental exposure, such as ambient air pollution and meteorological factors, to the development and severity of cardiovascular diseases (CVDs), resulting in increased healthcare demand. Effective prediction of demand for healthcare services, particularly those associated with peak events of CVDs, can be useful in optimizing the allocation of medical resources. However, few studies have attempted to adopt machine learning approaches with excellent predictive abilities to forecast the healthcare demand for CVDs. This study aims to develop and compare several machine learning models in predicting the peak demand days of CVDs admissions using the hospital admissions data, air quality data and meteorological data in Chengdu, China from 2015 to 2017. Methods: Six machine learning algorithms, including logistic regression (LR), support vector machine (SVM), artificial neural network (ANN), random forest (RF), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM) were applied to build the predictive models with a unique feature set. The area under a receiver operating characteristic curve (AUC), logarithmic loss function, accuracy, sensitivity, specificity, precision, and F1 score were used to evaluate the predictive performances of the six models. Results: The LightGBM model exhibited the highest AUC (0.940, 95% CI: 0.900–0.980), which was significantly higher than that of LR (0.842, 95% CI: 0.783–0.901), SVM (0.834, 95% CI: 0.774–0.894) and ANN (0.890, 95% CI: 0.836–0.944), but did not differ significantly from that of RF (0.926, 95% CI: 0.879–0.974) and XGBoost (0.930, 95% CI: 0.878–0.982). In addition, the LightGBM has the optimal logarithmic loss function (0.218), accuracy (91.3%), specificity (94.1%), precision (0.695), and F1 score (0.725). Feature importance identification indicated that the contribution rate of meteorological conditions and air pollutants for the prediction was 32 and 43%, respectively. Conclusion: This study suggests that ensemble learning models, especially the LightGBM model, can be used to effectively predict the peak events of CVDs admissions, and therefore could be a very useful decision-making tool for medical resource management.
Affiliation(s)
- Hang Qiu
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, No.2006, Xiyuan Ave, West Hi-Tech Zone, 611731, Chengdu, Sichuan, P.R. China; Big Data Research Center, University of Electronic Science and Technology of China, Chengdu, China.
| | - Lin Luo
- Big Data Research Center, University of Electronic Science and Technology of China, Chengdu, China
| | - Ziqi Su
- Department of Statistics, Faculty of Science, University of British Columbia, Vancouver, Canada
| | - Li Zhou
- Health Information Center of Sichuan Province, Chengdu, China
| | - Liya Wang
- Big Data Research Center, University of Electronic Science and Technology of China, Chengdu, China
| | - Yucheng Chen
- Cardiology Division, West China Hospital, Sichuan University, Chengdu, China; West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
| |
|
30
|
Hyland SL, Faltys M, Hüser M, Lyu X, Gumbsch T, Esteban C, Bock C, Horn M, Moor M, Rieck B, Zimmermann M, Bodenham D, Borgwardt K, Rätsch G, Merz TM. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat Med 2020; 26:364-373. [DOI: 10.1038/s41591-020-0789-4] [Citation(s) in RCA: 196] [Impact Index Per Article: 39.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2019] [Accepted: 02/04/2020] [Indexed: 01/12/2023]
|
31
|
PreDBA: A heterogeneous ensemble approach for predicting protein-DNA binding affinity. Sci Rep 2020; 10:1278. [PMID: 31992738 PMCID: PMC6987227 DOI: 10.1038/s41598-020-57778-1] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2019] [Accepted: 01/06/2020] [Indexed: 11/17/2022] Open
Abstract
The interaction between protein and DNA plays an essential role in various critical natural processes, such as DNA replication, transcription, splicing, and repair. Studying the binding affinity of proteins to DNA helps to understand the recognition mechanism of protein-DNA complexes. Since experimentally measured protein-DNA binding affinity data still have many limitations, accurate and reliable computational methods are required. We therefore put forward a computational approach in this paper, called PreDBA, that can effectively predict protein-DNA binding affinity by using heterogeneous ensemble models. One hundred protein-DNA complexes were manually collected from the related literature as a data set for protein-DNA binding affinity. Then, 52 sequence and structural features were obtained, and the correlation between these 52 features and protein-DNA binding affinity was calculated. Furthermore, we found that the protein-DNA binding affinity is affected by the DNA structure in the complex. We classified all protein-DNA complexes into five categories based on the DNA structure bound by the proteins in the complexes. In each category, a stacked heterogeneous ensemble model was constructed based on the obtained features. Finally, based on the binding affinity data set, we used leave-one-out cross-validation to evaluate the proposed method comprehensively. In the five categories, the Pearson correlation coefficient values of our proposed method range from 0.735 to 0.926. We have demonstrated the advantages of the proposed method compared to other machine learning methods and existing protein-DNA binding affinity prediction approaches.
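The evaluation protocol named in this abstract, leave-one-out cross-validation scored by the Pearson correlation between predicted and measured values, can be sketched as below. This is a minimal illustration under stated assumptions, not the PreDBA implementation: a one-feature least-squares line stands in for the stacked ensemble, and the helper names and toy numbers are invented for the example.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def loo_predictions(xs, ys):
    """Predict each sample with a model trained on all the other samples."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Toy data (hypothetical): one feature value and one "affinity" per complex.
features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
affinities = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
held_out = loo_predictions(features, affinities)
print(f"LOO Pearson r = {pearson(held_out, affinities):.3f}")
```

Leave-one-out is a natural choice for a data set of about one hundred samples split into five groups, since every sample is used for testing exactly once while the training set stays as large as possible.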
|
32
|
Real-time automatic surgical phase recognition in laparoscopic sigmoidectomy using the convolutional neural network-based deep learning approach. Surg Endosc 2019; 34:4924-4931. [PMID: 31797047 DOI: 10.1007/s00464-019-07281-0] [Citation(s) in RCA: 78] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2019] [Accepted: 11/23/2019] [Indexed: 02/06/2023]
Abstract
BACKGROUND: Automatic surgical workflow recognition is a key component for developing the context-aware computer-assisted surgery (CA-CAS) systems. However, automatic surgical phase recognition focused on colorectal surgery has not been reported. We aimed to develop a deep learning model for automatic surgical phase recognition based on laparoscopic sigmoidectomy (Lap-S) videos, which could be used for real-time phase recognition, and to clarify the accuracies of the automatic surgical phase and action recognitions using visual information. METHODS: The dataset used contained 71 cases of Lap-S. The video data were divided into frame units every 1/30 s as static images. Every Lap-S video was manually divided into 11 surgical phases (Phases 0-10) and manually annotated for each surgical action on every frame. The model was generated based on the training data. Validation of the model was performed on a set of unseen test data. Convolutional neural network (CNN)-based deep learning was also used. RESULTS: The average surgical time was 175 min (± 43 min SD), with the individual surgical phases also showing high variations in the duration between cases. Each surgery started in the first phase (Phase 0) and ended in the last phase (Phase 10), and phase transitions occurred 14 (± 2 SD) times per procedure on an average. The accuracy of the automatic surgical phase recognition was 91.9% and those for the automatic surgical action recognition of extracorporeal action and irrigation were 89.4% and 82.5%, respectively. Moreover, this system could perform real-time automatic surgical phase recognition at 32 fps. CONCLUSIONS: The CNN-based deep learning approach enabled the recognition of surgical phases and actions in 71 Lap-S cases based on manually annotated data. This system could perform automatic surgical phase recognition and automatic target surgical action recognition with high accuracy. Moreover, this study showed the feasibility of real-time automatic surgical phase recognition with high frame rate.
|