1
|
Smith VC, Gonzalez Hernandez F, Wattanakul T, Chotsiri P, Cordero JA, Ballester MR, Duran M, Fanlo Escudero O, Lilaonitkul W, Standing JF, Kloprogge F. An automated classification pipeline for tables in pharmacokinetic literature. Sci Rep 2025; 15:10071. [PMID: 40128567 PMCID: PMC11933424 DOI: 10.1038/s41598-025-94778-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2024] [Accepted: 03/17/2025] [Indexed: 03/26/2025] Open
Abstract
Pharmacokinetic (PK) models are essential for optimising drug candidate selection and dosing regimens in drug development. Preclinical and population PK models benefit from integrating prior knowledge from existing compounds. While tables in scientific literature contain comprehensive prior PK data and critical contextual information, the lack of automated extraction tools forces researchers to manually curate datasets, limiting efficiency and scalability. This study addresses this gap by focusing on the crucial first step of PK table mining: automatically identifying tables containing in vivo PK parameters and study population characteristics. To this end, an expert-annotated corpus of 2640 tables from PK literature was developed and used to train a supervised classification pipeline. The pipeline integrates diverse table features and representations, with GPT-4 refining predictions in uncertain cases. The resulting model achieved F1 scores exceeding 96% across all classes. The pipeline was applied to PK papers from PubMed Central Open-Access, with results integrated into the PK paper search tool at www.pkpdai.com. This work establishes a foundational step towards automating PK table data extraction and streamlining dataset curation. The corpus and code are openly available.
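The refinement step mentioned in the abstract, where a second model is consulted only for uncertain predictions, can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the class labels, the 0.8 confidence threshold, and the `fallback` callable (standing in for a GPT-4 call) are all assumptions made for the example.

```python
from typing import Callable, Dict


def classify_with_fallback(
    probs: Dict[str, float],
    fallback: Callable[[], str],
    threshold: float = 0.8,  # hypothetical confidence cutoff
) -> str:
    """Return the top-scoring label, or defer to `fallback` when uncertain."""
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return label
    # Low confidence: hand the table to the slower second-stage model.
    return fallback()


# Hypothetical class names; each lambda stands in for a second-stage model call.
confident = classify_with_fallback(
    {"pk_parameters": 0.95, "demographics": 0.03, "other": 0.02},
    fallback=lambda: "other",
)
uncertain = classify_with_fallback(
    {"pk_parameters": 0.45, "demographics": 0.40, "other": 0.15},
    fallback=lambda: "demographics",
)
print(confident, uncertain)
```

Routing only low-confidence cases to the second stage keeps the expensive model off the majority of tables, which is the usual motivation for this kind of design.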
Affiliation(s)
- Victoria C Smith
- Institute of Health Informatics, University College London, London, UK.
- Great Ormond Street Institute for Child Health, University College London, London, UK.
| | | | - Thanaporn Wattanakul
- Mahidol Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Bangkok, Thailand
| | - Palang Chotsiri
- Clinical Pharmacology, Modelling and Simulation, Parexel International, Bangkok, Thailand
| | | | - Maria Rosa Ballester
- Blanquerna School of Health Sciences, Ramon Llull University, Barcelona, Spain
- Institut de Recerca Sant Pau Barcelona, Barcelona, Spain
| | - Màrius Duran
- Blanquerna School of Health Sciences, Ramon Llull University, Barcelona, Spain
| | - Olga Fanlo Escudero
- Blanquerna School of Health Sciences, Ramon Llull University, Barcelona, Spain
| | | | - Joseph F Standing
- Great Ormond Street Institute for Child Health, University College London, London, UK
- Department of Pharmacy, Great Ormond Street Hospital for Children, London, UK
| | - Frank Kloprogge
- Institute for Global Health, University College London, London, UK
| |
|
2
|
Nehmeh B, Rebehmed J, Nehmeh R, Taleb R, Akoury E. Unlocking therapeutic frontiers: harnessing artificial intelligence in drug discovery for neurodegenerative diseases. Drug Discov Today 2024; 29:104216. [PMID: 39428082 DOI: 10.1016/j.drudis.2024.104216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2024] [Revised: 10/05/2024] [Accepted: 10/15/2024] [Indexed: 10/22/2024]
Abstract
Neurodegenerative diseases (NDs) pose serious healthcare challenges with limited therapeutic treatments and high social burdens. The integration of artificial intelligence (AI) into drug discovery has emerged as a promising approach to address these challenges. This review explores the application of AI techniques to unravel therapeutic frontiers for NDs. We examine the current landscape of AI-driven drug discovery and discuss the potential of AI to accelerate the identification of novel therapeutic targets in ND research and drug development, optimize drug candidates, and expedite personalized medicine approaches. Finally, we outline future directions and challenges in harnessing AI for the advancement of therapeutics in this critical area, emphasizing the importance of interdisciplinary collaboration and ethical considerations.
Affiliation(s)
- Bilal Nehmeh
- Department of Physical Sciences, Lebanese American University, Beirut 1102-2801, Lebanon
| | - Joseph Rebehmed
- Department of Computer Science and Mathematics, Lebanese American University, Beirut 1102-2801, Lebanon
| | - Riham Nehmeh
- INSA Rennes, Institut d'électronique et de Télécommunications de Rennes IETR, UMR 6164, 35708 Rennes, France
| | - Robin Taleb
- Department of Physical Sciences, Lebanese American University, Byblos Campus, Blat, 4M8F+6QF, Lebanon
| | - Elias Akoury
- Department of Physical Sciences, Lebanese American University, Beirut 1102-2801, Lebanon.
| |
|
3
|
Li H, Su D, Zhang X, He Y, Luo X, Xiong Y, Zou M, Wei H, Wen S, Xi Q, Zuo Y, Yang L. Machine learning-based prediction of diabetic patients using blood routine data. Methods 2024; 229:156-162. [PMID: 39019099 DOI: 10.1016/j.ymeth.2024.07.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Revised: 06/23/2024] [Accepted: 07/10/2024] [Indexed: 07/19/2024] Open
Abstract
Diabetes stands as one of the most prevalent chronic diseases globally. The conventional methods for diagnosing diabetes are frequently overlooked until individuals manifest noticeable symptoms of the condition. This study aimed to address this gap by collecting comprehensive datasets, including 1000 instances of blood routine data from diabetes patients and an equivalent dataset from healthy individuals. To differentiate diabetes patients from their healthy counterparts, a computational framework was established, encompassing eXtreme Gradient Boosting (XGBoost), random forest, support vector machine, and elastic net algorithms. Notably, the XGBoost model emerged as the most effective, exhibiting superior predictive results with an area under the receiver operating characteristic curve (AUC) of 99.90% in the training set and 98.51% in the testing set. Moreover, the model showcased commendable performance during external validation, achieving an overall accuracy of 81.54%. The probability generated by the model serves as a risk score for diabetes susceptibility. Further interpretability was achieved through the utilization of the Shapley additive explanations (SHAP) algorithm, identifying pivotal indicators such as mean corpuscular hemoglobin concentration (MCHC), lymphocyte ratio (LY%), standard deviation of red blood cell distribution width (RDW-SD), and mean corpuscular hemoglobin (MCH). This enhances our understanding of the predictive mechanisms underlying diabetes. To facilitate the application in clinical and real-life settings, a nomogram was created based on the logistic regression algorithm, which can provide a preliminary assessment of the likelihood of an individual having diabetes. Overall, this research contributes valuable insights into the predictive modeling of diabetes, offering potential applications in clinical practice for more effective and timely diagnoses.
Affiliation(s)
- Honghao Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Dongqing Su
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Xinpeng Zhang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Yuanyuan He
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Xu Luo
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Yuqiang Xiong
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Min Zou
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Huiyan Wei
- Biotechnology Experimental Center, Harbin Medical University, Harbin 150081, China
| | - Shaoran Wen
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
| | - Qilemuge Xi
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
| | - Yongchun Zuo
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China; Inner Mongolia International Mongolian Hospital, Hohhot 010065, China.
| | - Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China.
| |
|
4
|
Saha S, Chatterjee P, Basu S, Nasipuri M. EPI-SF: essential protein identification in protein interaction networks using sequence features. PeerJ 2024; 12:e17010. [PMID: 38495766 PMCID: PMC10944162 DOI: 10.7717/peerj.17010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2023] [Accepted: 02/05/2024] [Indexed: 03/19/2024] Open
Abstract
Proteins are considered indispensable for facilitating an organism's viability, reproductive capabilities, and other fundamental physiological functions. Conventional biological assays for identifying essential proteins are slow, labor-intensive, and expensive. Therefore, computational methods are widely regarded as the most expeditious and effective approach to discerning essential proteins. Despite being a popular choice in machine learning (ML) applications, the deep learning (DL) method is not suggested for this specific research work based on sequence features due to the restricted availability of high-quality training sets of positive and negative samples. However, some recent DL work has addressed limited data availability, which is left as future work. Conventional ML techniques are thus utilized in this work due to their superior performance compared to DL methodologies. In consideration of the aforementioned, a technique called EPI-SF is proposed here, which employs ML to identify essential proteins within the protein-protein interaction network (PPIN). The protein sequence is the primary determinant of protein structure and function. So, initially, relevant protein sequence features are extracted from the proteins within the PPIN. These features are subsequently utilized as input for various machine learning models, including XGBoost Classifier, AdaBoost Classifier, logistic regression (LR), support vector classification (SVM), Decision Tree model (DT), Random Forest model (RF), and Naïve Bayes model (NB). The objective is to detect the essential proteins within the PPIN. The primary investigation, conducted on yeast, examined the performance of various ML models on the yeast PPIN.
Among these models, the RF model was the most effective, with precision, recall, F1-score, and AUC values of 0.703, 0.720, 0.711, and 0.745, respectively. It also performs better than other state-of-the-art methods based on traditional centrality measures, such as betweenness centrality (BC) and closeness centrality (CC), and than deep learning methods such as DeepEP, as emphasized in the results section. As a result of its favorable performance, EPI-SF is later employed for the prediction of novel essential proteins inside the human PPIN. Because viruses tend to selectively target essential proteins involved in the transmission of diseases within the human PPIN, investigations are conducted to assess the probable involvement of these proteins in COVID-19 and other related severe diseases.
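As a quick consistency check on the reported RF metrics, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall quoted in the abstract; the function name is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the RF model in the abstract.
rf_f1 = f1_score(0.703, 0.720)
print(round(rf_f1, 3))
```

Rounded to three decimals, the result agrees with the F1-score of 0.711 given in the abstract.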
Affiliation(s)
- Sovan Saha
- Department of Computer Science & Engineering (Artificial Intelligence & Machine Learning), Techno Main Salt Lake, Kolkata, West Bengal, India
| | - Piyali Chatterjee
- Department of Computer Science & Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
| | - Subhadip Basu
- Department of Computer Science & Engineering, Jadavpur University, Kolkata, West Bengal, India
| | - Mita Nasipuri
- Department of Computer Science & Engineering, Jadavpur University, Kolkata, West Bengal, India
| |
|
5
|
Xie L, Xie Y, Wu Q, He J, Lin X, Qiu Z, Chen L. A predictive model for postoperative adverse outcomes following surgical treatment of acute type A aortic dissection based on machine learning. J Clin Hypertens (Greenwich) 2024; 26:251-261. [PMID: 38341621 PMCID: PMC10918704 DOI: 10.1111/jch.14774] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/08/2023] [Revised: 12/10/2023] [Accepted: 12/17/2023] [Indexed: 02/12/2024]
Abstract
Acute type A aortic dissection (AAAD) has a high probability of postoperative adverse outcomes (PAO) after emergency surgery, so exploring the risk factors for PAO during hospitalization is key to reducing postoperative mortality and improving prognosis. An artificial intelligence approach was used to build a predictive model of PAO by clinical data-driven machine learning to predict the incidence of PAO after total arch repair for AAAD. This study included 380 patients with AAAD. The clinical features associated with PAO were selected using LASSO regression analysis. Six different machine learning algorithms were tried for modeling, and the performance of each model was analyzed comprehensively using receiver operating characteristic curves, calibration curves, precision-recall curves, and decision analysis curves. The optimal model was explained through Shapley Additive Explanation (SHAP) and used to perform an individualized risk assessment. After comprehensive analysis, the authors believe that the extreme gradient boosting (XGBoost) model is the optimal model, with better performance than other models. The authors successfully built a prediction model for PAO in AAAD patients based on the XGBoost algorithm and interpreted the model with the SHAP method, which helps to identify high-risk AAAD patients at an early stage and to adjust individual patient-related clinical treatment plans in a timely manner.
Affiliation(s)
- Lin‐feng Xie
- Department of Cardiovascular Surgery, Fujian Medical University Union Hospital, Fuzhou, Fujian, P.R. China
- Key Laboratory of Cardio‐Thoracic Surgery, Fujian Province University, Fuzhou, Fujian, P.R. China
- Fujian Provincial Center for Cardiovascular Medicine, Fuzhou, Fujian, P.R. China
| | - Yu‐ling Xie
- Department of Cardiovascular Surgery, Fujian Medical University Union Hospital, Fuzhou, Fujian, P.R. China
- Key Laboratory of Cardio‐Thoracic Surgery, Fujian Province University, Fuzhou, Fujian, P.R. China
- Fujian Provincial Center for Cardiovascular Medicine, Fuzhou, Fujian, P.R. China
| | - Qing‐song Wu
- Department of Cardiovascular Surgery, Fujian Medical University Union Hospital, Fuzhou, Fujian, P.R. China
- Key Laboratory of Cardio‐Thoracic Surgery, Fujian Province University, Fuzhou, Fujian, P.R. China
- Fujian Provincial Center for Cardiovascular Medicine, Fuzhou, Fujian, P.R. China
| | - Jian He
- Department of Cardiovascular Surgery, Fujian Medical University Union Hospital, Fuzhou, Fujian, P.R. China
- Key Laboratory of Cardio‐Thoracic Surgery, Fujian Province University, Fuzhou, Fujian, P.R. China
- Fujian Provincial Center for Cardiovascular Medicine, Fuzhou, Fujian, P.R. China
| | - Xin‐fan Lin
- Department of Cardiovascular Surgery, Fujian Medical University Union Hospital, Fuzhou, Fujian, P.R. China
- Key Laboratory of Cardio‐Thoracic Surgery, Fujian Province University, Fuzhou, Fujian, P.R. China
- Fujian Provincial Center for Cardiovascular Medicine, Fuzhou, Fujian, P.R. China
| | - Zhi‐huang Qiu
- Department of Cardiovascular Surgery, Fujian Medical University Union Hospital, Fuzhou, Fujian, P.R. China
- Key Laboratory of Cardio‐Thoracic Surgery, Fujian Province University, Fuzhou, Fujian, P.R. China
- Fujian Provincial Center for Cardiovascular Medicine, Fuzhou, Fujian, P.R. China
| | - Liang‐wan Chen
- Department of Cardiovascular Surgery, Fujian Medical University Union Hospital, Fuzhou, Fujian, P.R. China
- Key Laboratory of Cardio‐Thoracic Surgery, Fujian Province University, Fuzhou, Fujian, P.R. China
- Fujian Provincial Center for Cardiovascular Medicine, Fuzhou, Fujian, P.R. China
| |
|
6
|
Ye C, Wu Q, Chen S, Zhang X, Xu W, Wu Y, Zhang Y, Yue Y. ECDEP: identifying essential proteins based on evolutionary community discovery and subcellular localization. BMC Genomics 2024; 25:117. [PMID: 38279081 PMCID: PMC10821549 DOI: 10.1186/s12864-024-10019-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Accepted: 01/15/2024] [Indexed: 01/28/2024] Open
Abstract
BACKGROUND In cellular activities, essential proteins play a vital role and are instrumental in comprehending fundamental biological necessities and identifying pathogenic genes. Current deep learning approaches for predicting essential proteins underutilize the potential of gene expression data and are inadequate for the exploration of dynamic networks with limited evaluation across diverse species. RESULTS We introduce ECDEP, an essential protein identification model based on evolutionary community discovery. ECDEP integrates temporal gene expression data with a protein-protein interaction (PPI) network and employs the 3-Sigma rule to eliminate outliers at each time point, constructing a dynamic network. Next, we utilize edge birth and death information to establish an interaction streaming source to feed into the evolutionary community discovery algorithm and then identify overlapping communities during the evolution of the dynamic network. SVM recursive feature elimination (RFE) is applied to extract the most informative communities, which are combined with subcellular localization data for classification predictions. We assess the performance of ECDEP by comparing it against ten centrality methods, four shallow machine learning methods with RFE, and two deep learning methods that incorporate multiple biological data sources on Saccharomyces cerevisiae (S. cerevisiae), Homo sapiens (H. sapiens), Mus musculus, and Caenorhabditis elegans. ECDEP achieves an AP value of 0.86 on the H. sapiens dataset and the contribution ratio of community features in classification reaches 0.54 on the S. cerevisiae (Krogan) dataset. CONCLUSIONS Our proposed method adeptly integrates network dynamics and yields outstanding results across various datasets. Furthermore, the incorporation of evolutionary community discovery algorithms amplifies the capacity of gene expression data in classification.
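One possible reading of the 3-Sigma step is sketched below: at a single time point, proteins whose expression lies outside mean ± 3σ are dropped, and a PPI edge survives in that snapshot only if both endpoints remain. The abstract does not give the exact rule, so the filtering criterion, the function names, and the toy data are assumptions for illustration rather than the ECDEP implementation.

```python
from statistics import mean, pstdev
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def within_k_sigma(expression: Dict[str, float], k: float = 3.0) -> Set[str]:
    """Proteins whose expression lies within mean +/- k*sigma at one time point."""
    values = list(expression.values())
    mu, sigma = mean(values), pstdev(values)
    return {p for p, x in expression.items() if abs(x - mu) <= k * sigma}


def snapshot(edges: List[Edge], expression: Dict[str, float]) -> List[Edge]:
    """Keep an edge only if both endpoints pass the k-sigma filter."""
    kept = within_k_sigma(expression)
    return [(u, v) for u, v in edges if u in kept and v in kept]


# Toy data (hypothetical): eleven proteins at baseline, one extreme value.
ppi = [("P0", "P1"), ("P1", "P11"), ("P2", "P3")]
expr_t0 = {f"P{i}": 1.0 for i in range(11)}
expr_t0["P11"] = 100.0

edges_t0 = snapshot(ppi, expr_t0)
print(edges_t0)
```

Repeating `snapshot` for each time point would give a sequence of edge sets, from which edge births and deaths between consecutive snapshots could be derived.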
Affiliation(s)
- Chen Ye
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
| | - Qi Wu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
| | - Shuxia Chen
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
| | - Xuemei Zhang
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
| | - Wenwen Xu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
| | - Yunzhi Wu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
| | - Youhua Zhang
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China
| | - Yi Yue
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, 230036, China.
- Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei, 230036, China.
| |
|
7
|
Arif M, Fang G, Fida H, Musleh S, Yu DJ, Alam T. iMRSAPred: Improved Prediction of Anti-MRSA Peptides Using Physicochemical and Pairwise Contact-Energy Properties of Amino Acids. ACS OMEGA 2024; 9:2874-2883. [PMID: 38250405 PMCID: PMC10795061 DOI: 10.1021/acsomega.3c08303] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2023] [Revised: 12/06/2023] [Accepted: 12/13/2023] [Indexed: 01/23/2024]
Abstract
Methicillin-resistant Staphylococcus aureus (MRSA) is a growing concern for human lives worldwide. Anti-MRSA peptides act as potential antibiotic agents and play a significant role in combating MRSA infection. Traditional laboratory-based methods for annotating Anti-MRSA peptides, although precise, are challenging, costly, and time-consuming. Therefore, computational methods capable of identifying Anti-MRSA peptides accelerate the drug design process for treating bacterial infections. In this study, we developed a novel sequence-based predictor "iMRSAPred" for screening Anti-MRSA peptides by incorporating energy estimation and physicochemical and sequential information. We resolved the class imbalance by using the synthetic minority oversampling technique plus Tomek link (SMOTETomek) algorithm. Furthermore, the Shapley additive explanation method was leveraged to analyze the impact of top-ranked features in the prediction task. We evaluated multiple machine learning algorithms, i.e., CatBoost, Cascade Deep Forest, Kernel and Tree Boosting, support vector machine, and HistGBoost classifiers by 10-fold cross-validation and independent testing. The proposed iMRSAPred method significantly improved the overall performance in terms of accuracy and Matthews correlation coefficient (MCC) by 5.45 and 0.083%, respectively, on the training data set. On the independent data set, iMRSAPred improved accuracy and MCC by 3.98 and 0.055%, respectively. We believe that the proposed method would be useful in large-scale Anti-MRSA peptide prediction and provide insights into other bioactive peptides.
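For readers less familiar with the two metrics quoted above, the following sketch computes accuracy and the Matthews correlation coefficient from a binary confusion matrix. The counts are made up for illustration and are not taken from the paper.

```python
from math import sqrt


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0 when any margin is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical confusion matrix for a peptide classifier.
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
score = mcc(tp=40, tn=45, fp=5, fn=10)
print(acc, round(score, 3))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix and ranges from -1 to 1, which is why it is often reported alongside accuracy when classes are imbalanced.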
Affiliation(s)
- Muhammad Arif
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
| | - Ge Fang
- State Key Laboratory for Organic Electronics and Information Displays, Institute of Advanced Materials (IAM), Nanjing University of Posts Telecommunications, 9 Wenyuan Road, Nanjing 210023, P. R. China
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
| | - Huma Fida
- Department of Microbiology, Abdul Wali Khan University, Mardan 23200, KPK, Pakistan
| | - Saleh Musleh
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210023, China
| | - Tanvir Alam
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
| |
|
8
|
Payra AK, Saha B, Ghosh A. MEM-FET: Essential protein prediction using membership feature and machine learning approach. Proteins 2024; 92:60-75. [PMID: 37638618 DOI: 10.1002/prot.26577] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2022] [Revised: 02/21/2023] [Accepted: 08/08/2023] [Indexed: 08/29/2023]
Abstract
Proteins play key roles in different functionalities in our daily life. The functional roles of a protein are somewhat enhanced through interaction compared with the protein acting alone. Identifying the essential proteins of an organism through wet-lab observation is a time-consuming and costly task, although wet-lab results ensure high reliability and accuracy on biological grounds. Essential protein prediction using computational approaches is an alternative choice in research. It is rapidly proving its significance and effectively reduces the experimental cost of wet-lab work. Existing computational methods were implemented using protein interaction networks (PPIN), sequence, Gene Expression Dataset (GED), Gene Ontology (GO), orthologous groups, and subcellular localization datasets. Machine learning offers diverse categories of features that make it possible to model and predict essential macromolecules of understudied organisms. A novel methodology, MEM-FET (membership feature), is proposed based on features, that is, edge clustering coefficient, average clustering coefficient, subcellular localization, and Gene Ontology within a compartment of common neighbors. The accuracy (ACC) values of the predicted true positive (TP) essential proteins are 0.79, 0.74, 0.78, and 0.71 for the YHQ, YMIPS, YDIP, and YMBD datasets. An enriched set of essential proteins is also predicted using the MEM-FET algorithm. Ensemble ML also validated the proposed model with an accuracy of 60%. The MEM-FET algorithm is reported to outperform other existing algorithms, with an ACC value of 80% for the yeast dataset.
Affiliation(s)
- Anjan Kumar Payra
- Department of Computer Science and Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, Kolkata, India
| | - Banani Saha
- Department of Computer Science and Engineering, University of Calcutta, Kolkata, India
| | - Anupam Ghosh
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
| |
|
9
|
Zou M, Li H, Su D, Xiong Y, Wei H, Wang S, Sun H, Wang T, Xi Q, Zuo Y, Yang L. Integrating somatic mutation profiles with structural deep clustering network for metabolic stratification in pancreatic cancer: a comprehensive analysis of prognostic and genomic landscapes. Brief Bioinform 2023; 25:bbad430. [PMID: 38040491 PMCID: PMC10783866 DOI: 10.1093/bib/bbad430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Revised: 09/29/2023] [Accepted: 11/05/2023] [Indexed: 12/03/2023] Open
Abstract
Pancreatic cancer is a globally recognized highly aggressive malignancy, posing a significant threat to human health and characterized by pronounced heterogeneity. In recent years, researchers have uncovered that the development and progression of cancer are often attributed to the accumulation of somatic mutations within cells. However, cancer somatic mutation data exhibit characteristics such as high dimensionality and sparsity, which pose new challenges in utilizing these data effectively. In this study, we propagated the discrete somatic mutation data of pancreatic cancer through a network propagation model based on protein-protein interaction networks. This resulted in smoothed somatic mutation profile data that incorporate protein network information. Based on this smoothed mutation profile data, we obtained the activity levels of different metabolic pathways in pancreatic cancer patients. Subsequently, using the activity levels of various metabolic pathways in cancer patients, we employed a deep clustering algorithm to establish biologically and clinically relevant metabolic subtypes of pancreatic cancer. Our study holds scientific significance in classifying pancreatic cancer based on somatic mutation data and may provide a crucial theoretical basis for the diagnosis and immunotherapy of pancreatic cancer patients.
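The smoothing of sparse mutation data over a protein network can be illustrated with a random walk with restart, a common form of network propagation. The abstract does not state which propagation model was used, so the update rule, the restart weight `alpha = 0.7`, and the four-node toy network below are assumptions chosen only to show the idea.

```python
from typing import Dict, List


def propagate(
    adj: Dict[str, List[str]],
    seed: Dict[str, float],
    alpha: float = 0.7,  # hypothetical weight on the propagated signal
    iters: int = 50,
) -> Dict[str, float]:
    """Iterate F <- alpha * (degree-normalised spread of F) + (1 - alpha) * F0."""
    scores = dict(seed)
    for _ in range(iters):
        spread = {node: 0.0 for node in adj}
        for node, neighbours in adj.items():
            if neighbours:
                # Each node shares its current score equally among neighbours.
                share = scores[node] / len(neighbours)
                for nb in neighbours:
                    spread[nb] += share
        scores = {n: alpha * spread[n] + (1 - alpha) * seed[n] for n in adj}
    return scores


# Toy chain network A - B - C - D with a single mutated gene, A.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
mutations = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}

smoothed = propagate(network, mutations)
print({gene: round(value, 3) for gene, value in smoothed.items()})
```

After propagation, genes near the mutated gene in the network receive non-zero scores that decay with distance, which turns a sparse binary profile into a dense, continuous one.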
Affiliation(s)
- Min Zou
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Honghao Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Dongqing Su
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Yuqiang Xiong
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Haodong Wei
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Shiyuan Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Hongmei Sun
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Tao Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Qilemuge Xi
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
| | - Yongchun Zuo
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
- Digital College, Inner Mongolia Intelligent Union Big Data Academy, Inner Mongolia Wesure Date Technology Co., Ltd. Hohhot 010010, China
- Inner Mongolia International Mongolian Hospital, Hohhot 010065, China
| | - Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| |
|
10
|
Luo KH, Wu CH, Yang CC, Chen TH, Tu HP, Yang CH, Chuang HY. Exploring the association of metal mixture in blood to the kidney function and tumor necrosis factor alpha using machine learning methods. ECOTOXICOLOGY AND ENVIRONMENTAL SAFETY 2023; 265:115528. [PMID: 37783110 DOI: 10.1016/j.ecoenv.2023.115528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Revised: 09/09/2023] [Accepted: 09/25/2023] [Indexed: 10/04/2023]
Abstract
This research aimed to examine the relationships between the metal mixture in blood and both kidney function and tumor necrosis factor alpha (TNF-α) using machine learning. Metal levels were measured by Inductively Coupled Plasma Mass Spectrometry in blood from 421 participants. We applied K Nearest Neighbor (KNN), Naive Bayes classifier (NB), Support Vector Machines (SVM), random forest (RF), Gradient Boosting Decision Tree (GBDT), Categorical boosting (CatBoost), eXtreme Gradient Boosting (XGBoost), and Whale Optimization-based XGBoost (WXGBoost) to identify the effect of plasma metals on TNF-α and estimated glomerular filtration rate (eGFR by CKD-EPI equation). We included not only the toxic metals lead (Pb), arsenic (As), and cadmium (Cd) but also the trace essential metals selenium (Se), copper (Cu), zinc (Zn), and cobalt (Co) to predict the interaction of TNF-α, TNF-α/white blood count, and eGFR. The high average TNF-α level group was observed among subjects with higher Pb, As, Cd, Cu, and Zn levels in blood. No associations were shown between the low and high TNF-α level groups in blood Se and Co levels. The lower eGFR group had high Pb, As, Cd, Co, Cu, and Zn levels. The most important metal predictor of TNF-α level was blood Pb, followed by Cd, As, Cu, Se, Zn, and Co. Machine learning revealed that As played the major role among predictors of eGFR after feature selection. The levels of kidney function and TNF-α were modified by co-exposure to metals. We were able to achieve a highest accuracy of over 85% in the multi-metal exposure model. Higher Pb and Zn levels had the strongest interaction with declined eGFR. In addition, As and Cd acted synergistically in the prediction model of TNF-α. We explored the potential of machine learning approaches for predicting health outcomes with multi-metal exposure. The XGBoost model combined with SHAP could give an explicit explanation of individualized and precise risk prediction and insight into the interaction of key features in multi-metal exposure.
Affiliation(s)
- Kuei-Hau Luo
- Graduate Institute of Medicine, College of Medicine, Kaohsiung Medicine University, Kaohsiung City 807, Taiwan
- Chih-Hsien Wu
- Department of Electronic Engineering, National Kaohsiung University of Science and Technology, Kaohsiung 80778, Taiwan
- Chen-Cheng Yang
- Graduate Institute of Medicine, College of Medicine, Kaohsiung Medicine University, Kaohsiung City 807, Taiwan; Department of Occupational Medicine, Kaohsiung Municipal Siaogang Hospital, Kaohsiung Medical University, Kaohsiung 812, Taiwan
- Tzu-Hua Chen
- Department of Family Medicine, Kaohsiung Municipal Ta-Tung Hospital, Kaohsiung 801, Taiwan
- Hung-Pin Tu
- Department of Public Health and Environmental Medicine, School of Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung 807, Taiwan
- Cheng-Hong Yang
- Department of Electronic Engineering, National Kaohsiung University of Science and Technology, Kaohsiung 80778, Taiwan; Department of Information Management, Tainan University of Technology, Tainan 71002, Taiwan; Drug Development and Value Creation Research Center, Kaohsiung Medical University, Kaohsiung 80708, Taiwan; Ph. D. Program in Biomedical Engineering, Kaohsiung Medical University, Kaohsiung 80708, Taiwan; School of Dentistry, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
- Hung-Yi Chuang
- Graduate Institute of Medicine, College of Medicine, Kaohsiung Medicine University, Kaohsiung City 807, Taiwan; Department of Public Health and Environmental Medicine, School of Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung 807, Taiwan; Department of Occupational and Environmental Medicine, Kaohsiung Medicine University Hospital, Kaohsiung Medicine University, Kaohsiung City 807, Taiwan; Ph.D. Program in Environmental and Occupational Medicine, and Research Center for Precision Environmental Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung 807, Taiwan.
|
11
|
Soria C, Arroyo Y, Torres AM, Redondo MÁ, Basar C, Mateo J. Method for Classifying Schizophrenia Patients Based on Machine Learning. J Clin Med 2023; 12:4375. [PMID: 37445410 DOI: 10.3390/jcm12134375] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/29/2023] [Revised: 06/21/2023] [Accepted: 06/27/2023] [Indexed: 07/15/2023] Open
Abstract
Schizophrenia is a chronic and severe mental disorder that affects individuals in various ways, particularly in their ability to perceive, process, and respond to stimuli. This condition has a significant impact on a considerable number of individuals. Consequently, the study, analysis, and characterization of this pathology are of paramount importance. Electroencephalography (EEG) is frequently utilized in the diagnostic assessment of various brain disorders due to its non-intrusiveness, excellent resolution and ease of placement. However, the manual analysis of electroencephalogram (EEG) recordings can be a complex and time-consuming task for healthcare professionals. Therefore, the automated analysis of EEG recordings can help alleviate the burden on doctors and provide valuable insights to support clinical diagnosis. Many studies are working along these lines. In this research paper, the authors propose a machine learning (ML) method based on the eXtreme Gradient Boosting (XGB) algorithm for analyzing EEG signals. The study compares the performance of the proposed XGB-based approach with four other supervised ML systems. According to the results, the proposed XGB-based method demonstrates superior performance, with an AUC value of 0.94 and an accuracy value of 0.94, surpassing the other compared methods. The implemented system exhibits high accuracy and robustness in accurately classifying schizophrenia patients based on EEG recordings. This method holds the potential to be implemented as a valuable complementary tool for clinical use in hospitals, supporting clinicians in their clinical diagnosis of schizophrenia.
Affiliation(s)
- Carmen Soria
- Institute of Technology, University of Castilla-La Mancha, 16071 Cuenca, Spain
- Clinical Neurophysiology Service, Virgen de la Luz Hospital, 16002 Cuenca, Spain
- Yoel Arroyo
- Faculty of Social Sciences and Information Technology, University of Castilla-La Mancha, 45600 Talavera de la Reina, Spain
- Ana María Torres
- Institute of Technology, University of Castilla-La Mancha, 16071 Cuenca, Spain
- Miguel Ángel Redondo
- School of Informatics, University of Castilla-La Mancha, 13071 Ciudad Real, Spain
- Christoph Basar
- Faculty of Human and Health Sciences, University of Bremen, 28359 Bremen, Germany
- Jorge Mateo
- Institute of Technology, University of Castilla-La Mancha, 16071 Cuenca, Spain
|
12
|
Tian J, Yan J, Han G, Du Y, Hu X, He Z, Han Q, Zhang Y. Machine learning prognosis model based on patient-reported outcomes for chronic heart failure patients after discharge. Health Qual Life Outcomes 2023; 21:31. [PMID: 36978124 PMCID: PMC10053412 DOI: 10.1186/s12955-023-02109-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2022] [Accepted: 03/03/2023] [Indexed: 03/30/2023] Open
Abstract
BACKGROUND Patient-reported outcomes (PROs) can be obtained outside hospitals and are of great significance for evaluation of patients with chronic heart failure (CHF). The aim of this study was to establish a prediction model using PROs for out-of-hospital patients. METHODS CHF-PRO were collected in 941 patients with CHF from a prospective cohort. Primary endpoints were all-cause mortality, HF hospitalization, and major adverse cardiovascular events (MACEs). To establish prognosis models during the two-year follow-up, six machine learning methods were used, including logistic regression, random forest classifier, extreme gradient boosting (XGBoost), light gradient boosting machine, naive Bayes, and multilayer perceptron. Models were established in four steps, namely, using general information as predictors, using four domains of CHF-PRO, using both of them, and adjusting the parameters. The discrimination and calibration were then estimated. Further analyses were performed for the best model. The top prediction variables were further assessed. The Shapley additive explanations (SHAP) method was used to explain black boxes of the models. Moreover, a self-made web-based risk calculator was established to facilitate the clinical application. RESULTS CHF-PRO showed strong prediction value and improved the performance of the models. Among the approaches, XGBoost of the parameter adjustment model had the highest prediction performance with an area under the curve of 0.754 (95% CI: 0.737 to 0.761) for death, 0.718 (95% CI: 0.717 to 0.721) for HF rehospitalization and 0.670 (95% CI: 0.595 to 0.710) for MACEs. The four domains of CHF-PRO, especially the physical domain, showed the most significant impact on the prediction of outcomes. CONCLUSION CHF-PRO showed strong prediction value in the models. The XGBoost models using variables based on CHF-PRO and the patient's general information provide prognostic assessment for patients with CHF. The self-made web-based risk calculator can be conveniently used to predict the prognosis for patients after discharge. CLINICAL TRIAL REGISTRATION URL: http://www.chictr.org.cn/index.aspx ; Unique identifier: ChiCTR2100043337.
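The discrimination reported above is an area under the ROC curve (AUC). As a minimal sketch of what that number measures, and not the authors' evaluation code, the function below computes AUC as the fraction of (event, non-event) pairs in which the event case receives the higher predicted risk, counting ties as half. The function name `roc_auc` and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one.

    labels: iterable of 0/1 outcomes (1 = event occurred).
    scores: iterable of predicted risks, same length as labels.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two events, two non-events.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.2])
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 0.754, 0.718, and 0.670 should be read.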
Affiliation(s)
- Jing Tian
- Department of Cardiology, the 1st Hospital of Shanxi Medical University, 85 South Jiefang Road, Taiyuan, Shanxi Province, 030001, China
- Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, 56 South XinJian Road, Taiyuan, Shanxi Province, 030001, China
- Jingjing Yan
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi Province, 030001, China
- Gangfei Han
- Department of Cardiology, the 1st Hospital of Shanxi Medical University, 85 South Jiefang Road, Taiyuan, Shanxi Province, 030001, China
- Yutao Du
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi Province, 030001, China
- Xiaojuan Hu
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi Province, 030001, China
- Zixuan He
- Department of Cardiology, the 1st Hospital of Shanxi Medical University, 85 South Jiefang Road, Taiyuan, Shanxi Province, 030001, China
- Qinghua Han
- Department of Cardiology, the 1st Hospital of Shanxi Medical University, 85 South Jiefang Road, Taiyuan, Shanxi Province, 030001, China.
- Yanbo Zhang
- Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, 56 South XinJian Road, Taiyuan, Shanxi Province, 030001, China.
- Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi Province, 030001, China.
- Shanxi University of Chinese Medicine, 121 University Street, Jinzhong, Shanxi Province, 030619, China.
|
13
|
Pavithra A, Kalpana G, Vigneswaran T. Deep learning-based automated disease detection and classification model for precision agriculture. Soft Comput 2023. [DOI: 10.1007/s00500-023-07936-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023]
|
14
|
Yuan Y, Chen Z. The impacts of land cover spatial combination on nighttime light intensity in 2010 and 2020: a case study of Fuzhou, China. Computational Urban Science 2023. [DOI: 10.1007/s43762-023-00077-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
As human activities depend heavily on land resources and have changed land cover (LC) conditions, the relationship between LC and nighttime light (NTL) intensity has been widely analyzed to support the foundation of NTL applications and help explain the drivers of urban economic development. However, previous studies have focused on the effect of each LC type on NTL intensity, with limited consideration of the joint effects of any two LC types. To fill this gap, this study measured the land cover spatial combination (LCSC) by using a spatial adjacency matrix, and then analyzed its impacts on NTL intensity based on an extreme gradient boosting (XGBoost) regression model with the assistance of the Shapley additive explanations (SHAP) method. Our results showed that the LCSC can better (R2 of 82.4% and 98.1% in 2010 and 2020) explain the relationship between LC and NTL intensity than the traditional LC metrics (e.g., area and patch count), since the LCSC is much more sensitive to the diverse land functions. It is noteworthy that the impacts of LCSC between any two LC types on NTL intensity, as well as their dynamics, vary. LCSC associated with artificial surface contributed more to NTL intensity. In detail, the LCSC of water/wetland and artificial surface can increasingly promote the NTL intensity, while the LCSC of grassland/forest and artificial surface has a decreasing or inverse U-shaped contribution to NTL intensity. In contrast, LCSC associated with non-artificial surface was not conducive to the increase in NTL intensity due to high vegetation density. We also provide three implications to help further the urbanization process and discuss the applications of LCSC.
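The abstract does not spell out how the spatial adjacency matrix is built, so the following is only an illustrative sketch under stated assumptions: land cover is given as a small raster of class labels, and two cells count as adjacent when they share an edge (4-neighbour adjacency). The function name `adjacency_counts` and the toy grid are hypothetical and are not taken from the paper.

```python
from collections import Counter


def adjacency_counts(grid):
    """Count edge-sharing cell pairs for every unordered pair of classes.

    grid: list of equal-length rows, each cell a land-cover class label.
    Returns a Counter keyed by a sorted (class_a, class_b) tuple.
    """
    counts = Counter()
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            a = grid[r][c]
            # Look only right and down so each adjacent pair is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    b = grid[rr][cc]
                    counts[tuple(sorted((a, b)))] += 1
    return counts


# Toy 2 x 3 raster (hypothetical class labels).
grid = [
    ["water", "urban", "urban"],
    ["forest", "urban", "grass"],
]
counts = adjacency_counts(grid)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Counts like these, one per pair of land-cover types, could serve as the kind of pairwise features a regression model takes as input, in contrast to per-class metrics such as area or patch count.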
|
15
|
Zhang H, Chi M, Su D, Xiong Y, Wei H, Yu Y, Zuo Y, Yang L. A random forest-based metabolic risk model to assess the prognosis and metabolism-related drug targets in ovarian cancer. Comput Biol Med 2023; 153:106432. [PMID: 36608460 DOI: 10.1016/j.compbiomed.2022.106432] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 10/04/2022] [Revised: 11/13/2022] [Accepted: 12/13/2022] [Indexed: 12/23/2022]
Abstract
As one of the most common gynecologic malignant tumors, ovarian cancer is usually diagnosed at an advanced and incurable stage because of its early asymptomatic onset. Increasing research into tumor biology has demonstrated that abnormal cellular metabolism precedes tumorigenesis, therefore it has become an area of active research in academia. Cellular metabolism is of great significance in cancer diagnostic and prognostic studies. In this study, we integrated The Cancer Genome Atlas dataset with multiple Gene Expression Omnibus ovarian cancer datasets, identified 17 metabolic pathways with prognostic values using the random forest algorithm, constructed a metabolic risk scoring model based on metabolic pathway enrichment scores, and classified patients with ovarian cancer into two subtypes. Then, we systematically investigated the differences between different subtypes in terms of prognosis, differential gene expression, immune signature enrichment, Hallmark signature enrichment, and somatic mutations. As well, we successfully predicted differences in sensitivity to immunotherapy and chemotherapy drugs in patients with different metabolic risk subtypes. Moreover, we identified 5 drug targets associated with high metabolic risk and low metabolic risk ovarian cancer phenotypes through the weighted correlation network analysis and investigated their roles in the genesis of ovarian cancer. Finally, we developed an XGBoost classifier for predicting metabolic risk types in patients with ovarian cancer, producing a good predictive effect. In light of the above study, the research findings will provide valuable information for prognostic prediction and personalized medical treatment of patients with ovarian cancer.
Affiliation(s)
- Haoxin Zhang
- Department of Gastrointestinal Oncology, Harbin Medical University Cancer Hospital, Harbin, 150081, China
- Meng Chi
- Department of Anesthesiology, Harbin Medical University Cancer Hospital, Harbin, 150081, China
- Dongqing Su
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
- Yuqiang Xiong
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
- Haodong Wei
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
- Yao Yu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
- Yongchun Zuo
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China; Digital College, Inner Mongolia Intelligent Union Big Data Academy, Inner Mongolia Wesure Date Technology Co., Ltd, Hohhot, 010010, China.
- Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China.
|
16
|
Liu S, Chen R, Gu Y, Yu Q, Su G, Ren Y, Huang L, Zhou F. AcneTyper: An automatic diagnosis method of dermoscopic acne image via self-ensemble and stacking. Technol Health Care 2022:THC220295. [PMID: 36617797 DOI: 10.3233/thc-220295] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/05/2023]
Abstract
BACKGROUND Acne is a type of skin lesion common in adolescents, and it poses computational challenges for automatic diagnosis. Computer vision algorithms are utilized to detect and determine different subtypes of acne. Most of the existing acne detection algorithms are based on natural facial images, which carry noise factors such as illumination. OBJECTIVE To tackle this issue, this study collected ACNEDer, a dataset of dermoscopic acne images with annotations. Deep learning methods have demonstrated powerful capabilities in automatic acne diagnosis, and they usually release the training epoch with the best performance as the delivered model. METHODS This study proposes AcneTyper, a novel self-ensemble and stacking-based framework for diagnosing the acne subtypes. Instead of delivering the best epoch, AcneTyper consolidates the prediction results of all training epochs as the latent features and stacks the best subset of these latent features for distinguishing different acne subtypes. RESULTS The proposed AcneTyper framework achieves a promising detection performance of acne subtypes and even outperforms a clinical dermatologist with two years of experience by 6.8% in accuracy. CONCLUSION The proposed method determines different subtypes of acne, outperforms inexperienced dermatologists, and contributes to reducing the probability of misdiagnosis.
Affiliation(s)
- Shuai Liu
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun, Jilin, China
- Ruili Chen
- Department of Dermatology and Venereology, The First Hospital of Jilin University, Changchun, Jilin, China
- Yun Gu
- Department of Dermatology and Venereology, The First Hospital of Jilin University, Changchun, Jilin, China
- Qiong Yu
- Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, Jilin, China
- Guoxiong Su
- Beijing Dr. of Acne Medical Research Institute, Beijing, China
- Yanjiao Ren
- College of Information Technology (Smart Agriculture Research Institute), Jilin Agricultural University, Changchun, Jilin, China
- Lan Huang
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun, Jilin, China
- Fengfeng Zhou
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun, Jilin, China
|
17
|
Zhu L, Wang X, Li F, Song J. PreAcrs: a machine learning framework for identifying anti-CRISPR proteins. BMC Bioinformatics 2022; 23:444. [PMID: 36284264 PMCID: PMC9597991 DOI: 10.1186/s12859-022-04986-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/26/2022] [Accepted: 10/14/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Anti-CRISPR proteins are potent modulators that inhibit the CRISPR-Cas immunity system and have huge potential in gene editing and gene therapy as a genome-editing tool. Extensive studies have shown that anti-CRISPR proteins are essential for modifying endogenous genes, promoting the RNA-guided binding and cleavage of DNA or RNA substrates. In recent years, identifying and characterizing anti-CRISPR proteins has become a hot and significant research topic in bioinformatics. However, as most anti-CRISPR proteins fall short in sharing similarities to those currently known, traditional screening methods are time-consuming and inefficient. Machine learning methods could fill this gap with powerful predictive capability and provide a new perspective for anti-CRISPR protein identification. RESULTS Here, we present a novel machine learning ensemble predictor, called PreAcrs, to identify anti-CRISPR proteins from protein sequences directly. Three features and eight different machine learning algorithms were used to train PreAcrs. PreAcrs outperformed other existing methods and significantly improved the prediction accuracy for identifying anti-CRISPR proteins. CONCLUSIONS In summary, the PreAcrs predictor achieved a competitive performance for predicting new anti-CRISPR proteins in terms of accuracy and robustness. We anticipate PreAcrs will be a valuable tool for researchers to speed up the research process. The source code is available at: https://github.com/Lyn-666/anti_CRISPR.git .
Affiliation(s)
- Lin Zhu
- Institute for Advanced Study, Shenzhen University, Shenzhen, China
- Xiaoyu Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800 Australia
- Fuyi Li
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC Australia
- Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800 Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800 Australia
|
18
|
Sun C, Tian H, Mazurczyk W, Chang C, Cai Y, Chen Y. Towards blind detection of steganography in low‐bit‐rate speech streams. Int J Intell Syst 2022. [DOI: 10.1002/int.23077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Affiliation(s)
- Congcong Sun
- College of Computer Science and Technology National Huaqiao University Xiamen China
- Xiamen Key Laboratory of Data Security and Blockchain Technology National Huaqiao University Xiamen China
- Fujian Key Laboratory of Big Data Intelligence and Security National Huaqiao University Xiamen China
- Hui Tian
- College of Computer Science and Technology National Huaqiao University Xiamen China
- Xiamen Key Laboratory of Data Security and Blockchain Technology National Huaqiao University Xiamen China
- Fujian Key Laboratory of Big Data Intelligence and Security National Huaqiao University Xiamen China
- Wojciech Mazurczyk
- Institute of Computer Science Warsaw University of Technology Warsaw Poland
- Chin‐Chen Chang
- Department of Information and Computer Science Feng Chia University Taichung Taiwan
- Yiqiao Cai
- College of Computer Science and Technology National Huaqiao University Xiamen China
- Xiamen Key Laboratory of Data Security and Blockchain Technology National Huaqiao University Xiamen China
- Fujian Key Laboratory of Big Data Intelligence and Security National Huaqiao University Xiamen China
- Yonghong Chen
- College of Computer Science and Technology National Huaqiao University Xiamen China
- Xiamen Key Laboratory of Data Security and Blockchain Technology National Huaqiao University Xiamen China
- Fujian Key Laboratory of Big Data Intelligence and Security National Huaqiao University Xiamen China
|
19
|
Yue Y, Ye C, Peng PY, Zhai HX, Ahmad I, Xia C, Wu YZ, Zhang YH. A deep learning framework for identifying essential proteins based on multiple biological information. BMC Bioinformatics 2022; 23:318. [PMID: 35927611 PMCID: PMC9351218 DOI: 10.1186/s12859-022-04868-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 06/07/2022] [Accepted: 07/29/2022] [Indexed: 11/15/2022] Open
Abstract
Background Essential proteins exert vital functions in cellular processes and are indispensable for the survival and reproduction of the organism. Traditional centrality methods perform poorly on complex protein–protein interaction (PPI) networks. Machine learning approaches based on high-throughput data do not exploit the temporal and spatial dimensions of biological information. Results We put forward a deep learning framework to predict essential proteins by integrating features obtained from the PPI network, subcellular localization, and gene expression profiles. In our model, the node2vec method is applied to learn continuous feature representations for proteins in the PPI network, which capture the diversity of connectivity patterns in the network. Depthwise separable convolution is applied to gene expression profiles to extract properties and observe the trends of gene expression over time under different experimental conditions. Subcellular localization information is mapped into a long one-dimensional vector to capture its characteristics. Additionally, we use a sampling method to mitigate the impact of imbalanced learning when training the model. In experiments carried out on data from Saccharomyces cerevisiae, our model outperforms traditional centrality methods and machine learning methods. Likewise, the comparative experiments show that our processing of the various types of biological information is preferable. Conclusions Our proposed deep learning framework effectively identifies essential proteins by integrating multiple biological data, showing that a broader selection of subcellular localization information significantly improves the prediction results and that depthwise separable convolution applied to gene expression profiles enhances performance.
Affiliation(s)
- Yi Yue
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China; School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China; State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, 230036, China
- Chen Ye
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Pei-Yun Peng
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Hui-Xin Zhai
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Iftikhar Ahmad
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Chuan Xia
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China
- Yun-Zhi Wu
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China; State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, 230036, China
- You-Hua Zhang
- Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, China; School of Information and Computer, Anhui Agricultural University, Hefei, 230036, China; School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China
|
20
|
Ramón A, Torres AM, Milara J, Cascón J, Blasco P, Mateo J. eXtreme Gradient Boosting-based method to classify patients with COVID-19. J Investig Med 2022; 70:jim-2021-002278. [PMID: 35850970 DOI: 10.1136/jim-2021-002278] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Accepted: 06/15/2022] [Indexed: 01/08/2023]
Abstract
Different demographic, clinical and laboratory variables have been related to the severity and mortality following SARS-CoV-2 infection. Most studies applied traditional statistical methods and in some cases combined with a machine learning (ML) method. This is the first study to date to comparatively analyze five ML methods to select the one that most closely predicts mortality in patients admitted with COVID-19. The aim of this single-center observational study is to classify, based on different types of variables, adult patients with COVID-19 at increased risk of mortality. SARS-CoV-2 infection was defined by a positive reverse transcriptase PCR. A total of 203 patients were admitted between March 15 and June 15, 2020 to a tertiary hospital. Data were extracted from the electronic medical record. Four supervised ML algorithms (k-nearest neighbors (KNN), decision tree (DT), Gaussian naïve Bayes (GNB) and support vector machine (SVM)) were compared with the eXtreme Gradient Boosting (XGB) method proposed to have excellent scalability and high running speed, among other qualities. The results indicate that the XGB method has the best prediction accuracy (92%), high precision (>0.92) and high recall (>0.92). The KNN, SVM and DT approaches present moderate prediction accuracy (>80%), moderate recall (>0.80) and moderate precision (>0.80). The GNB algorithm shows relatively low classification performance. The variables with the greatest weight in predicting mortality were C reactive protein, procalcitonin, glutamyl oxaloacetic transaminase, glutamyl pyruvic transaminase, neutrophils, D-dimer, creatinine, lactic acid, ferritin, days of non-invasive ventilation, septic shock and age. Based on these results, XGB is a solid candidate for correct classification of patients with COVID-19.
Affiliation(s)
- Antonio Ramón
- Pharmacy Department, General University Hospital Consortium of Valencia, Valencia, Spain
- Ana Maria Torres
- Institute of Technology, Universidad de Castilla-La Mancha, Cuenca, Spain
- Javier Milara
- Pharmacy Department, General University Hospital Consortium of Valencia, Valencia, Spain
- Pharmacy Department, University of Valencia, Valencia, Spain
- Joaquín Cascón
- Institute of Technology, Universidad de Castilla-La Mancha, Cuenca, Spain
- Pilar Blasco
- Pharmacy Department, General University Hospital Consortium of Valencia, Valencia, Spain
- Jorge Mateo
- Institute of Technology, Universidad de Castilla-La Mancha, Cuenca, Spain
|
21
|
Prediction of Gestational Diabetes Mellitus under Cascade and Ensemble Learning Algorithm. Computational Intelligence and Neuroscience 2022; 2022:3212738. [PMID: 35875747 PMCID: PMC9303101 DOI: 10.1155/2022/3212738] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/21/2022] [Revised: 05/18/2022] [Accepted: 05/30/2022] [Indexed: 11/17/2022]
Abstract
Gestational diabetes mellitus (GDM) is one of the risk factors for fetal dysplasia and maternal pregnancy difficulties. Predicting the risk of GDM in advance has therefore become a major need for millions of families, and machine learning technology is adopted here to study GDM prediction. First, the data are preprocessed, and the mean value is used for outlier processing. After preprocessing, the IV method is used to screen the features. Of the 83 features in the original sample data, 40 important features are selected through feature engineering. On this basis, logistic regression, Lasso-Logistics, Gradient Boosting Decision Tree (GBDT), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Gradient Boosting Categorical Features (CatBoost) models are established, and multiple learners are integrated. Finally, the constructed model is tested on data sets. The accuracy of the proposed model is 80.3%, the precision is 74.6%, the recall rate is 79.3%, and the running time is only 2.53 seconds. This means that the proposed model is superior to the previous models in terms of accuracy, precision, recall rate, and F1 value, and the time consumption is also in line with actual engineering requirements. The proposed scheme provides some ideas for the research of machine learning technology in disease prediction.
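The abstract compares models on accuracy, precision, recall rate, and F1 value. As a reminder of how these four quantities relate, the sketch below derives them from the counts of a binary confusion matrix. It is a generic illustration, not the authors' code; the function name `classification_metrics` and the counts in the usage example are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive standard binary-classification metrics from confusion counts.

    tp/fp/fn/tn: true positives, false positives, false negatives,
    true negatives. Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for 100 test cases.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(metrics)
```

Because precision and recall use different denominators, a model can score well on one and poorly on the other, which is why the F1 value is reported alongside accuracy.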
|
22
|
Torkamannia A, Omidi Y, Ferdousi R. A review of machine learning approaches for drug synergy prediction in cancer. Brief Bioinform 2022; 23:6552269. [PMID: 35323854 DOI: 10.1093/bib/bbac075] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 11/24/2021] [Revised: 01/19/2022] [Accepted: 02/14/2022] [Indexed: 02/06/2023]
Abstract
Combinational pharmacotherapy with the synergistic/additive effect is a powerful treatment strategy for complex diseases such as malignancies. Identifying synergistic combinations with various compounds and structures requires testing a large number of compound combinations. However, in practice, examining different compounds by in vivo and in vitro approaches is costly, infeasible and challenging. In the last decades, significant success has been achieved by expanding computational methods in different pharmacological and bioinformatics domains. As promising tools, computational approaches such as machine learning algorithms (MLAs) are used for prioritizing combinational pharmacotherapies. This review aims to provide the models developed to predict synergistic drug combinations in cancer by MLAs with various information, including gene expression, protein-protein interactions, metabolite interactions, pathways and pharmaceutical information such as chemical structure, molecular descriptor and drug-target interactions.
Affiliation(s)
- Anna Torkamannia
- Department of Health Information Technology, School of Management and Medical Informatics, Tabriz University of Medical Sciences, Tabriz, Iran
| | - Yadollah Omidi
- Department of Pharmaceutical Sciences, College of Pharmacy, Nova Southeastern University, Fort Lauderdale, Florida, United States
| | - Reza Ferdousi
- Department of Health Information Technology, School of Management and Medical Informatics, Tabriz University of Medical Sciences, Tabriz, Iran
| |
|
23
|
Chen XG, Liu S, Zhang W. Predicting Coding Potential of RNA Sequences by Solving Local Data Imbalance. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1075-1083. [PMID: 32886613 DOI: 10.1109/tcbb.2020.3021800] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 06/11/2023]
Abstract
Non-coding RNAs (ncRNAs) play an important role in various biological processes and are associated with diseases. Distinguishing between coding RNAs and ncRNAs, also known as predicting the coding potential of RNA sequences, is critical for downstream biological function analysis. Many machine learning-based methods have been proposed for predicting the coding potential of RNA sequences. Recent studies reveal that most existing methods perform poorly on RNA sequences with short Open Reading Frames (sORF, ORF length < 303 nt). In this work, we analyze the distribution of ORF lengths of RNA sequences and observe that the number of coding RNAs with sORF is inadequate and that coding RNAs with sORF are far fewer than ncRNAs with sORF. Thus, there is a problem of local data imbalance in RNA sequences with sORF. We propose a coding potential prediction method, CPE-SLDI, which uses data oversampling techniques to augment samples for coding RNAs with sORF so as to alleviate local data imbalance. Compared with existing methods, CPE-SLDI performs better, and studies reveal that data augmentation by various data oversampling techniques can enhance the performance of coding potential prediction, especially for RNA sequences with sORF. The implementation of the proposed method is available at https://github.com/chenxgscuec/CPESLDI.
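The local-imbalance idea in this abstract can be sketched with plain random oversampling restricted to the short-ORF subset. This is a minimal illustration only: the abstract does not say which oversampling techniques CPE-SLDI uses, and the record layout, the function name `oversample_short_orf_coding`, and the balancing target are assumptions of this sketch. Only the 303 nt cutoff comes from the abstract.

```python
import random

SORF_CUTOFF_NT = 303  # short-ORF threshold quoted in the abstract


def oversample_short_orf_coding(records, cutoff=SORF_CUTOFF_NT, seed=0):
    """Duplicate short-ORF coding records until they match short-ORF ncRNAs.

    records: list of (sequence_id, orf_length_nt, is_coding) tuples
        (a hypothetical layout chosen for this sketch).
    Returns a new list; the input list is left unchanged.
    """
    rng = random.Random(seed)
    short_coding = [r for r in records if r[1] < cutoff and r[2]]
    short_noncoding = [r for r in records if r[1] < cutoff and not r[2]]
    if not short_coding:
        return list(records)  # nothing to resample from

    # Only the short-ORF region is rebalanced; long-ORF records are untouched.
    deficit = max(0, len(short_noncoding) - len(short_coding))
    extras = [rng.choice(short_coding) for _ in range(deficit)]
    return list(records) + extras
```

Restricting the resampling to sequences below the cutoff leaves the overall class ratio for long-ORF sequences as it was, which is the sense in which the imbalance being corrected is "local".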
|
24
|
Li C, Tian C, Zeng Y, Liang J, Yang Q, Gu F, Hu Y, Liu L. Machine learning and bioinformatics analysis revealed classification and potential treatment strategy in stage 3-4 NSCLC patients. BMC Med Genomics 2022; 15:33. [PMID: 35193578 PMCID: PMC8862473 DOI: 10.1186/s12920-022-01184-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/17/2021] [Accepted: 02/14/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Precision medicine has increased the accuracy of cancer diagnosis and treatment, especially in the era of cancer immunotherapy. Despite recent advances in cancer immunotherapy, the overall survival rate of advanced NSCLC patients remains low. A better classification in advanced NSCLC is important for developing more effective treatments. METHOD The calculation of abundances of tumor-infiltrating immune cells (TIICs) was conducted using Cell-type Identification By Estimating Relative Subsets Of RNA Transcripts (CIBERSORT), xCell (xCELL), Tumor IMmune Estimation Resource (TIMER), Estimate the Proportion of Immune and Cancer cells (EPIC), and Microenvironment Cell Populations-counter (MCP-counter). K-means clustering was used to classify patients, and four machine learning methods (SVM, Randomforest, Adaboost, Xgboost) were used to build the classifiers. Multi-omics datasets (including transcriptomics, DNA methylation, copy number alterations, miRNA profile) and ICI immunotherapy treatment cohorts were obtained from various databases. The drug sensitivity data were derived from PRISM and CTRP databases. RESULTS In this study, patients with stage 3-4 NSCLC were divided into three clusters according to the abundance of TIICs, and we established classifiers to distinguish these clusters based on different machine learning algorithms (including SVM, RF, Xgboost, and Adaboost). Patients in cluster-2 were found to have a survival advantage and might have a favorable response to immunotherapy. We then constructed an immune-related Poor Prognosis Signature which could successfully predict the advanced NSCLC patient survival, and through epigenetic analysis, we found 3 key molecules (HSPA8, CREB1, RAP1A) which might serve as potential therapeutic targets in cluster-1. In the end, after screening of drug sensitivity data derived from CTRP and PRISM databases, we identified several compounds which might serve as medication for different clusters. 
CONCLUSIONS Our study has not only depicted the landscape of different clusters of stage 3-4 NSCLC but also presented a treatment strategy for patients with advanced NSCLC.
Affiliation(s)
- Chang Li
- Cancer Center, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Chen Tian
- Cancer Center, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Yulan Zeng
- Cancer Center, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Jinyan Liang
- Department of Ultrasound, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Qifan Yang
- Cancer Center, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Feifei Gu
- Cancer Center, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Yue Hu
- Cancer Center, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China.
| | - Li Liu
- Cancer Center, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China.
| |
|
25
|
Li H, Shi L, Gao W, Zhang Z, Zhang L, Wang G. dPromoter-XGBoost: Detecting promoters and strength by combining multiple descriptors and feature selection using XGBoost. Methods 2022; 204:215-222. [PMID: 34998983 DOI: 10.1016/j.ymeth.2022.01.001] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 11/19/2021] [Revised: 12/13/2021] [Accepted: 01/02/2022] [Indexed: 12/12/2022]
Abstract
Promoters, which are responsible for stimulating the transcription and expression of specific genes, play an irreplaceable role in biological processes and genetics. Promoter abnormalities have been found in some diseases, and the level of promoter-binding transcription factors can be used as a marker before a disease occurs. Hence, detecting promoters from DNA sequences has important biological significance; in particular, distinguishing strong promoters can help to elucidate differences in gene expression and the mechanisms of specific diseases. With the introduction of third-generation sequencing, experimental labeling of promoters cannot keep pace with the speed of sequencing. Many computing models have been designed to fill this gap and identify unlabeled DNA. However, they rely on a single feature representation, which cannot reflect the information contained in the original samples. To avoid information loss, we propose a computational model based on multiple descriptors and feature selection to jointly express samples. Notably, a new feature descriptor called K-mer word vector is defined. The promoter model of multiple feature descriptors dominated by K-mer word vector achieves performance similar to existing methods, and its sensitivity of 85.72% means it distinguishes promoters more effectively than other methods. Furthermore, its performance on promoter strength surpasses published methods, and an accuracy of 77.00% greatly improves the ability to distinguish between strong and weak promoters.
Affiliation(s)
- Hongfei Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China; Yangtze Delta Region Institute, University of Electronic Science and Technology, Quzhou, China
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China
| | - Wentao Gao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zixiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China.
| |
|
26
|
Arif M, Ahmed S, Ge F, Kabir M, Khan YD, Yu DJ, Thafar M. StackACPred: Prediction of anticancer peptides by integrating optimized multiple feature descriptors with stacked ensemble approach. Chemometrics and Intelligent Laboratory Systems 2022; 220:104458. [DOI: 10.1016/j.chemolab.2021.104458] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Indexed: 11/17/2024]
|
27
|
Su D, Wang S, Xi Q, Lin L, Lu Q, Yu Y, Xiong Y, Wei H, Liang P, Lv Y, Zuo Y, Yang L. Prognostic and predictive value of a metabolic risk score model in breast cancer: an immunogenomic landscape analysis. Brief Funct Genomics 2021; 21:128-141. [PMID: 34755827 DOI: 10.1093/bfgp/elab040] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/26/2021] [Revised: 10/12/2021] [Accepted: 10/15/2021] [Indexed: 01/21/2023]
Abstract
Breast cancer is a kind of malignant tumor that occurs in breast tissue, which is the most common cancer in women. Cellular metabolism is a critical determinant of the viability and function of cancer cells in tumor microenvironment. In this study, based on the gene expression profile of metabolism-related genes, the prognostic value of 20 metabolic pathways in patients with breast cancer was identified. A universal risk stratification signature that relies on 20 metabolic pathways was established and validated in training cohort, two testing cohorts and The Cancer Genome Atlas pan cancer cohort. Then, the relationship between metabolic risk score subtype, prognosis, immune infiltration level, cancer genotypes and their impact on therapeutic benefit were characterized. Results demonstrated that the patients with the low metabolic risk score subtype displayed good prognosis, high level of immune infiltration and exhibited a favorable response to neoadjuvant chemotherapy and immunotherapy. Taken together, the work presented in this study may deepen the understanding of metabolic hallmarks of breast cancer, and may provide some valuable information for personalized therapies in patients with breast cancer.
Affiliation(s)
- Dongqing Su
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Shiyuan Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Qilemuge Xi
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
| | - Lin Lin
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Qianzi Lu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Yao Yu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Yuqiang Xiong
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Haodong Wei
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Pengfei Liang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
| | - Yingli Lv
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Yongchun Zuo
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
| | - Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| |
|
28
|
Feng T, Zhao J, Wei D, Guo P, Yang X, Li Q, Fang Z, Wei Z, Li M, Jiang Y, Luo Y. Immunogenomic Analyses of the Prognostic Predictive Model for Patients With Renal Cancer. Front Immunol 2021; 12:762120. [PMID: 34712244 PMCID: PMC8546215 DOI: 10.3389/fimmu.2021.762120] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/21/2021] [Accepted: 09/27/2021] [Indexed: 01/03/2023]
Abstract
Background Renal cell carcinoma (RCC) is associated with poor prognostic outcomes. The current stratifying system does not predict prognostic outcomes and therapeutic benefits precisely for RCC patients. Here, we aim to construct an immune prognostic predictive model to assist clinicians in predicting RCC prognosis. Methods An immune prognostic signature was developed, and its predictive ability was confirmed in the kidney renal clear cell carcinoma (KIRC) cohorts based on The Cancer Genome Atlas (TCGA) dataset. Several immunogenomic analyses were conducted to investigate the correlations between immune risk scores and immune cell infiltrations, immune checkpoints, cancer genotypes, tumor mutational burden, and responses to chemotherapy and immunotherapy. Results The immune prognostic signature contained 14 immune-associated genes and was found to be an independent prognostic factor for KIRC. Furthermore, the immune risk score was established as a novel marker for predicting the overall survival outcomes for RCC. The risk score was correlated with some significant immunophenotypic factors, including T cell infiltration, antitumor immunity, antitumor response, oncogenic pathways, and immunotherapeutic and chemotherapeutic response. Conclusions The immune prognostic predictive model can be effectively and efficiently used in the prediction of survival outcomes and immunotherapeutic responses of RCC patients.
Affiliation(s)
- Tao Feng
- Department of Urology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Jiahui Zhao
- Department of Urology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Dechao Wei
- Department of Urology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Pengju Guo
- Department of Urology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Xiaobing Yang
- Department of Urology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Qiankun Li
- Department of Urology, Beijing Huairou Hospital, Beijing, China
| | - Zhou Fang
- Department of Cardiovascular Surgery, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Ziheng Wei
- Department of Cardiovascular Surgery, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Mingchuan Li
- Department of Urology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Yongguang Jiang
- Department of Urology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Yong Luo
- Department of Urology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| |
|
31
|
Campos TL, Korhonen PK, Hofmann A, Gasser RB, Young ND. Harnessing model organism genomics to underpin the machine learning-based prediction of essential genes in eukaryotes - Biotechnological implications. Biotechnol Adv 2021; 54:107822. [PMID: 34461202 DOI: 10.1016/j.biotechadv.2021.107822] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/09/2021] [Revised: 08/17/2021] [Accepted: 08/24/2021] [Indexed: 12/17/2022]
Abstract
The availability of high-quality genomes and advances in functional genomics have enabled large-scale studies of essential genes in model eukaryotes, including the 'elegant worm' (Caenorhabditis elegans; Nematoda) and the 'vinegar fly' (Drosophila melanogaster; Arthropoda). However, this is not the case for other, much less-studied organisms, such as socioeconomically important parasites, for which functional genomic platforms usually do not exist. Thus, there is a need to develop innovative techniques or approaches for the prediction, identification and investigation of essential genes. A key approach that could enable the prediction of such genes is machine learning (ML). Here, we undertake an historical review of experimental and computational approaches employed for the characterisation of essential genes in eukaryotes, with a particular focus on model ecdysozoans (C. elegans and D. melanogaster), and discuss the possible applicability of ML-approaches to organisms such as socioeconomically important parasites. We highlight some recent results showing that high-performance ML, combined with feature engineering, allows a reliable prediction of essential genes from extensive, publicly available 'omic data sets, with major potential to prioritise such genes (with statistical confidence) for subsequent functional genomic validation. These findings could 'open the door' to fundamental and applied research areas. Evidence of some commonality in the essential gene-complement between these two organisms indicates that an ML-engineering approach could find broader applicability to ecdysozoans such as parasitic nematodes or arthropods, provided that suitably large and informative data sets become/are available for proper feature engineering, and for the robust training and validation of algorithms. 
This area warrants detailed exploration to, for example, facilitate the identification and characterisation of essential molecules as novel targets for drugs and vaccines against parasitic diseases. This focus is particularly important, given the substantial impact that such diseases have worldwide, and the current challenges associated with their prevention and control and with drug resistance in parasite populations.
Affiliation(s)
- Tulio L Campos
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia; Bioinformatics Core Facility, Instituto Aggeu Magalhães, Fundação Oswaldo Cruz (IAM-Fiocruz), Recife, Pernambuco, Brazil
| | - Pasi K Korhonen
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Andreas Hofmann
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia.
| | - Neil D Young
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia.
| |
|
32
|
Zhong J, Tang C, Peng W, Xie M, Sun Y, Tang Q, Xiao Q, Yang J. A novel essential protein identification method based on PPI networks and gene expression data. BMC Bioinformatics 2021; 22:248. [PMID: 33985429 PMCID: PMC8120700 DOI: 10.1186/s12859-021-04175-8] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 08/07/2020] [Accepted: 05/06/2021] [Indexed: 02/08/2023]
Abstract
Background Some proposed methods for identifying essential proteins achieve better results by using biological information. Gene expression data is generally used to identify essential proteins. However, gene expression data is prone to fluctuations, which may affect the accuracy of essential protein identification. Therefore, we propose an essential protein identification method based on gene expression and PPI network data that calculates the similarity of the "active" and "inactive" states of gene expression in a cluster of the PPI network. Our experiments show that the method can improve the accuracy of predicting essential proteins. Results In this paper, we propose a new measure named JDC, which is based on PPI network data and gene expression data. The JDC method offers a dynamic threshold method to binarize gene expression data. After that, it combines the degree centrality and Jaccard similarity index to calculate the JDC score for each protein in the PPI network. We benchmark the JDC method on four organisms, and evaluate our method by using ROC analysis, modular analysis, jackknife analysis, overlapping analysis, top analysis, and accuracy analysis. The results show that the performance of JDC is better than DC, IC, EC, SC, BC, CC, NC, PeC, and WDC. We compare JDC with both NF-PIN and TS-PIN methods, which predict essential proteins through active PPI networks constructed from dynamic gene expression. Conclusions We demonstrate that the new centrality measure, JDC, is more efficient than state-of-the-art prediction methods with the same input. The main ideas behind JDC are as follows: (1) Essential proteins are generally found in densely connected clusters in the PPI network. (2) Binarizing gene expression data can screen out fluctuations in gene expression profiles. (3) The essentiality of the protein depends on the similarity of the "active" and "inactive" states of gene expression in a cluster of the PPI network.
Affiliation(s)
- Jiancheng Zhong
- School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China; Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Changsha, 410083, China
| | - Chao Tang
- School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
| | - Wei Peng
- College of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, Yunnan, China
| | - Minzhu Xie
- School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
| | - Yusui Sun
- School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
| | - Qiang Tang
- College of Engineering and Design, Hunan Normal University, Changsha, 410081, China
| | - Qiu Xiao
- School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China.
| | - Jiahong Yang
- School of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China.
| |
|
33
|
Payra AK, Saha B, Ghosh A. Ortho_Sim_Loc: Essential protein prediction using orthology and priority-based similarity approach. Comput Biol Chem 2021; 92:107503. [PMID: 33962168 DOI: 10.1016/j.compbiolchem.2021.107503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2020] [Revised: 04/02/2021] [Accepted: 04/21/2021] [Indexed: 10/21/2022]
Abstract
Proteins are the essential macromolecules of living organisms, but not all proteins can be considered essential in different relevant studies. The essentiality of a protein is thus often determined by computational methods rather than biological experiments, which saves both time and effort. Researchers have already proposed different computational approaches that successfully select essential proteins with different biological significance. Most of the experimental approaches return higher false negative outcomes with respect to others. In order to retain the prediction accuracy level, a novel methodology, "Ortho_Sim_Loc", has been proposed, which combines orthology, similarity (using clustering and priority-based GO annotation) and subcellular localization. Ortho_Sim_Loc can predict an enriched functional set of essential proteins. The predicted results are validated against other existing methods, such as different centrality measures and LIDC. The validation results show that Ortho_Sim_Loc performs better than other existing computational approaches.
Affiliation(s)
- Anjan Kumar Payra
- Department of Computer Science & Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, 540, Dum Dum Road, Near Dum Dum Jn. Station, Surermath, Kolkata, 700074, India.
| | - Banani Saha
- Department of Computer Science & Engineering, University of Calcutta, Saltlake City, Kolkata, 700073, India.
| | - Anupam Ghosh
- Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Panchpota, Garia, Kolkata, 700152, India.
| |
|
34
|
Development of machine learning model for diagnostic disease prediction based on laboratory tests. Sci Rep 2021; 11:7567. [PMID: 33828178 PMCID: PMC8026627 DOI: 10.1038/s41598-021-87171-5] [Citation(s) in RCA: 49] [Impact Index Per Article: 12.3] [Received: 08/31/2020] [Accepted: 03/19/2021] [Indexed: 01/16/2023]
Abstract
The use of deep learning and machine learning (ML) in medical science is increasing, particularly in the visual, audio, and language data fields. We aimed to build a new optimized ensemble model by blending a DNN (deep neural network) model with two ML models for disease prediction using laboratory test results. 86 attributes (laboratory tests) were selected from datasets based on value counts, clinical importance-related features, and missing values. We collected sample datasets on 5145 cases, including 326,686 laboratory test results. We investigated a total of 39 specific diseases based on the International Classification of Diseases, 10th revision (ICD-10) codes. These datasets were used to construct light gradient boosting machine (LightGBM) and extreme gradient boosting (XGBoost) ML models and a DNN model using TensorFlow. The optimized ensemble model achieved an F1-score of 81% and prediction accuracy of 92% for the five most common diseases. The deep learning and ML models showed differences in predictive power and disease classification patterns. We used a confusion matrix and analyzed feature importance using the SHAP value method. Our new ML model achieved high efficiency of disease prediction through classification of diseases. This study will be useful in the prediction and diagnosis of diseases.
|
35
|
Extreme gradient boosting machine learning method for predicting medical treatment in patients with acute bronchiolitis. Biocybern Biomed Eng 2021. [DOI: 10.1016/j.bbe.2021.04.015] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/23/2022]
|
36
|
Jiang J, Pan H, Li M, Qian B, Lin X, Fan S. Predictive model for the 5-year survival status of osteosarcoma patients based on the SEER database and XGBoost algorithm. Sci Rep 2021; 11:5542. [PMID: 33692453 PMCID: PMC7970935 DOI: 10.1038/s41598-021-85223-4] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 09/06/2020] [Accepted: 02/26/2021] [Indexed: 11/09/2022]
Abstract
Osteosarcoma is the most common bone malignancy, with the highest incidence in children and adolescents. Survival rate prediction is important for improving prognosis and planning therapy. However, there is still no prediction model with a high accuracy rate for osteosarcoma. Therefore, we aimed to construct an artificial intelligence (AI) model for predicting the 5-year survival of osteosarcoma patients by using extreme gradient boosting (XGBoost), a large-scale machine-learning algorithm. We identified cases of osteosarcoma in the Surveillance, Epidemiology, and End Results (SEER) Research Database and excluded substandard samples. The study population was 835 and was divided into the training set (n = 668) and validation set (n = 167). Characteristics selected via survival analyses were used to construct the model. Receiver operating characteristic (ROC) curve and decision curve analyses were performed to evaluate the prediction. The accuracy of the prediction model was excellent both in the training set (area under the ROC curve [AUC] = 0.977) and the validation set (AUC = 0.911). Decision curve analyses proved the model could be used to support clinical decisions. XGBoost is an effective algorithm for predicting 5-year survival of osteosarcoma patients. Our prediction model had excellent accuracy and is therefore useful in clinical settings.
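The AUC figures quoted in this abstract can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The Python sketch below computes that quantity directly from pairs. It is a generic illustration rather than the authors' evaluation code; the function name `roc_auc` and the toy scores in the usage note are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class, 0 for the negative class.
    scores: predicted score per sample (higher means more likely positive).
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` ranks three of the four positive-negative pairs correctly. The pairwise loop is quadratic in the number of samples, which is acceptable for a validation set of the size reported here (n = 167) but would usually be replaced by a rank-based formula for larger data.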
Affiliation(s)
- Jiuzhou Jiang
  - Department of Orthopaedic Surgery, Sir Run Run Shaw Hospital, Medical College of Zhejiang University, Hangzhou, China
  - Key Laboratory of Musculoskeletal System Degeneration and Regeneration Translational Research of Zhejiang Province, Hangzhou, China
- Hao Pan
  - Department of Orthopaedics, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
- Mobai Li
  - Department of Orthopaedic Surgery, Sir Run Run Shaw Hospital, Medical College of Zhejiang University, Hangzhou, China
  - Key Laboratory of Musculoskeletal System Degeneration and Regeneration Translational Research of Zhejiang Province, Hangzhou, China
- Bao Qian
  - Department of Orthopaedic Surgery, Sir Run Run Shaw Hospital, Medical College of Zhejiang University, Hangzhou, China
  - Key Laboratory of Musculoskeletal System Degeneration and Regeneration Translational Research of Zhejiang Province, Hangzhou, China
- Xianfeng Lin
  - Department of Orthopaedic Surgery, Sir Run Run Shaw Hospital, Medical College of Zhejiang University, Hangzhou, China
  - Key Laboratory of Musculoskeletal System Degeneration and Regeneration Translational Research of Zhejiang Province, Hangzhou, China
- Shunwu Fan
  - Department of Orthopaedic Surgery, Sir Run Run Shaw Hospital, Medical College of Zhejiang University, Hangzhou, China
  - Key Laboratory of Musculoskeletal System Degeneration and Regeneration Translational Research of Zhejiang Province, Hangzhou, China
|
37
|
iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins. Computational and Mathematical Methods in Medicine 2021; 2021:6664362. [PMID: 33505515 PMCID: PMC7808816 DOI: 10.1155/2021/6664362] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Received: 11/21/2020] [Revised: 12/13/2020] [Accepted: 12/28/2020] [Indexed: 02/07/2023]
Abstract
Bioluminescent proteins (BLPs) are a class of proteins that are widely distributed in many living organisms with various mechanisms of light emission including bioluminescence and chemiluminescence from luminous organisms. Bioluminescence has been commonly used in various analytical research methods of cellular processes, such as gene expression analysis, drug discovery, cellular imaging, and toxicity determination. However, the identification of bioluminescent proteins is challenging as they share poor sequence similarity. In this paper, we briefly reviewed the development of the computational identification of BLPs and subsequently proposed a novel prediction framework for identifying BLPs based on the eXtreme gradient boosting algorithm (XGBoost) and using sequence-derived features. To train the models, we collected BLP data from bacteria, eukaryotes, and archaea. Then, to obtain more effective prediction models, we examined the performances of different feature extraction methods and their combinations as well as classification algorithms. Finally, based on the optimal model, a novel predictor named iBLP was constructed to identify BLPs. The robustness of iBLP has been proved by experiments on training and independent datasets. Comparison with other published methods further demonstrated that the proposed method is powerful and could provide good performance for BLP identification. The webserver and software package for BLP identification are freely available at http://lin-group.cn/server/iBLP.
|
38
|
Gonzalez Hernandez F, Carter SJ, Iso-Sipilä J, Goldsmith P, Almousa AA, Gastine S, Lilaonitkul W, Kloprogge F, Standing JF. An automated approach to identify scientific publications reporting pharmacokinetic parameters. Wellcome Open Res 2021; 6:88. [PMID: 34381873 PMCID: PMC8343403 DOI: 10.12688/wellcomeopenres.16718.1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Accepted: 04/09/2021] [Indexed: 11/29/2022]
Abstract
Pharmacokinetic (PK) predictions of new chemical entities are aided by prior knowledge from other compounds. The development of robust algorithms that improve preclinical and clinical phases of drug development remains constrained by the need to search, curate and standardise PK information across the constantly-growing scientific literature. The lack of centralised, up-to-date and comprehensive repositories of PK data represents a significant limitation in the drug development pipeline. In this work, we propose a machine learning approach to automatically identify and characterise scientific publications reporting PK parameters from in vivo data, providing a centralised repository of PK literature. A dataset of 4,792 PubMed publications was labelled by field experts depending on whether in vivo PK parameters were estimated in the study. Different classification pipelines were compared using a bootstrap approach and the best-performing architecture was used to develop a comprehensive and automatically-updated repository of PK publications. The best-performing architecture encoded documents using unigram features and mean pooling of BioBERT embeddings obtaining an F1 score of 83.8% on the test set. The pipeline retrieved over 121K PubMed publications in which in vivo PK parameters were estimated and it was scheduled to perform weekly updates on newly published articles. All the relevant documents were released through a publicly available web interface (https://app.pkpdai.com) and characterised by the drugs, species and conditions mentioned in the abstract, to facilitate the subsequent search of relevant PK data. This automated, open-access repository can be used to accelerate the search and comparison of PK results, curate ADME datasets, and facilitate subsequent text mining tasks in the PK domain.
Affiliation(s)
- Simon J Carter
  - Institute of Pharmacy, Uppsala University, Uppsala, Sweden
  - Institute for Global Health, University College London, London, UK
- Silke Gastine
  - Great Ormond Street Institute of Child Health, University College London, London, UK
- Watjana Lilaonitkul
  - Institute of Health Informatics, University College London, London, UK
  - Health Data Research, London, UK
- Frank Kloprogge
  - Institute for Global Health, University College London, London, UK
- Joseph F Standing
  - Great Ormond Street Institute of Child Health, University College London, London, UK
|
39
|
Li L, Lin Y, Yu D, Liu Z, Gao Y, Qiao J. A Multi-Organ Fusion and LightGBM Based Radiomics Algorithm for High-Risk Esophageal Varices Prediction in Cirrhotic Patients. IEEE Access 2021; 9:15041-15052. [DOI: 10.1109/access.2021.3052776] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 01/04/2025]
|
40
|
Wang S, Xiong Y, Zhang Q, Su D, Yu C, Cao Y, Pan Y, Lu Q, Zuo Y, Yang L. Clinical significance and immunogenomic landscape analyses of the immune cell signature based prognostic model for patients with breast cancer. Brief Bioinform 2020; 22:6030109. [PMID: 33302293 DOI: 10.1093/bib/bbaa311] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Received: 08/09/2020] [Revised: 09/27/2020] [Accepted: 10/13/2020] [Indexed: 12/11/2022]
Abstract
Breast cancer is one of the most common types of cancers and the leading cause of death from malignancy among women worldwide. Tumor-infiltrating lymphocytes are a source of important prognostic biomarkers for breast cancer patients. In this study, based on the tumor-infiltrating lymphocytes in the tumor immune microenvironment, a risk score prognostic model was developed in the training cohort for risk stratification and prognosis prediction in breast cancer patients. The prognostic value of this risk score prognostic model was also verified in the two testing cohorts and the TCGA pan cancer cohort. Nomograms were also established in the training and testing cohorts to validate the clinical use of this model. Relationships between the risk score, intrinsic molecular subtypes, immune checkpoints, tumor-infiltrating immune cell abundances and the response to chemotherapy and immunotherapy were also evaluated. Based on these results, we can conclude that this risk score model could serve as a robust prognostic biomarker, provide therapeutic benefits for the development of novel chemotherapy and immunotherapy, and may be helpful for clinical decision making in breast cancer patients.
Affiliation(s)
- Shiyuan Wang
  - College of Bioinformatics Science and Technology, Harbin Medical University
- Yuqiang Xiong
  - College of Bioinformatics Science and Technology, Harbin Medical University
- Qi Zhang
  - College of Bioinformatics Science and Technology, Harbin Medical University
- Dongqing Su
  - College of Bioinformatics Science and Technology, Harbin Medical University
- Chunlu Yu
  - Public Health College, Harbin Medical University
- Yiyin Cao
  - Public Health College, Harbin Medical University
- Yi Pan
  - College of Bioinformatics Science and Technology, Harbin Medical University
- Qianzi Lu
  - College of Bioinformatics Science and Technology, Harbin Medical University
- Yongchun Zuo
  - State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University
- Lei Yang
  - College of Bioinformatics Science and Technology, Harbin Medical University
|
41
|
Construction and Validation of Predictive Model to Identify Critical Genes Associated with Advanced Kidney Disease. Int J Genomics 2020; 2020:7524057. [PMID: 33274190 PMCID: PMC7676934 DOI: 10.1155/2020/7524057] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/26/2020] [Revised: 08/21/2020] [Accepted: 10/19/2020] [Indexed: 12/12/2022]
Abstract
Background: Chronic kidney disease (CKD) is characterized by progressive renal function loss, which may finally lead to end-stage renal disease (ESRD). The study is aimed at identifying crucial genes related to CKD progression and constructing a disease prediction model to investigate risk factors. Methods: GSE97709 and GSE37171 datasets were downloaded from the GEO database including peripheral blood samples from subjects with CKD, ESRD, and healthy controls. Differentially expressed genes (DEGs) were identified and functional enrichment analysis was performed. A machine learning algorithm-based prediction model was constructed to identify crucial functional feature genes related to ESRD. Results: A total of 76 DEGs were screened from CKD vs. normal samples while 10,114 DEGs were identified from ESRD vs. CKD samples. For numerous genes related to ESRD, several GO biological terms and 141 signaling pathways were identified including markedly upregulated olfactory transduction and downregulated platelet activation pathway. The DEGs were clustered in three modules according to WGCNA access, namely, ME1, ME2, and ME3. By construction of the XGBoost model and dataset validation, we screened cohorts of genes associated with progressive CKD, such as FZD10, FOXD4, and FAM215A. FZD10 had the highest score (F score = 21) in the predictive model. Conclusion: Our results demonstrated that FZD10, FOXD4, PPP3R1, and UCP2 might be critical genes in CKD progression.
|
42
|
Hon KK, Ng CW, Chan PW. Machine learning based multi-index prediction of aviation turbulence over the Asia-Pacific. Machine Learning with Applications 2020. [DOI: 10.1016/j.mlwa.2020.100008] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 10/23/2022]
|
43
|
Zhang Y, Chen P, Gao Y, Ni J, Wang X. DBP-PSSM: Combination of Evolutionary Profiles with the XGBoost Algorithm to Improve the Identification of DNA-binding Proteins. Comb Chem High Throughput Screen 2020; 25:3-12. [PMID: 33238837 DOI: 10.2174/1386207323999201124203531] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/27/2020] [Revised: 10/16/2020] [Accepted: 10/29/2020] [Indexed: 11/22/2022]
Abstract
Background and Objective: DNA-binding proteins play important roles in a variety of biological processes, such as gene transcription and regulation, DNA replication and repair, DNA recombination and packaging, and the formation of chromatin and ribosomes. Therefore, it is urgent to develop a computational method to improve the recognition efficiency of DNA-binding proteins. Methods: We proposed a novel method, DBP-PSSM, which constructed the features from amino acid composition and evolutionary information of protein sequences. The maximum relevance, minimum redundancy (mRMR) method was employed to select the optimal features for establishing the XGBoost classifier; the novel model for predicting DNA-binding proteins, DBP-PSSM, was then established with 5-fold cross-validation on the training dataset. Results: DBP-PSSM achieved an accuracy of 81.18% and MCC of 0.657 in a test dataset, which outperformed many existing methods. These results demonstrated that our method can effectively predict DNA-binding proteins. Conclusion: The data and source code are provided at https://github.com/784221489/DNA-binding.
Affiliation(s)
- Yanping Zhang
  - School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Pengcheng Chen
  - School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Ya Gao
  - School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Jianwei Ni
  - School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Xiaosheng Wang
  - School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
|
44
|
Payra AK, Ghosh A. Identifying essential proteins using modified-monkey algorithm (MMA). Comput Biol Chem 2020; 88:107324. [DOI: 10.1016/j.compbiolchem.2020.107324] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/30/2020] [Revised: 05/28/2020] [Accepted: 06/24/2020] [Indexed: 11/15/2022]
|
45
|
Peng C, Zheng Y, Huang DS. Capsule Network Based Modeling of Multi-omics Data for Discovery of Breast Cancer-Related Genes. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1605-1612. [PMID: 30969931 DOI: 10.1109/tcbb.2019.2909905] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Indexed: 05/23/2023]
Abstract
Breast cancer is one of the most common cancers all over the world, which brings about more than 450,000 deaths each year. Although this malignancy has been extensively studied by a large number of researchers, its prognosis is still poor. Since therapeutic advances can be obtained based on gene signatures, there is an urgent need to discover genes related to breast cancer that may help uncover the mechanisms in cancer progression. We propose a deep learning method for the discovery of breast cancer-related genes by using Capsule Network based Modeling of Multi-omics Data (CapsNetMMD). In CapsNetMMD, we make use of known breast cancer-related genes to transform the issue of gene identification into the issue of supervised classification. The features of genes are generated through comprehensive integration of multi-omics data, e.g., mRNA expression, z scores for mRNA expression, DNA methylation, and two forms of DNA copy-number alterations (CNAs). By modeling features based on the capsule network, we identify breast cancer-related genes with a significantly better performance than other existing machine learning methods. The predicted genes with prognostic values play potentially important roles in breast cancer and may serve as candidates for biologists and medical scientists in future studies of biomarkers.
|
46
|
Predicting Preference of Transcription Factors for Methylated DNA Using Sequence Information. Molecular Therapy. Nucleic Acids 2020; 22:1043-1050. [PMID: 33294291 PMCID: PMC7691157 DOI: 10.1016/j.omtn.2020.07.035] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 05/27/2020] [Accepted: 07/28/2020] [Indexed: 12/12/2022]
Abstract
Transcription factors play key roles in cell-fate decisions by regulating 3D genome conformation and gene expression. The traditional view is that methylation of DNA hinders transcription factors from binding to it, but recent research has shown that many transcription factors prefer to bind to methylated DNA. Therefore, identifying such transcription factors and understanding their functions is a stepping-stone for studying methylation-mediated biological processes. In this paper, a two-step discriminated method was proposed to recognize transcription factors and their preference for methylated DNA based only on sequence information. In the first step, the proposed model was used to discriminate transcription factors from non-transcription factors. The areas under the curve (AUCs) are 0.9183 and 0.9116, respectively, for the 5-fold cross-validation test and independent dataset test. Subsequently, for the classification of transcription factors that prefer methylated DNA and transcription factors that prefer non-methylated DNA, our model could produce the AUCs of 0.7744 and 0.7356, respectively, for the 5-fold cross-validation test and independent dataset test. Based on the proposed model, a user-friendly web server called TFPred was built, which can be freely accessed at http://lin-group.cn/server/TFPred/.
|
47
|
Prediction of Type 2 Diabetes Risk and Its Effect Evaluation Based on the XGBoost Model. Healthcare (Basel) 2020; 8:healthcare8030247. [PMID: 32751894 PMCID: PMC7551910 DOI: 10.3390/healthcare8030247] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Received: 06/14/2020] [Revised: 07/27/2020] [Accepted: 07/29/2020] [Indexed: 11/17/2022]
Abstract
In view of the harm of diabetes to the population, we have introduced an ensemble learning algorithm, eXtreme Gradient Boosting (XGBoost), to predict the risk of type 2 diabetes and compared it with Support Vector Machines (SVM), the Random Forest (RF) and K-Nearest Neighbor (K-NN) algorithm in order to improve the prediction effect of existing models. A combination of convenience sampling and snowball sampling in Xicheng District, Beijing was used to conduct a questionnaire survey on the personal data, eating habits, exercise status and family medical history of 380 middle-aged and elderly people. Then, we trained the models and obtained the disease risk index for each sample with 10-fold cross-validation. Experiments were made to compare the commonly used machine learning algorithms mentioned above and we found that XGBoost had the best prediction effect, with an average accuracy of 0.8909 and an area under the receiver operating characteristic curve (AUC) of 0.9182. Therefore, due to the superiority of its architecture, XGBoost has more outstanding prediction accuracy and generalization ability than existing algorithms in predicting the risk of type 2 diabetes, which is conducive to the intelligent prevention and control of diabetes in the future.
|
48
|
Diagnostic classification of cancers using extreme gradient boosting algorithm and multi-omics data. Comput Biol Med 2020; 121:103761. [DOI: 10.1016/j.compbiomed.2020.103761] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 01/30/2020] [Revised: 04/10/2020] [Accepted: 04/10/2020] [Indexed: 12/31/2022]
|
49
|
Predictions of Apoptosis Proteins by Integrating Different Features Based on Improving Pseudo-Position-Specific Scoring Matrix. BioMed Research International 2020; 2020:4071508. [PMID: 32420339 PMCID: PMC7201498 DOI: 10.1155/2020/4071508] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/05/2019] [Accepted: 12/19/2019] [Indexed: 11/25/2022]
Abstract
Apoptosis proteins are strongly related to many diseases and play an indispensable role in maintaining the dynamic balance between cell death and division in vivo. Obtaining localization information on apoptosis proteins is necessary in understanding their function. To date, few researchers have focused on the problem of apoptosis data imbalance before classification, although this imbalance is prone to cause misclassification. Therefore, in this work, we introduce a method to resolve this problem and to enhance prediction accuracy. Firstly, the features of the protein sequence are captured by combining Improving Pseudo-Position-Specific Scoring Matrix (IM-Psepssm) with the Bidirectional Correlation Coefficient (Bid-CC) algorithm from position-specific scoring matrix. Secondly, different feature fusion and resampling strategies are used to reduce the impact of imbalance on apoptosis protein datasets. Finally, the feature vector is passed to a Support Vector Machine (SVM) to train the classification model, and the prediction accuracy is evaluated by jackknife cross-validation tests. The experimental results indicate that, under the same feature vector, adopting resampling methods remarkably improves many key indicators relative to the method without resampling for predicting the localization of apoptosis proteins in the ZD98, ZW225, and CL317 databases. Additionally, we also present new user-friendly local software for readers to apply; the codes and software can be freely accessed at https://github.com/ruanxiaoli/Im-Psepssm.
|
50
|
Campos TL, Korhonen PK, Sternberg PW, Gasser RB, Young ND. Predicting gene essentiality in Caenorhabditis elegans by feature engineering and machine-learning. Comput Struct Biotechnol J 2020; 18:1093-1102. [PMID: 32489524 PMCID: PMC7251299 DOI: 10.1016/j.csbj.2020.05.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 03/23/2020] [Revised: 05/01/2020] [Accepted: 05/06/2020] [Indexed: 02/08/2023]
Abstract
Defining genes that are essential for life has major implications for understanding critical biological processes and mechanisms. Although essential genes have been identified and characterised experimentally using functional genomic tools, it is challenging to predict with confidence such genes from molecular and phenomic data sets using computational methods. Using extensive data sets available for the model organism Caenorhabditis elegans, we constructed here a machine-learning (ML)-based workflow for the prediction of essential genes on a genome-wide scale. We identified strong predictors for such genes and showed that trained ML models consistently achieve highly-accurate classifications. Complementary analyses revealed an association between essential genes and chromosomal location. Our findings reveal that essential genes in C. elegans tend to be located in or near the centre of autosomal chromosomes; are positively correlated with low single nucleotide polymorphism (SNP) densities and epigenetic markers in promoter regions; are involved in protein and nucleotide processing; are transcribed in most cells; are enriched in reproductive tissues or are targets for small RNAs bound to the Argonaute CSR-1. Based on these results, we hypothesise an interplay between epigenetic markers and small RNA pathways in the germline, with transcription-based memory; this hypothesis warrants testing. From a technical perspective, further work is needed to evaluate whether the present ML-based approach will be applicable to other metazoans (including Drosophila melanogaster) for which comprehensive data sets (i.e. genomic, transcriptomic, proteomic, variomic, epigenetic and phenomic) are available.
Key Words
- CDS, coding sequence
- CRISPR, Clustered Regularly Interspaced Short Palindromic Repeats
- Caenorhabditis elegans
- ES, Essentiality Score
- EST, expressed sequence tag
- Essential genes
- Essentiality predictions
- GBM, Gradient Boosting Method
- GFF, general feature format
- GLM, Generalised Linear Model
- GO, gene ontology
- ML, machine-learning
- Machine-learning
- NN, Artificial Neural Network
- PPI, protein-protein interaction
- PR-AUC, Area Under the Precision-Recall Curve
- RF, Random Forest
- RNAi, RNA interference
- ROC-AUC, Area Under the Receiver Operating Characteristic Curve
- SNP, single nucleotide polymorphism
- SPLS, Sparse Partial Least Squares
- SVM, Support-Vector Machine
- TEA, Tissue Enrichment Analysis tool (WormBase)
- TSS, transcription start site
- VCF, variant call file
Affiliation(s)
- Tulio L Campos
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
  - Instituto Aggeu Magalhães, Fundação Oswaldo Cruz (IAM-Fiocruz), Recife, Pernambuco, Brazil
- Pasi K Korhonen
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
- Paul W Sternberg
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, United States
- Robin B Gasser
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
- Neil D Young
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
|