1
Shen K, Lin J. Unraveling the Molecular Landscape of Neutrophil Extracellular Traps in Severe Asthma: Identification of Biomarkers and Molecular Clusters. Mol Biotechnol 2025; 67:1852-1866. [PMID: 38801616 DOI: 10.1007/s12033-024-01164-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Accepted: 04/08/2024] [Indexed: 05/29/2024]
Abstract
Neutrophil extracellular traps (NETs) play a central role in chronic airway diseases. However, the precise genetic basis linking NETs to the development of severe asthma remains elusive. This study aims to unravel the molecular characterization of NET-related genes (NRGs) in severe asthma and to reliably identify relevant molecular clusters and biomarkers. We analyzed RNA-seq data from the Gene Expression Omnibus database. Interaction analysis revealed fifty differentially expressed NRGs (DE-NRGs). Subsequently, the non-negative matrix factorization algorithm categorized samples from severe asthma patients. A machine learning algorithm then identified core NRGs that were highly associated with severe asthma. DE-NRGs were correlated and subjected to protein-protein interaction analysis. Unsupervised consensus clustering of the core gene expression profiles delineated two distinct clusters (C1 and C2) characterizing severe asthma. Functional enrichment highlighted immune-related pathways in the C2 cluster. Core gene selection included the Boruta algorithm, support vector machine, and least absolute shrinkage and selection operator (LASSO) algorithms. Diagnostic performance was assessed by receiver operating characteristic curves. This study addresses the molecular characterization of NRGs in adult severe asthma, revealing distinct clusters based on DE-NRGs. Potential biomarkers (TIMP1 and NFIL3) were identified that may be important for early diagnosis and treatment of severe asthma.
Affiliation(s)
- Kunlu Shen
  - National Center for Respiratory Medicine, National Clinical Research Center for Respiratory Diseases, Institute of Respiratory Medicine, Chinese Academy of Medical Sciences, Department of Pulmonary and Critical Care Medicine, Center of Respiratory Medicine, China-Japan Friendship Hospital, No. 2, East Yinghua Road, Chaoyang District, Beijing, 100029, China
  - Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, China
- Jiangtao Lin
  - National Center for Respiratory Medicine, National Clinical Research Center for Respiratory Diseases, Institute of Respiratory Medicine, Chinese Academy of Medical Sciences, Department of Pulmonary and Critical Care Medicine, Center of Respiratory Medicine, China-Japan Friendship Hospital, No. 2, East Yinghua Road, Chaoyang District, Beijing, 100029, China
  - Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, China
2
Choudhary A, Anand A, Singh A, Roy P, Singh N, Kumar V, Sharma S, Baranwal M. Machine learning-based ensemble approach in prediction of lung cancer predisposition using XRCC1 gene polymorphism. J Biomol Struct Dyn 2024; 42:7828-7837. [PMID: 37545160 DOI: 10.1080/07391102.2023.2242492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2022] [Accepted: 07/23/2023] [Indexed: 08/08/2023]
Abstract
Machine learning approaches have shown promising results in predicting cancer. In the current study, genotype data for five single nucleotide polymorphisms (SNPs) of the DNA repair gene XRCC1 (XRCC1 399, XRCC1 194, XRCC1 206, XRCC1 632, XRCC1 280) from a north Indian population, along with four smoking-status variables, were used as input to the proposed ensemble model to predict an individual's susceptibility to lung cancer. The prediction accuracy of the proposed ensemble model for cancer predisposition was 85%. Model performance was also evaluated using sensitivity, specificity, precision, and the Gini index, which fell in the range of 0.83-0.87. The proposed model also outperformed the individual models (LM, SVM, RF, KNN, and a baseline neural net) on all evaluation parameters. Collectively, the results suggest the potential of the proposed ensemble model in predicting the risk of cancer based on XRCC1 SNP data. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Abhishek Choudhary
  - Department of Computer Science, Thapar Institute of Engineering & Technology, India
- Adarsh Anand
  - Department of Electronics & Communication Engineering, Thapar Institute of Engineering & Technology, India
- Amrita Singh
  - Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Pratima Roy
  - Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Navneet Singh
  - Department of Pulmonary Medicine, Postgraduate Institute of Medical Education and Research (PGIMER), Chandigarh, India
- Vinay Kumar
  - Department of Electronics & Communication Engineering, Thapar Institute of Engineering & Technology, India
- Siddharth Sharma
  - Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Manoj Baranwal
  - Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
3
Gomes LGDS, Cruz ÁASD, de Santana MBR, Pinheiro GP, Santana CVN, Santos CBS, Boorgula MP, Campbell M, Machado ADS, Veiga RV, Barnes KC, Costa RDS, Figueiredo CA. Predictive genetic panel for adult asthma using machine learning methods. J Allergy Clin Immunol Glob 2024; 3:100282. [PMID: 38952894 PMCID: PMC11215340 DOI: 10.1016/j.jacig.2024.100282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Revised: 02/20/2024] [Accepted: 04/05/2024] [Indexed: 07/03/2024]
Abstract
Background: Asthma is a chronic inflammatory disease of the airways that is heterogeneous and multifactorial, making its accurate characterization a complex process. Therefore, identifying the genetic variations associated with asthma and discovering the molecular interactions between the omics that confer risk of developing this disease will help us to unravel the biological pathways involved in its pathogenesis. Objective: We sought to develop a predictive genetic panel for asthma using machine learning methods. Methods: We tested 3 variable selection methods: Boruta's algorithm, the top 200 genome-wide association study markers according to their respective P values, and an elastic net regression. Ten different algorithms were chosen for the classification tests. A predictive panel was built on the basis of joint scores between the classification algorithms. Results: Two variable selection methods, Boruta and genome-wide association studies, were statistically similar in terms of the average accuracies generated, whereas elastic net had the worst overall performance. The predictive genetic panel was completed with 155 single-nucleotide variants, with 91.18% accuracy, 92.75% sensitivity, and 89.55% specificity using the support vector machine algorithm. The markers used range from known single-nucleotide variants to those not previously described in the literature. Our study shows potential in creating genetic prediction panels with tailored penalties per marker, aiding in the identification of optimal machine learning methods for intricate results. Conclusions: This method is able to classify asthma and nonasthma effectively, proving its potential utility in clinical prediction and diagnosis.
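The accuracy, sensitivity, and specificity reported in this abstract are all derived from one confusion matrix. The sketch below shows the arithmetic only. The counts are hypothetical: they were chosen because they happen to reproduce the reported percentages, and the abstract does not state the size of the evaluation set.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, and specificity as percentages."""
    total = tp + fn + tn + fp
    return {
        "accuracy": round(100 * (tp + tn) / total, 2),
        "sensitivity": round(100 * tp / (tp + fn), 2),  # true-positive rate
        "specificity": round(100 * tn / (tn + fp), 2),  # true-negative rate
    }


# Hypothetical counts (an assumption, not taken from the paper).
metrics = classification_metrics(tp=64, fn=5, tn=60, fp=7)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters because accuracy alone can hide poor performance on the smaller class.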
Affiliation(s)
- Cinthia Vila Nova Santana
  - Programa de Controle da Asma na Bahia (ProAR), Universidade Federal da Bahia, Salvador, Bahia, Brazil
- Monica Campbell
  - Department of Medicine, University of Colorado Denver, Aurora, Colo
- Adelmir de Souza Machado
  - Instituto de Ciências da Saúde, Universidade Federal da Bahia, Salvador, Bahia, Brazil
  - Programa de Controle da Asma na Bahia (ProAR), Universidade Federal da Bahia, Salvador, Bahia, Brazil
- Rafael Valente Veiga
  - Laboratory of Lymphocyte Signalling and Development, The Babraham Institute, Cambridge, United Kingdom
- Ryan dos Santos Costa
  - Instituto de Ciências da Saúde, Universidade Federal da Bahia, Salvador, Bahia, Brazil
4
Gunawardana J, Viswakula SD, Rannan-Eliya RP, Wijemunige N. Machine learning approaches for asthma disease prediction among adults in Sri Lanka. Health Informatics J 2024; 30:14604582241283968. [PMID: 39262121 DOI: 10.1177/14604582241283968] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/13/2024]
Abstract
Objectives: Addressing the challenge of cost-effective asthma diagnosis amidst diverse symptom patterns among patients, this study aims to develop a machine learning-based asthma prediction tool for self-detection of asthma. Methods: Data from 6,665 participants in the Sri Lanka Health and Ageing Study (2018-2019) are used for this research. Thirteen machine learning algorithms, including Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, Naïve Bayes, K-Nearest Neighbors, Gradient Boost, XGBoost, AdaBoost, CatBoost, LightGBM, Multi-Layer Perceptron, and Probabilistic Neural Network, are employed. Results: A hybrid version of Logistic Regression and LightGBM outperformed other models, achieving an AUC of 0.9062 and 79.85% sensitivity. Key predictive features for asthma include wheezing, breathlessness with wheezing, shortness of breath attacks, coughing attacks, chest tightness, nasal allergies, physical activity, passive smoking, ethnicity, and residential sector. Conclusion: Combining Logistic Regression and LightGBM models can effectively predict adult asthma based on self-reported symptoms and demographic and behavioural characteristics. The proposed expert system assists clinicians and patients in diagnosing potential asthma cases.
Affiliation(s)
- Jrna Gunawardana
  - Institute for Health Policy, Sri Lanka and Robert Gordon University, UK
- S D Viswakula
  - Department of Statistics, University of Colombo, Sri Lanka
5
Darsha Jayamini WK, Mirza F, Asif Naeem M, Chan AHY. Investigating Machine Learning Techniques for Predicting Risk of Asthma Exacerbations: A Systematic Review. J Med Syst 2024; 48:49. [PMID: 38739297 PMCID: PMC11090925 DOI: 10.1007/s10916-024-02061-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Accepted: 04/04/2024] [Indexed: 05/14/2024]
Abstract
Asthma, a common chronic respiratory disease among children and adults, affects more than 200 million people worldwide and causes about 450,000 deaths each year. Machine learning is increasingly applied in healthcare to assist health practitioners in decision-making. In asthma management, machine learning excels in performing well-defined tasks, such as diagnosis, prediction, medication, and management. However, there remain uncertainties about how machine learning can be applied to predict asthma exacerbation. This study aimed to systematically review recent applications of machine learning techniques in predicting the risk of asthma attacks to assist asthma control and management. A total of 860 studies were initially identified from five databases. After the screening and full-text review, 20 studies were selected for inclusion in this review. The review considered recent studies published from January 2010 to February 2023. The 20 studies used machine learning techniques to support future asthma risk prediction by using various data sources such as clinical, medical, biological, and socio-demographic data sources, as well as environmental and meteorological data. While some studies considered prediction as a category, other studies predicted the probability of exacerbation. Only a group of studies applied prediction windows. The paper proposes a conceptual model to summarise how machine learning and available data sources can be leveraged to produce effective models for the early detection of asthma attacks. The review also generated a list of data sources that other researchers may use in similar work. Furthermore, we present opportunities for further research and the limitations of the preceding studies.
Affiliation(s)
- Widana Kankanamge Darsha Jayamini
  - School of Engineering, Computer and Mathematical Sciences, Auckland University of Technology, Auckland, 1010, New Zealand
  - Department of Software Engineering, Faculty of Computing and Technology, University of Kelaniya, Kelaniya, 11300, Sri Lanka
- Farhaan Mirza
  - School of Engineering, Computer and Mathematical Sciences, Auckland University of Technology, Auckland, 1010, New Zealand
- M Asif Naeem
  - Department of Data Science & Artificial Intelligence, National University of Computer and Emerging Sciences (NUCES), Islamabad, 44000, Pakistan
- Amy Hai Yan Chan
  - School of Pharmacy, Faculty of Medical and Health Sciences, University of Auckland, Auckland, 1142, New Zealand
6
Barnett EJ, Onete DG, Salekin A, Faraone SV. Genomic Machine Learning Meta-regression: Insights on Associations of Study Features With Reported Model Performance. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:169-177. [PMID: 38109236 DOI: 10.1109/tcbb.2023.3343808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/20/2023]
Abstract
Many studies have been conducted with the goal of correctly predicting diagnostic status of a disorder using the combination of genomic data and machine learning. It is often hard to judge which components of a study led to better results and whether better reported results represent a true improvement or an uncorrected bias inflating performance. We extracted information about the methods used and other differentiating features in genomic machine learning models. We used these features in linear regressions predicting model performance. We tested for univariate and multivariate associations as well as interactions between features. Of the models reviewed, 46% used feature selection methods that can lead to data leakage. Across our models, the number of hyperparameter optimizations reported, data leakage due to feature selection, model type, and modeling an autoimmune disorder were significantly associated with an increase in reported model performance. We found a significant, negative interaction between data leakage and training size. Our results suggest that methods susceptible to data leakage are prevalent among genomic machine learning research, resulting in inflated reported performance. Best practice guidelines that promote the avoidance and recognition of data leakage may help the field avoid biased results.
7
Qin ZM, Liang SQ, Long JX, Deng JM, Wei X, Yang ML, Tang SJ, Li HL. Importance of GWAS Risk Loci and Clinical Data in Predicting Asthma Using Machine-learning Approaches. Comb Chem High Throughput Screen 2024; 27:400-407. [PMID: 37278039 DOI: 10.2174/1386207326666230602161939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2022] [Revised: 04/17/2023] [Accepted: 05/04/2023] [Indexed: 06/07/2023]
Abstract
INTRODUCTION: To understand the risk factors of asthma, we combined genome-wide association study (GWAS) risk loci and clinical data in predicting asthma using machine-learning approaches. METHODS: A case-control study with 123 asthmatics and 100 controls was conducted in the Zhuang population in Guangxi. GWAS risk loci were detected using polymerase chain reaction, and clinical data were collected. Machine-learning approaches were used to identify the major factors that contribute to asthma. RESULTS: A total of 14 GWAS risk loci with clinical data were analyzed using 10 repetitions of 10-fold cross-validation for all machine-learning models. Using GWAS risk loci or clinical data alone, the best models reached area under the curve (AUC) values of 64.3% and 71.4%, respectively. Combining GWAS risk loci and clinical data, XGBoost produced the best model with an AUC of 79.7%, indicating that the combination of genetic and clinical data can improve performance. We then ranked the features by importance and found the top six risk factors for predicting asthma to be rs3117098, rs7775228, family history, rs2305480, rs4833095, and body mass index. CONCLUSION: Asthma-prediction models based on GWAS risk loci and clinical data can accurately predict asthma, and thus provide insights into the disease pathogenesis.
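Repeated k-fold cross-validation, as used in the Results above, reshuffles the samples before each repetition and then rotates every fold through the test role. The following is a minimal sketch of the index bookkeeping only, assuming a plain (non-stratified) split and an arbitrary seed; the abstract does not say whether the authors stratified by case status, and the function name is illustrative.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) for each fold of each repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition for every repetition
        # Deal indices round-robin so fold sizes differ by at most one.
        folds = [indices[i::k] for i in range(k)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, test_idx


# 223 samples = 123 asthmatics + 100 controls, as stated in the abstract.
splits = list(repeated_kfold(n_samples=223))
print(len(splits))  # 10 repetitions x 10 folds
```

Averaging a metric such as AUC over all splits reduces the variance that a single 10-fold partition would have on a dataset this small.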
Affiliation(s)
- Zan-Mei Qin
  - Department of Respiratory and Critical Care Medicine, First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, China
- Si-Qiao Liang
  - Department of Respiratory and Critical Care Medicine, First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, China
- Jian-Xiong Long
  - Department of Epidemiology and Health Statistics, School of Public Health of Guangxi Medical University, Nanning, Guangxi, China
- Jing-Min Deng
  - Department of Respiratory and Critical Care Medicine, First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, China
- Xuan Wei
  - Department of Respiratory and Critical Care Medicine, First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, China
- Mei-Ling Yang
  - Department of Respiratory and Critical Care Medicine, First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, China
- Shao-Jie Tang
  - School of Automation, Xi'an University of Posts and Telecommunications, Xi'an, Shaanxi, 710121, China
  - Xi'an Key Laboratory of Advanced Controlling and Intelligent Processing (ACIP), Xi'an, Shaanxi, 710121, China
- Hai-Li Li
  - Department of Respiratory and Critical Care Medicine, First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, China
8
Sigala RE, Lagou V, Shmeliov A, Atito S, Kouchaki S, Awais M, Prokopenko I, Mahdi A, Demirkan A. Machine Learning to Advance Human Genome-Wide Association Studies. Genes (Basel) 2023; 15:34. [PMID: 38254924 PMCID: PMC10815885 DOI: 10.3390/genes15010034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2023] [Revised: 12/19/2023] [Accepted: 12/22/2023] [Indexed: 01/24/2024]
Abstract
Machine learning, including deep learning, reinforcement learning, and generative artificial intelligence, is revolutionising every area of our lives when data are made available. With the help of these methods, we can decipher information from larger datasets while addressing the complex nature of biological systems in a more efficient way. Although machine learning methods were introduced to human genetic epidemiological research as early as 2004, they have never been used to their full capacity. In this review, we outline some of the main applications of machine learning to assigning human genetic loci to health outcomes. We summarise widely used methods and discuss their advantages and challenges. We also identify several tools, such as Combi, GenNet, and GMSTool, specifically designed to integrate these methods for hypothesis-free analysis of genetic variation data. We elaborate on the additional value and limitations of these tools from a geneticist's perspective. Finally, we discuss the fast-moving field of foundation models and large multi-modal omics biobank initiatives.
Affiliation(s)
- Rafaella E. Sigala
  - Section of Statistical Multi-Omics, Department of Clinical and Experimental Medicine, Guildford GU2 7XH, Surrey, UK
- Vasiliki Lagou
  - Section of Statistical Multi-Omics, Department of Clinical and Experimental Medicine, Guildford GU2 7XH, Surrey, UK
- Aleksey Shmeliov
  - Section of Statistical Multi-Omics, Department of Clinical and Experimental Medicine, Guildford GU2 7XH, Surrey, UK
- Sara Atito
  - Surrey Institute for People-Centred Artificial Intelligence, University of Surrey, Guildford GU2 7XH, Surrey, UK
  - Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, Surrey, UK
- Samaneh Kouchaki
  - Surrey Institute for People-Centred Artificial Intelligence, University of Surrey, Guildford GU2 7XH, Surrey, UK
  - Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, Surrey, UK
- Muhammad Awais
  - Surrey Institute for People-Centred Artificial Intelligence, University of Surrey, Guildford GU2 7XH, Surrey, UK
  - Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, Surrey, UK
- Inga Prokopenko
  - Section of Statistical Multi-Omics, Department of Clinical and Experimental Medicine, Guildford GU2 7XH, Surrey, UK
  - Surrey Institute for People-Centred Artificial Intelligence, University of Surrey, Guildford GU2 7XH, Surrey, UK
- Adam Mahdi
  - Oxford Internet Institute, University of Oxford, Oxford OX1 3JS, Oxfordshire, UK
- Ayse Demirkan
  - Section of Statistical Multi-Omics, Department of Clinical and Experimental Medicine, Guildford GU2 7XH, Surrey, UK
  - Surrey Institute for People-Centred Artificial Intelligence, University of Surrey, Guildford GU2 7XH, Surrey, UK
9
Astrologo NCN, Gaudillo JD, Albia JR, Roxas-Villanueva RML. Genetic risk assessment based on association and prediction studies. Sci Rep 2023; 13:15230. [PMID: 37709797 PMCID: PMC10502006 DOI: 10.1038/s41598-023-41862-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/01/2023] [Accepted: 09/01/2023] [Indexed: 09/16/2023]
Abstract
The genetic basis of phenotypic emergence provides valuable information for assessing individual risk. While association studies have been pivotal in identifying genetic risk factors within a population, complementing them with insights from prediction studies that assess individual-level risk offers a more comprehensive approach to understanding phenotypic expression. In this study, we established personalized risk assessment models using single-nucleotide polymorphism (SNP) data from 200 Korean patients, of whom 100 experienced hepatitis B surface antigen (HBsAg) seroclearance and 100 demonstrated high levels of HBsAg. The risk assessment models determined the predictive power of the following: (1) genome-wide association study (GWAS)-identified candidate biomarkers considered significant in a reference study and (2) machine learning (ML)-identified candidate biomarkers with the highest feature importance scores obtained by using random forest (RF). While utilizing all features yielded 64% model accuracy, using relevant biomarkers achieved higher model accuracies: 82% for 52 GWAS-identified candidate biomarkers, 71% for three GWAS-identified biomarkers, and 80% for 150 ML-identified candidate biomarkers. Findings highlight that the joint contributions of relevant biomarkers significantly influence phenotypic emergence. Moreover, adding ML-identified candidate biomarkers to the pool of GWAS-identified candidate biomarkers improved predictive accuracy to 90%, demonstrating the capability of ML as an auxiliary analysis to GWAS. Furthermore, some of the ML-identified candidate biomarkers were found to be linked with hepatocellular carcinoma (HCC), reinforcing previous claims that HCC can still occur despite the absence of HBsAg.
Affiliation(s)
- Nicole Cathlene N Astrologo
  - Data Analytics Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Computational Interdisciplinary Research Laboratory (CINTERLabs), University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
- Joverlyn D Gaudillo
  - Data Analytics Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Computational Interdisciplinary Research Laboratory (CINTERLabs), University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Domingo AI Research Center (DARC Labs), 1606, Pasig, Philippines
- Jason R Albia
  - Domingo AI Research Center (DARC Labs), 1606, Pasig, Philippines
  - Venn Biosciences Corporation Dba InterVenn Biosciences, Metro Manila, Pasig, Philippines
  - Graduate School, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
- Ranzivelle Marianne L Roxas-Villanueva
  - Data Analytics Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Computational Interdisciplinary Research Laboratory (CINTERLabs), University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
10
A machine-learning approach for nonalcoholic steatohepatitis susceptibility estimation. Indian J Gastroenterol 2022; 41:475-482. [PMID: 36367682 DOI: 10.1007/s12664-022-01263-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/11/2021] [Accepted: 05/02/2022] [Indexed: 11/13/2022]
Abstract
BACKGROUND: Nonalcoholic steatohepatitis (NASH), a severe form of nonalcoholic fatty liver disease, can lead to advanced liver damage and has become an increasingly prominent health problem worldwide. Predictive models for early identification of high-risk individuals could help identify preventive and interventional measures. Traditional epidemiological models, which are based on statistical analysis, have limited predictive power. In the current study, a novel machine-learning approach was developed for individual NASH susceptibility prediction using candidate single nucleotide polymorphisms (SNPs). METHODS: A total of 245 NASH patients and 120 healthy individuals were included in the study. Genotypes of candidate SNPs, including two SNPs in the cytochrome P450 family 2 subfamily E member 1 (CYP2E1) gene (rs6413432, rs3813867), two SNPs in the glucokinase regulator (GCKR) gene (rs780094, rs1260326), and the rs738409 SNP in patatin-like phospholipase domain-containing 3 (PNPLA3), together with gender, were used to develop models for identifying at-risk individuals. To predict an individual's susceptibility to NASH, nine different machine-learning models were constructed. These models combined three feature inputs (all features, or features selected by Chi-square or by support vector machine recursive feature elimination (SVM-RFE)) with three classification algorithms: k-nearest neighbor (KNN), multi-layer perceptron (MLP), and random forest (RF). All nine machine-learning models were trained using 80% of both the NASH patient and the healthy control data and were then tested on the remaining 20% of both groups. The models' performance was compared on accuracy, precision, sensitivity, and F measure. RESULTS: Among all nine machine-learning models, the KNN classifier with all features as input showed the highest performance with 86% F measure and 79% accuracy.
CONCLUSIONS: Machine learning based on genomic variation may be applicable for estimating an individual's susceptibility to developing NASH among high-risk groups with a high degree of accuracy, precision, and sensitivity.
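Training on 80% and testing on 20% of both the patient and the control group amounts to a stratified split, which preserves the class ratio in each partition. The sketch below illustrates this with the group sizes given in the abstract. The shuffling, the seed, and the rounding rule are assumptions, since the abstract does not describe how the split was drawn.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices into train/test, sampling each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        cut = round(train_fraction * len(members))  # assumed rounding rule
        train_idx += members[:cut]
        test_idx += members[cut:]
    return train_idx, test_idx


# 245 NASH patients (label 1) and 120 healthy individuals (label 0).
labels = [1] * 245 + [0] * 120
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With an imbalanced sample like this one (roughly two patients per control), a non-stratified split could leave the test set with too few controls to estimate specificity reliably.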
11
Silva PP, Gaudillo JD, Vilela JA, Roxas-Villanueva RML, Tiangco BJ, Domingo MR, Albia JR. A machine learning-based SNP-set analysis approach for identifying disease-associated susceptibility loci. Sci Rep 2022; 12:15817. [PMID: 36138111 PMCID: PMC9499949 DOI: 10.1038/s41598-022-19708-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/23/2022] [Accepted: 09/02/2022] [Indexed: 11/17/2022]
Abstract
Identifying disease-associated susceptibility loci is one of the most pressing and crucial challenges in modeling complex diseases. Existing approaches to biomarker discovery are subject to several limitations, including underpowered detection, neglect of variant interactions, and restrictive dependence on prior biological knowledge. Addressing these challenges necessitates more ingenious ways of approaching the "missing heritability" problem. This study aims to discover disease-associated susceptibility loci by augmenting a previous genome-wide association study (GWAS) with an integration of random forest and cluster analysis. The proposed integrated framework is applied to GWAS data on hepatitis B virus surface antigen (HBsAg) seroclearance. Multiple cluster analyses were performed on (1) single nucleotide polymorphisms (SNPs) considered significant by GWAS and (2) SNPs with the highest feature importance scores obtained using random forest. The resulting SNP-sets from the cluster analyses were subsequently tested for trait association. Three susceptibility loci possibly associated with HBsAg seroclearance were identified: (1) SNP rs2399971, (2) gene LINC00578, and (3) locus 11p15. SNP rs2399971 is a biomarker reported in the literature to be significantly associated with HBsAg seroclearance in patients who had received antiviral treatment. The latter two loci are linked with diseases influenced by the presence of hepatitis B virus infection. These findings demonstrate the potential of the proposed integrated framework in identifying disease-associated susceptibility loci. With further validation, the results could aid in better understanding complex disease etiologies and provide inputs for a more advanced disease risk assessment for patients.
Affiliation(s)
- Princess P Silva
  - Data-Driven Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Computational Interdisciplinary Research Laboratory (CINTERLabs), University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
- Joverlyn D Gaudillo
  - Data-Driven Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Computational Interdisciplinary Research Laboratory (CINTERLabs), University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Domingo AI Research Center (DARC Labs), 1606, Pasig City, Philippines
- Julianne A Vilela
  - Philippine Genome Center Program for Agriculture, Office of the Vice Chancellor for Research and Extension, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
- Ranzivelle Marianne L Roxas-Villanueva
  - Data-Driven Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Computational Interdisciplinary Research Laboratory (CINTERLabs), University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
- Beatrice J Tiangco
  - National Institute of Health, UP College of Medicine, Taft Avenue, 1000, Manila, Philippines
  - Division of Medicine, The Medical City, 1605, Pasig, Philippines
- Mario R Domingo
  - Domingo AI Research Center (DARC Labs), 1606, Pasig City, Philippines
- Jason R Albia
  - Data-Driven Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Domingo AI Research Center (DARC Labs), 1606, Pasig City, Philippines
  - Venn Biosciences Corporation Dba InterVenn Biosciences, Metro Manila, Philippines
12
|
Tai KY, Dhaliwal J, Wong K. Risk score prediction model based on single nucleotide polymorphism for predicting malaria: a machine learning approach. BMC Bioinformatics 2022; 23:325. [PMID: 35934714 PMCID: PMC9358850 DOI: 10.1186/s12859-022-04870-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Accepted: 08/01/2022] [Indexed: 11/25/2022]
Abstract
BACKGROUND: Malaria risk prediction is currently limited to advanced statistical methods, such as time series and cluster analysis on epidemiological data. Machine learning models have been explored to study the complexity of malaria through blood smear images and environmental data. However, to the best of our knowledge, no study has analysed the contribution of Single Nucleotide Polymorphisms (SNPs) to malaria using a machine learning model. This study aims to quantify an individual's susceptibility to the development of malaria by using risk scores obtained from the cumulative effects of SNPs, known as weighted genetic risk scores (wGRS). RESULTS: We proposed an SNP-based feature extraction algorithm that incorporates an individual's susceptibility to malaria to generate the feature set. However, it can become computationally expensive for a machine learning model to learn from many SNPs. Therefore, we reduced the feature set by employing the Logistic Regression and Recursive Feature Elimination (LR-RFE) method to select SNPs that improve the efficacy of our model. Next, we calculated the wGRS of the selected feature set, which is used as the model's target variable. Moreover, as a comparison for the wGRS-only model, we calculated and evaluated the combination of wGRS with genotype frequency (wGRS + GF). Finally, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), and Ridge regression algorithms are utilized to establish the machine learning models for malaria risk prediction. CONCLUSIONS: Our proposed approach identified SNP rs334 as the most contributing feature, with an importance score of 6.224, compared to the baseline with an importance score of 1.1314. This is an important result, as prior studies have shown that rs334 is a major genetic risk factor for malaria.
The analysis and comparison of the three machine learning models demonstrated that LightGBM achieves the best model performance, with a Mean Absolute Error (MAE) of 0.0373. Furthermore, with wGRS + GF, all models performed significantly better than with wGRS alone, and LightGBM again obtained the best performance (MAE of 0.0033).
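As a rough illustration of the two quantities named in this abstract, the sketch below computes a weighted genetic risk score and a mean absolute error in plain Python. The abstract does not state how the SNP weights are derived, so the use of natural-log odds ratios here is an assumption (a common convention), and the function names, allele counts, and odds ratios are hypothetical values chosen only for the example.

```python
import math


def weighted_genetic_risk_score(risk_allele_counts, odds_ratios):
    """Return sum_i count_i * ln(OR_i) for one individual.

    Assumption: weights are natural-log odds ratios; the abstract does not
    specify the weighting scheme the authors used.
    """
    if len(risk_allele_counts) != len(odds_ratios):
        raise ValueError("one odds ratio is needed per SNP")
    return sum(c * math.log(o) for c, o in zip(risk_allele_counts, odds_ratios))


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical individual: risk-allele counts (0, 1, or 2) at three SNPs.
counts = [2, 1, 0]
# Hypothetical per-SNP odds ratios (not taken from the study).
odds_ratios = [2.0, 1.5, 3.0]

score = weighted_genetic_risk_score(counts, odds_ratios)
print(f"wGRS = {score:.4f}")
```

Because MAE measures error, lower values indicate better predictions, which is why the smallest MAE corresponds to the best-performing model in the comparison above.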
Affiliation(s)
- Kah Yee Tai
  - School of Information Technology, Monash University Malaysia, Subang Jaya, Selangor, Malaysia
- Jasbir Dhaliwal
  - School of Information Technology, Monash University Malaysia, Subang Jaya, Selangor, Malaysia
- KokSheik Wong
  - School of Information Technology, Monash University Malaysia, Subang Jaya, Selangor, Malaysia
13
Kumar R, Sharma A, Alexiou A, Bilgrami AL, Kamal MA, Ashraf GM. DeePred-BBB: A Blood Brain Barrier Permeability Prediction Model With Improved Accuracy. Front Neurosci 2022; 16:858126. [PMID: 35592264 PMCID: PMC9112838 DOI: 10.3389/fnins.2022.858126] [Citation(s) in RCA: 40] [Impact Index Per Article: 13.3] [Received: 01/19/2022] [Accepted: 03/14/2022] [Indexed: 11/13/2022]
Abstract
The blood-brain barrier (BBB) is a selective and semipermeable boundary that maintains homeostasis inside the central nervous system (CNS). The BBB permeability of compounds is an important consideration during CNS-acting drug development and is difficult to formulate in a succinct manner. Clinical experiments are the most accurate method of measuring BBB permeability. However, they are time-consuming and labor-intensive. Therefore, numerous efforts have been made to predict the BBB permeability of compounds using computational methods. However, the accuracy of BBB permeability prediction models has always been an issue. To improve the accuracy of BBB permeability prediction, we applied deep learning and machine learning algorithms to a dataset of 3,605 diverse compounds. Each compound was encoded with 1,917 features comprising 1,444 physicochemical (1D and 2D) properties, 166 Molecular ACCess System (MACCS) fingerprints, and 307 substructure fingerprints. The prediction performance metrics of the developed models were compared and analyzed. The prediction accuracy of the deep neural network (DNN), one-dimensional convolutional neural network, and convolutional neural network by transfer learning was found to be 98.07, 97.44, and 97.61%, respectively. The best performing DNN-based model was selected for the development of the “DeePred-BBB” model, which can predict the BBB permeability of compounds using their simplified molecular input line entry system (SMILES) notations. It could be useful for screening compounds based on their BBB permeability at the preliminary stages of drug development. DeePred-BBB is available at https://github.com/12rajnish/DeePred-BBB.
Affiliation(s)
- Rajnish Kumar
  - Amity Institute of Biotechnology, Amity University Uttar Pradesh, Lucknow, India
- Anju Sharma
  - Department of Applied Science, Indian Institute of Information Technology Allahabad, Prayagraj, India
- Athanasios Alexiou
  - Department of Science and Engineering, Novel Global Community Educational Foundation, Hebersham, NSW, Australia
  - AFNP Med Austria, Vienna, Austria
- Anwar L. Bilgrami
  - Department of Entomology, Rutgers, The State University of New Jersey, New Brunswick, NJ, United States
  - Deanship of Scientific Research, King Abdulaziz University, Jeddah, Saudi Arabia
- Mohammad Amjad Kamal
  - Institutes for Systems Genetics, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
  - King Fahd Medical Research Center, King Abdulaziz University, Jeddah, Saudi Arabia
  - Department of Pharmacy, Faculty of Allied Health Sciences, Daffodil International University, Dhaka, Bangladesh
  - Enzymoics, Hebersham, NSW, Australia
  - Novel Global Community Educational Foundation, Hebersham, NSW, Australia
- Ghulam Md Ashraf
  - Pre-Clinical Research Unit, King Fahd Medical Research Center, King Abdulaziz University, Jeddah, Saudi Arabia
  - Department of Medical Laboratory Sciences, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
  - *Correspondence: Ghulam Md Ashraf
14
Verma P, Shakya M. Machine learning model for predicting Major Depressive Disorder using RNA-Seq data: optimization of classification approach. Cogn Neurodyn 2022; 16:443-453. [PMID: 35401859 PMCID: PMC8934793 DOI: 10.1007/s11571-021-09724-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2021] [Revised: 08/28/2021] [Accepted: 09/12/2021] [Indexed: 10/20/2022]
Abstract
Among human brain disorders, Major Depressive Disorder (MDD) is regarded as a potentially lethal disease that can lead to suicidal behavior. Physical detection of MDD patients is less precise, but machine learning can aid in improved classification of the disease. The present research used RNA-seq data from three classes to identify differentially expressed genes (DEGs) and then trained a random forest (RF) machine learning model on key gene data. The three classes in the sample are 29 CON (sudden-death healthy controls), 21 MDD-S (MDD suicides), and 9 MDD (non-suicide MDD). With PCA, 99 key genes were obtained, accounting for 47.1% of the data variability. Training the model on these 99 genes indicated improved classification. The RF classification model has an accuracy of 61.11% on test data and 97.56% on training data. The RF method also offered greater accuracy than the KNN method. The 99 genes were annotated using the DAVID and ClueGO packages. Some of the important pathways and functions observed in the study were glutamatergic synapse, GABA receptor activation, long-term synaptic depression, and morphine addiction. Of the 99 genes, four, namely DLGAP1, GNG2, GRIA1, and GRIA4, were found to be predominantly involved in the glutamatergic synapse pathway. Another substantial link was observed in GABA receptor activation, involving two genes, GABBR2 and GNG2. The genes found responsible for long-term synaptic depression were GRIA1, MAPT, and PTEN. The morphine addiction pathway comprised three genes, namely GABBR2, GNG2, and PDE4D. For massive datasets, this approach may serve as a gold standard. The cases of CON, MDD, and MDD-S are physically distinct. There was dysregulation in the expression level of 12 genes.
These 12 genes may act as biomarkers for Major Depressive Disorder and open up a new path for further study of depressed subjects.
Affiliation(s)
- Pragya Verma
  - Department of Mathematics, Bioinformatics and Computer Applications, Maulana Azad National Institute of Technology, Bhopal, Madhya Pradesh, 462003 India
- Madhvi Shakya
  - Department of Mathematics, Bioinformatics and Computer Applications, Maulana Azad National Institute of Technology, Bhopal, Madhya Pradesh, 462003 India
15
Makimoto H. Artificial Intelligence in Medicine (AIM) for Cardiac Arrest. Artif Intell Med 2022. [DOI: 10.1007/978-3-030-64573-1_175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
16
Wisgrill L, Werner P, Fortino V, Fyhrquist N. AIM in Allergy. Artif Intell Med 2022. [DOI: 10.1007/978-3-030-64573-1_90] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
17
Katsaouni N, Tashkandi A, Wiese L, Schulz MH. Machine learning based disease prediction from genotype data. Biol Chem 2021; 402:871-885. [PMID: 34218544 DOI: 10.1515/hsz-2021-0109] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/10/2021] [Accepted: 06/15/2021] [Indexed: 12/16/2022]
Abstract
Using results from genome-wide association studies to understand complex traits is a current challenge. Here we review how genotype data can be used with different machine learning (ML) methods to predict phenotype occurrence and severity. We discuss common feature encoding schemes and how studies handle the often small number of samples compared to the huge number of variants. We compare the ML methods being applied, including recent results using deep neural networks. Further, we review the application of methods for feature explanation and interpretation.
Affiliation(s)
- Nikoletta Katsaouni
  - Institute for Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Araek Tashkandi
  - Institute of Computer Sciences and Engineering, University of Jeddah, 21959 Jeddah, Saudi Arabia
- Lena Wiese
  - Institute of Computer Science, Goethe University, 60629 Frankfurt am Main, Germany
- Marcel H Schulz
  - Institute for Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
  - German Center for Cardiovascular Research (DZHK), Partner Site RheinMain, 60590 Frankfurt am Main, Germany
  - Cardio-Pulmonary Institute, Goethe University Hospital, Frankfurt am Main, Germany
18
Varma M, Paskov KM, Chrisman BS, Sun MW, Jung JY, Stockham NT, Washington PY, Wall DP. A maximum flow-based network approach for identification of stable noncoding biomarkers associated with the multigenic neurological condition, autism. BioData Min 2021; 14:28. [PMID: 33941233 PMCID: PMC8091705 DOI: 10.1186/s13040-021-00262-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2020] [Accepted: 04/20/2021] [Indexed: 12/05/2022]
Abstract
Background: Machine learning approaches for predicting disease risk from high-dimensional whole genome sequence (WGS) data often result in unstable models that can be difficult to interpret, limiting the identification of putative sets of biomarkers. Here, we design and validate a graph-based methodology based on maximum flow, which leverages the presence of linkage disequilibrium (LD) to identify stable sets of variants associated with complex multigenic disorders. Results: We apply our method to a previously published logistic regression model trained to identify variants in simple repeat sequences associated with autism spectrum disorder (ASD); this L1-regularized model exhibits high predictive accuracy yet demonstrates great variability in the features selected from over 230,000 possible variants. In order to improve model stability, we extract the variants assigned non-zero weights in each of 5 cross-validation folds and then assemble the five sets of features into a flow network subject to LD constraints. The maximum flow formulation allowed us to identify 55 variants, which we show to be more stable than the features identified by the original classifier. Conclusion: Our method allows for the creation of machine learning models that can identify predictive variants. Our results help pave the way towards biomarker-based diagnosis methods for complex genetic disorders. Supplementary Information: The online version contains supplementary material available at 10.1186/s13040-021-00262-x.
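The instability described in this abstract, where an L1-regularized model selects different variants in each cross-validation fold, can be quantified with a simple overlap measure. The sketch below is not the maximum-flow formulation with LD constraints from the paper; it is a plain-Python illustration that extracts the non-zero-weight features per fold and reports their mean pairwise Jaccard similarity. All function names, variant identifiers, and weights are hypothetical.

```python
from itertools import combinations


def selected_features(weights, tol=0.0):
    """Return the names of features whose absolute weight exceeds tol."""
    return {name for name, w in weights.items() if abs(w) > tol}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(feature_sets):
    """Average Jaccard similarity over all pairs of per-fold feature sets."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("at least two feature sets are required")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical per-fold model weights (variant name -> coefficient).
fold_weights = [
    {"var_a": 0.8, "var_b": -0.2, "var_c": 0.0},
    {"var_a": 0.7, "var_b": 0.0, "var_c": 0.1},
    {"var_a": 0.9, "var_b": -0.3, "var_c": 0.0},
]

fold_sets = [selected_features(w) for w in fold_weights]
stability = mean_pairwise_jaccard(fold_sets)
print(f"mean pairwise Jaccard = {stability:.3f}")
```

A value near 1 means the folds agree on which variants matter, while a value near 0 signals the kind of fold-to-fold variability the abstract describes.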
Affiliation(s)
- Maya Varma
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Kelley M Paskov
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Min Woo Sun
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Jae-Yoon Jung
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
  - Department of Pediatrics, Stanford University, Stanford, CA, USA
- Nate T Stockham
  - Department of Neuroscience, Stanford University, Stanford, CA, USA
- Dennis P Wall
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
  - Department of Pediatrics, Stanford University, Stanford, CA, USA
19
Abstract
While asthma has a strong genetic component, our current ability to systematically understand and predict asthma risk remains low, despite over a hundred genetic associations. The reasons for this unfilled gap range from technical limitations of current approaches to fundamental deficiencies in the way we understand asthma. These are discussed in the context of genomic advances.
Affiliation(s)
- Mayank Bansal
  - CSIR-Institute of Genomics and Integrative Biology, Delhi, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
- Mayank Garg
  - CSIR-Institute of Genomics and Integrative Biology, Delhi, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
- Anurag Agrawal
  - CSIR-Institute of Genomics and Integrative Biology, Delhi, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
20
Wisgrill L, Werner P, Fortino V, Fyhrquist N. AIM in Allergy. Artif Intell Med 2021. [DOI: 10.1007/978-3-030-58080-3_90-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
21
Makimoto H. Artificial Intelligence in Medicine (AIM) for Cardiac Arrest. Artif Intell Med 2021. [DOI: 10.1007/978-3-030-58080-3_175-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/03/2023]