1
Jiang X, Xu B, Li Q, Zhao YE. Association between Plasma Metabolite Levels and Myopia: A 2-Sample Mendelian Randomization Study. Ophthalmology Science 2025; 5:100699. [PMID: 40124309 PMCID: PMC11930157 DOI: 10.1016/j.xops.2024.100699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2024] [Revised: 12/15/2024] [Accepted: 12/26/2024] [Indexed: 03/25/2025]
Abstract
Purpose: The role of plasma metabolites in myopia remains unclear; previous studies were mostly observational and limited by various factors. This study aims to investigate the causal relationship between plasma metabolites and myopia using 2-sample Mendelian randomization (MR). Design: A 2-sample MR study. Subjects and Participants: This study analyzed 1091 plasma metabolites and 309 metabolite ratios measured in 8299 individuals from the Canadian Longitudinal Study on Aging cohort. Summary statistics for myopia were obtained from the UK Biobank, encompassing 37 362 cases and 460 536 controls. Methods: Causal effect estimates were primarily derived using the inverse variance weighting (IVW) method and the constrained maximum likelihood and model averaging-based MR method. Statistical significance for the MR effect estimate was defined as a false discovery rate (FDR) of <0.05. Additionally, we used the MR Steiger directionality test to examine whether the exposure was directionally causal for the outcome. Furthermore, 4 supplementary methods were used for analysis: weighted median, MR-Egger, simple mode, and weighted mode. Main Outcome Measures: Genetic causal association between plasma metabolites and myopia. Results: The IVW analysis indicated that elevated levels of 1-arachidonoyl-GPE (20:4n6) (P_FDR = 5.80E-06), linoleoyl-arachidonoyl-glycerol (18:2/20:4) [1] (P_FDR = 2.24E-06), and linoleoyl-arachidonoyl-glycerol (18:2/20:4) [2] (P_FDR = 0.0242) have a protective effect on myopia. Elevated levels of 4 plasma metabolite ratios, including the phosphate to linoleoyl-arachidonoyl-glycerol (18:2/20:4) [2] ratio (P_FDR = 0.0029), citrulline to dimethylarginine (SDMA + ADMA) ratio (P_FDR = 0.0207), oleoyl-linoleoyl-glycerol (18:1/18:2) [2] to linoleoyl-arachidonoyl-glycerol (18:2/20:4) [1] ratio (P_FDR = 0.0230), and retinol (vitamin A) to linoleoyl-arachidonoyl-glycerol (18:2/20:4) [2] ratio (P_FDR = 0.0230), were significantly associated with a higher risk of myopia. Conclusions: This study provides evidence of a causal relationship between specific plasma metabolites and myopia, highlighting potential therapeutic targets and contributing to the understanding of myopia's etiology. Future research should include diverse populations to enhance the generalizability of these findings. Financial Disclosures: The author(s) have no proprietary or commercial interest in any materials discussed in this article.
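The IVW estimator named in the Methods can be illustrated with a minimal sketch. It assumes the common fixed-effect form with first-order weights, which the abstract does not spell out; the function name and the three-variant summary statistics are hypothetical and are not taken from the study.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted (IVW) MR estimate.

    Each variant contributes a Wald ratio beta_outcome / beta_exposure,
    weighted by beta_exposure**2 / se_outcome**2 (first-order weights).
    This is an assumed, textbook form, not the authors' exact pipeline.
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    std_error = math.sqrt(1.0 / total)
    return estimate, std_error

# Hypothetical summary statistics for three genetic variants (illustrative only).
bx = [0.10, 0.20, 0.15]    # variant -> metabolite effects
by = [0.05, 0.10, 0.075]   # variant -> myopia effects (log-odds scale)
se = [0.01, 0.02, 0.015]   # standard errors of the outcome effects
est, se_est = ivw_estimate(bx, by, se)
```

Because every hypothetical Wald ratio here equals 0.5, the pooled estimate is also 0.5; with real data the ratios differ and the weights decide how much each variant counts.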
Affiliation(s)
- Xiaohui Jiang
  - Eye Hospital of Wenzhou Medical University at Hangzhou, Hangzhou, China
- Boyue Xu
  - Eye Hospital, School of Ophthalmology & Optometry, Wenzhou Medical University, Wenzhou, China
- Qiyuan Li
  - Eye Hospital of Wenzhou Medical University at Hangzhou, Hangzhou, China
- Yun-e Zhao
  - Eye Hospital of Wenzhou Medical University at Hangzhou, Hangzhou, China
2
Ghavidel A, Pazos P. Machine learning (ML) techniques to predict breast cancer in imbalanced datasets: a systematic review. J Cancer Surviv 2025; 19:270-294. [PMID: 37749361 DOI: 10.1007/s11764-023-01465-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2023] [Accepted: 09/09/2023] [Indexed: 09/27/2023]
Abstract
Knowledge discovery in databases (KDD) is crucial in analyzing data to extract valuable insights. In medical outcome prediction, KDD is increasingly applied, particularly in diseases with high incidence, mortality, and costs, like cancer. ML techniques can develop more accurate predictive models for cancer patients' clinical outcomes, aiding informed healthcare decision-making. However, cancer prediction modeling faces challenges because the datasets are imbalanced: a small minority category of patients with a cancer diagnosis sits alongside a large majority category of cancer-free patients. Imbalanced datasets pose statistical hurdles such as bias and overfitting when developing accurate prediction models. This systematic review focuses on breast cancer prediction articles published from 2008 to 2023. The objective is to examine ML methods that address the imbalanced data problem in breast cancer prediction across three critical steps of KDD: preprocessing, data mining, and interpretation. This work synthesizes prior research in ML methods for breast cancer prediction. The findings help identify effective preprocessing strategies, including balancing and feature selection methods, robust predictive models, and evaluation metrics for those models. The study aims to inform healthcare providers and researchers about effective techniques for accurate breast cancer prediction.
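One of the simplest balancing strategies in the preprocessing step is random oversampling of the minority class. The sketch below is a generic illustration under that assumption; the review does not single out this method, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        # Rows of this class in the original data; duplicates are drawn from here.
        pool = [x for x, y in zip(features, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

# Toy data: four cancer-free rows (0) and one cancer row (1).
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
```

Resampling of this kind is normally applied to the training split only, so that duplicated rows cannot leak into the evaluation data.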
Affiliation(s)
- Arman Ghavidel
  - Engineering Management and Systems Engineering, Old Dominion University, Norfolk, VA, USA
- Pilar Pazos
  - Engineering Management and Systems Engineering, Old Dominion University, Norfolk, VA, USA
3
Tamal M, Althobaiti M, Alhashim M, Alsanea M, Hegazi TM, Deriche M, Alhashem AM. Radiomic features based automatic classification of CT lung findings for COVID-19 patients. Biomed Phys Eng Express 2024; 11:015012. [PMID: 39530647 DOI: 10.1088/2057-1976/ad9157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2024] [Accepted: 11/12/2024] [Indexed: 11/16/2024]
Abstract
Introduction. The lung CT images of COVID-19 patients can typically be characterized by three different findings: ground-glass opacity (GGO), consolidation and pleural effusion. GGOs have been shown to precede consolidations and have a different, heterogeneous appearance. Conventional severity scoring uses only the total area of lung involvement and ignores the appearance of the affected regions. This study proposes a baseline for selecting heterogeneity/radiomic features that can distinguish these three pathological lung findings. Methods. Four approaches were implemented to select features from a pool of 44 features. The first is a manual feature selection method. The rest are automatic feature selection methods based on a Genetic Algorithm (GA) coupled with (1) K-Nearest-Neighbor (GA-KNN), (2) binary-decision-tree (GA-BDT) and (3) Artificial-Neural-Network (GA-ANN). For validation, an ANN was trained using the selected features and tested on a completely independent data set. Results. Manual selection of nine radiomic features was found to provide the most accurate results, with the highest sensitivity, specificity and accuracy (85.7% overall accuracy and 0.90 area under the receiver operating characteristic curve), followed by GA-BDT, GA-KNN and GA-ANN (accuracy 78%, 77.5% and 76.8%, respectively). Conclusion. The nine manually selected radiomic features can be used for accurate severity scoring, allowing the clinician to plan more effective personalized treatment. They can also be useful for monitoring the progression of COVID-19 and the response to therapy in clinical trials.
Affiliation(s)
- Mahbubunnabi Tamal
  - Department of Biomedical Engineering, College of Engineering, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
- Murad Althobaiti
  - Department of Biomedical Engineering, College of Engineering, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
- Maryam Alhashim
  - Department of Medical Physics, King Fahad Specialist Hospital Dammam, Dammam 32253, Saudi Arabia
  - Department of Radiology, College of Medicine, Imam Abdulrahman Bin Faisal University, PO Box 1982, Dammam 31441, Saudi Arabia
- Maram Alsanea
  - Department of Medical Physics, King Fahad Specialist Hospital Dammam, Dammam 32253, Saudi Arabia
- Tarek M Hegazi
  - Department of Radiology, College of Medicine, Imam Abdulrahman Bin Faisal University, PO Box 1982, Dammam 31441, Saudi Arabia
- Mohamed Deriche
  - Artificial Intelligence Research Centre, AIRC, Ajman University, United Arab Emirates
- Abdullah M Alhashem
  - Neuroradiology Consultant, Radiology Department, Prince Sultan Military Medical City, Riyadh, Saudi Arabia
4
Ostojic D, Lalousis PA, Donohoe G, Morris DW. The challenges of using machine learning models in psychiatric research and clinical practice. Eur Neuropsychopharmacol 2024; 88:53-65. [PMID: 39232341 DOI: 10.1016/j.euroneuro.2024.08.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Revised: 08/06/2024] [Accepted: 08/12/2024] [Indexed: 09/06/2024]
Abstract
To understand the complex nature of heterogeneous psychiatric disorders, scientists and clinicians must draw on a wide range of clinical, endophenotypic, neuroimaging, genomic, and environmental data to uncover the biological mechanisms of psychiatric illness before this knowledge is applied in clinical settings. Machine learning (ML) is an automated process that can detect patterns in large multidimensional datasets and can supersede conventional statistical methods because it can detect both linear and non-linear relationships. Owing to this advantage, ML has the potential to enhance our understanding and improve the diagnosis, prognosis and treatment of psychiatric disorders. The current review provides an in-depth examination of, and offers practical guidance for, the challenges encountered in the application of ML models in psychiatric research and clinical practice. These challenges include the curse of dimensionality, data quality, the 'black box' problem, hyperparameter tuning, external validation, class imbalance, and data representativeness. These challenges are particularly critical in psychiatry, as researchers are expected to encounter them during the stages of ML model development and deployment. We detail practical solutions and best practices to effectively mitigate the outlined challenges. These recommendations have the potential to improve the reliability and interpretability of ML models in psychiatry.
Affiliation(s)
- Dijana Ostojic
  - School of Biological and Chemical Sciences and School of Psychology, Centre for Neuroimaging, Cognition and Genomics (NICOG), University of Galway, Ireland
- Paris Alexandros Lalousis
  - Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom; Section for Precision Psychiatry, Department of Psychiatry and Psychotherapy, Ludwig-Maximilian-University Munich, Munich, Germany
- Gary Donohoe
  - School of Biological and Chemical Sciences and School of Psychology, Centre for Neuroimaging, Cognition and Genomics (NICOG), University of Galway, Ireland
- Derek W Morris
  - School of Biological and Chemical Sciences and School of Psychology, Centre for Neuroimaging, Cognition and Genomics (NICOG), University of Galway, Ireland
5
Zaharieva MS, Salvadori EA, Messinger DS, Visser I, Colonnesi C. Automated facial expression measurement in a longitudinal sample of 4- and 8-month-olds: Baby FaceReader 9 and manual coding of affective expressions. Behav Res Methods 2024; 56:5709-5731. [PMID: 38273072 PMCID: PMC11335827 DOI: 10.3758/s13428-023-02301-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/20/2023] [Indexed: 01/27/2024]
Abstract
Facial expressions are among the earliest behaviors infants use to express emotional states, and are crucial to preverbal social interaction. Manual coding of infant facial expressions, however, is laborious and limits replicability. Recent developments in computer vision have advanced automated facial expression analyses in adults, providing reproducible results at a lower time investment. Baby FaceReader 9 is commercially available software for automated measurement of infant facial expressions, but has received little validation. We compared Baby FaceReader 9 output to manual micro-coding of positive, negative, or neutral facial expressions in a longitudinal dataset of 58 infants at 4 and 8 months of age during naturalistic face-to-face interactions with the mother, father, and an unfamiliar adult. Baby FaceReader 9's global emotional valence formula yielded reasonable classification accuracy (AUC = .81) for discriminating manually coded positive from negative/neutral facial expressions; however, the discrimination of negative from neutral facial expressions was not reliable (AUC = .58). Automatically detected a priori action unit (AU) configurations for distinguishing positive from negative facial expressions based on existing literature were also not reliable. A parsimonious approach using only automatically detected smiling (AU12) yielded good performance for discriminating positive from negative/neutral facial expressions (AUC = .86). Likewise, automatically detected brow lowering (AU3+AU4) reliably distinguished neutral from negative facial expressions (AUC = .79). These results provide initial support for the use of selected automatically detected individual facial actions to index positive and negative affect in young infants, but cast doubt on the accuracy of complex a priori formulas.
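The AUC values quoted above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch of that rank-based definition follows; the scores and manual codes are hypothetical and do not come from the study.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical automated valence scores and manual codes (1 = positive expression).
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)  # 8 of 9 pairs are ordered correctly
```

An AUC of .50 corresponds to chance-level ordering, which is why the reported .58 for negative versus neutral expressions is described as unreliable.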
Affiliation(s)
- Martina S Zaharieva
  - Department of Developmental Psychology, Faculty of Social and Behavioural Sciences, University of Amsterdam, Nieuwe Achtergracht 129b, 1001 NK, Amsterdam, The Netherlands
  - Developmental Psychopathology Unit, Development and Education, Faculty of Social and Behavioural Sciences, Research Institute of Child, University of Amsterdam, Nieuwe Achtergracht 129b, 1001 NK, Amsterdam, The Netherlands
  - Yield, Research Priority Area, University of Amsterdam, Amsterdam, The Netherlands
- Eliala A Salvadori
  - Developmental Psychopathology Unit, Development and Education, Faculty of Social and Behavioural Sciences, Research Institute of Child, University of Amsterdam, Nieuwe Achtergracht 129b, 1001 NK, Amsterdam, The Netherlands
  - Yield, Research Priority Area, University of Amsterdam, Amsterdam, The Netherlands
- Daniel S Messinger
  - Department of Psychology, University of Miami, Coral Gables, FL, USA
  - Department of Pediatrics, University of Miami, Coral Gables, FL, USA
  - Department of Music Engineering, University of Miami, Coral Gables, FL, USA
  - Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL, USA
- Ingmar Visser
  - Department of Developmental Psychology, Faculty of Social and Behavioural Sciences, University of Amsterdam, Nieuwe Achtergracht 129b, 1001 NK, Amsterdam, The Netherlands
  - Yield, Research Priority Area, University of Amsterdam, Amsterdam, The Netherlands
- Cristina Colonnesi
  - Developmental Psychopathology Unit, Development and Education, Faculty of Social and Behavioural Sciences, Research Institute of Child, University of Amsterdam, Nieuwe Achtergracht 129b, 1001 NK, Amsterdam, The Netherlands
  - Yield, Research Priority Area, University of Amsterdam, Amsterdam, The Netherlands
6
Tran TO, Vo TH, Le NQK. Omics-based deep learning approaches for lung cancer decision-making and therapeutics development. Brief Funct Genomics 2024; 23:181-192. [PMID: 37519050 DOI: 10.1093/bfgp/elad031] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 05/16/2023] [Revised: 07/04/2023] [Accepted: 07/13/2023] [Indexed: 08/01/2023] Open
Abstract
Lung cancer has been the most common cancer and the leading cause of cancer deaths globally. Besides clinicopathological observations and traditional molecular tests, the advent of robust and scalable techniques for nucleic acid analysis has revolutionized biological research and medical practice in lung cancer treatment. In response to the demand for minimally invasive procedures and to technology development over the past decade, many types of multi-omics data at various genome levels have been generated. As omics data grow, artificial intelligence models, particularly deep learning, have become prominent in developing more rapid and effective methods that could improve lung cancer diagnosis, prognosis and treatment strategy. This decade has seen genome-based deep learning models thrive in various lung cancer tasks, including cancer prediction, subtype classification, prognosis estimation, molecular signature identification, treatment response prediction and biomarker development. In this study, we summarized available data sources for deep-learning-based lung cancer mining and provided an update on recent deep learning models in lung cancer genomics. Subsequently, we reviewed the current issues and discussed future research directions of deep-learning-based lung cancer genomics research.
Affiliation(s)
- Thi-Oanh Tran
  - International Ph.D. Program in Cell Therapy and Regenerative Medicine, College of Medicine, Taipei Medical University, No 250 Wuxing Street, 110, Taipei, Taiwan
  - AIBioMed Research Group, Taipei Medical University, No 250 Wuxing Street, 110, Taipei, Taiwan
  - Hematology and Blood Transfusion Center, Bach Mai Hospital, No 78 Giai Phong Street, Hanoi, Viet Nam
- Thanh Hoa Vo
  - Department of Science, School of Science and Computing, South East Technological University, Waterford X91 K0EK, Ireland
  - Pharmaceutical and Molecular Biotechnology Research Center (PMBRC), South East Technological University, Waterford X91 K0EK, Ireland
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan
  - AIBioMed Research Group, Taipei Medical University, No 250 Wuxing Street, 110, Taipei, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, 252 Wuxing Street, 110, Taipei, Taiwan
7
Wild R, Sozio E, Margiotta RG, Dellai F, Acquasanta A, Del Ben F, Tascini C, Curcio F, Laio A. Maximally informative feature selection using Information Imbalance: Application to COVID-19 severity prediction. Sci Rep 2024; 14:10744. [PMID: 38730063 PMCID: PMC11087653 DOI: 10.1038/s41598-024-61334-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Accepted: 05/04/2024] [Indexed: 05/12/2024] Open
Abstract
Clinical databases typically include, for each patient, many heterogeneous features, for example blood exams, the clinical history before the onset of the disease, the evolution of the symptoms, the results of imaging exams, and many others. We here propose to exploit a recently developed statistical approach, the Information Imbalance, to compare different subsets of patient features and automatically select the set of features that is maximally informative for a given clinical purpose, especially in minority classes. We adapt the Information Imbalance approach to work in a clinical framework, where patient features are often categorical and are generally available only for a fraction of the patients. We apply this algorithm to a data set of ∼1300 patients treated for COVID-19 in Udine hospital before October 2021. Using this approach, we find sets of features that, used in combination, are maximally informative of the clinical fate and of the severity of the disease. The optimal number of features, which is determined automatically, turns out to be between 10 and 15. These features can be measured at admission. The approach can also be used if the features are available only for a fraction of the patients, does not require imputation and, importantly, is able to automatically select features with small inter-feature correlation. Clinical insights deriving from this study are also discussed.
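The abstract does not define the Information Imbalance. As background, the sketch below follows the rank-based definition from the earlier literature on the method as we understand it: Δ(A→B) is 2/N times the average rank, measured in feature space B, of each point's nearest neighbour in feature space A. It assumes complete numeric data and Euclidean distances, so it does not reproduce the clinical adaptation for categorical or missing features described above; all names and data are hypothetical.

```python
import math

def information_imbalance(space_a, space_b):
    """Delta(A -> B) = (2 / N) * mean rank in B of each point's nearest neighbour in A."""
    n = len(space_a)
    total_rank = 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Nearest neighbour of point i when distances are measured in space A.
        nearest = min(others, key=lambda j: math.dist(space_a[i], space_a[j]))
        # Rank (1 = closest) of that same neighbour when distances are measured in space B.
        by_b = sorted(others, key=lambda j: math.dist(space_b[i], space_b[j]))
        total_rank += by_b.index(nearest) + 1
    return 2.0 * total_rank / (n * n)

# Hypothetical patients described by two features, and by the first feature alone.
full = [(0.0, 0.0), (1.0, 5.0), (2.0, 0.5), (3.0, 5.5), (4.0, 1.5)]
x_only = [(p[0],) for p in full]
delta_full_to_x = information_imbalance(full, x_only)
delta_x_to_full = information_imbalance(x_only, full)
delta_self = information_imbalance(full, full)  # lower bound 2 / N
```

Values near the lower bound 2/N mean that space A predicts neighbourhoods in space B almost perfectly, while values near 1 mean it carries little information about B. Comparing the two directions is what allows feature subsets to be ranked.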
Affiliation(s)
- Romina Wild
  - International School for Advanced Studies (SISSA), Via Bonomea 265, 34136, Trieste, Italy
- Emanuela Sozio
  - Infectious Disease Unit, Azienda Sanitaria Universitaria Friuli Centrale (ASU FC), Via Pozzuolo 330, 33100, Udine, Italy
  - Department of Medicine (DAME), University of Udine, Via Palladio 8, 33100, Udine, Italy
- Riccardo G Margiotta
  - International School for Advanced Studies (SISSA), Via Bonomea 265, 34136, Trieste, Italy
- Fabiana Dellai
  - Infectious Disease Unit, Azienda Sanitaria Universitaria Friuli Centrale (ASU FC), Via Pozzuolo 330, 33100, Udine, Italy
- Angela Acquasanta
  - Infectious Disease Unit, Azienda Sanitaria Universitaria Friuli Centrale (ASU FC), Via Pozzuolo 330, 33100, Udine, Italy
- Fabio Del Ben
  - Department of Medicine (DAME), University of Udine, Via Palladio 8, 33100, Udine, Italy
- Carlo Tascini
  - Infectious Disease Unit, Azienda Sanitaria Universitaria Friuli Centrale (ASU FC), Via Pozzuolo 330, 33100, Udine, Italy
  - Department of Medicine (DAME), University of Udine, Via Palladio 8, 33100, Udine, Italy
- Francesco Curcio
  - Infectious Disease Unit, Azienda Sanitaria Universitaria Friuli Centrale (ASU FC), Via Pozzuolo 330, 33100, Udine, Italy
  - Department of Medicine (DAME), University of Udine, Via Palladio 8, 33100, Udine, Italy
- Alessandro Laio
  - International School for Advanced Studies (SISSA), Via Bonomea 265, 34136, Trieste, Italy
  - The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34151, Trieste, Italy
8
Geuenich MJ, Gong DW, Campbell KR. The impacts of active and self-supervised learning on efficient annotation of single-cell expression data. Nat Commun 2024; 15:1014. [PMID: 38307875 PMCID: PMC10837127 DOI: 10.1038/s41467-024-45198-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2023] [Accepted: 01/16/2024] [Indexed: 02/04/2024] Open
Abstract
A crucial step in the analysis of single-cell data is annotating cells to cell types and states. While a myriad of approaches has been proposed, manual labeling of cells to create training datasets remains tedious and time-consuming. In the field of machine learning, active and self-supervised learning methods have been proposed to improve the performance of a classifier while reducing both annotation time and label budget. However, the benefits of such strategies for single-cell annotation have yet to be evaluated in realistic settings. Here, we perform a comprehensive benchmarking of active and self-supervised labeling strategies across a range of single-cell technologies and cell type annotation algorithms. We quantify the benefits of active learning and self-supervised strategies in the presence of cell type imbalance and variable similarity. We introduce adaptive reweighting, a heuristic procedure tailored to single-cell data (including a marker-aware version) that shows competitive performance with existing approaches. In addition, we demonstrate that having prior knowledge of cell type markers improves annotation accuracy. Finally, we summarize our findings into a set of recommendations for those implementing cell type annotation procedures or platforms. An R package implementing the heuristic approaches introduced in this work may be found at https://github.com/camlab-bioml/leader.
Affiliation(s)
- Michael J Geuenich
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, M5G 1X5, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Dae-Won Gong
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, M5G 1X5, Canada
- Kieran R Campbell
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, M5G 1X5, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, M5S 3G3, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, M5T 3A1, Canada
  - Ontario Institute of Cancer Research, Toronto, ON, M5G 1M1, Canada
  - Vector Institute, Toronto, ON, M5G 1M1, Canada
9
Choi Y, Cha J, Choi S. Evaluation of penalized and machine learning methods for asthma disease prediction in the Korean Genome and Epidemiology Study (KoGES). BMC Bioinformatics 2024; 25:56. [PMID: 38308205 PMCID: PMC10837879 DOI: 10.1186/s12859-024-05677-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Accepted: 01/26/2024] [Indexed: 02/04/2024] Open
Abstract
BACKGROUND: Genome-wide association studies have successfully identified genetic variants associated with human disease. Various statistical approaches based on penalized and machine learning methods have recently been proposed for disease prediction. In this study, we evaluated the performance of several such methods for predicting asthma using the Korean Chip (KORV1.1) from the Korean Genome and Epidemiology Study (KoGES). RESULTS: First, single-nucleotide polymorphisms were selected via single-variant tests using logistic regression with adjustment for several epidemiological factors. Next, we evaluated the following methods for disease prediction: ridge, least absolute shrinkage and selection operator, elastic net, smoothly clipped absolute deviation, support vector machine, random forest, boosting, bagging, naïve Bayes, and k-nearest neighbor. Finally, we compared their predictive performance based on the area under the receiver operating characteristic curve, precision, recall, F1-score, Cohen's Kappa, balanced accuracy, error rate, Matthews correlation coefficient, and area under the precision-recall curve. Additionally, three oversampling algorithms were used to deal with imbalance problems. CONCLUSIONS: Our results show that penalized methods exhibit better predictive performance for asthma than machine learning methods. On the other hand, in the oversampling study, random forest and boosting methods overall showed better prediction performance than penalized methods.
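Several of the threshold-based metrics listed in the Results can be computed directly from a 2x2 confusion matrix. The sketch below uses the standard definitions; the function name and the counts are hypothetical and are not results from the study.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Common classification metrics from confusion-matrix counts.

    Assumes every row and column of the matrix is non-empty;
    otherwise some denominators below would be zero.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
        "error_rate": 1 - accuracy,
        "mcc": mcc,
        "kappa": kappa,
    }

# Hypothetical counts: 40 asthma cases and 160 controls.
metrics = binary_metrics(tp=30, fp=20, fn=10, tn=140)
```

With these counts the accuracy is 0.85, yet the F1-score, MCC and kappa are all noticeably lower, which illustrates why imbalance-aware metrics are reported alongside accuracy.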
Affiliation(s)
- Yongjun Choi
  - Department of Applied Artificial Intelligence, College of Computing, Hanyang University, 55 Hanyang-daehak-ro, Sangnok-gu, Ansan, 15588, South Korea
- Junho Cha
  - Department of Applied Artificial Intelligence, College of Computing, Hanyang University, 55 Hanyang-daehak-ro, Sangnok-gu, Ansan, 15588, South Korea
- Sungkyoung Choi
  - Department of Applied Artificial Intelligence, College of Computing, Hanyang University, 55 Hanyang-daehak-ro, Sangnok-gu, Ansan, 15588, South Korea
  - Department of Mathematical Data Science, College of Science and Convergence Technology, Hanyang University, 55 Hanyang-daehak-ro, Sangnok-gu, Ansan, 15588, South Korea
10
Wei J, Shen N, Shi C, Li N, Yin C, Feng Y, Lu H, Yang X, Zhou L. Exploration of Serum lipid levels during twin pregnancy. J Matern Fetal Neonatal Med 2023; 36:2254891. [PMID: 37710986 DOI: 10.1080/14767058.2023.2254891] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 08/28/2023] [Accepted: 08/29/2023] [Indexed: 09/16/2023]
Abstract
Objective: This study aims to characterize changes in serum lipid levels throughout twin pregnancies and explore the relationship between lipid levels and gestational diabetes mellitus (GDM) and hypertensive disorders complicating pregnancy (HDCP). Methods: We retrospectively studied 297 twin pregnancies of women who received regular prenatal care and delivered at the Beijing Obstetrics and Gynecology Hospital over a period of two years. Demographic and medical data of the participants were collected by questionnaires and medical record review. Serum lipid levels were measured in the first trimester (6-13 weeks), second trimester (24-28 weeks), and third trimester (34-37 weeks). A multivariate regression model was constructed to examine the association between lipid levels and pregnancy complications. A decision tree was used to explore the relationship between early serum lipid and glucose levels and GDM and HDCP in twin pregnancies. Results: Triglyceride (TG), total cholesterol (TC) and low-density lipoprotein cholesterol (LDL-C) levels increased significantly from the first trimester to the third trimester, whereas high-density lipoprotein cholesterol (HDL-C) decreased in the third trimester (p < 0.001). TC levels in the GDM and HDCP groups were significantly higher than those in the normal group in early pregnancy (p < 0.05 for both). In the second trimester, TG in the HDCP group was substantially higher than that in the normal group (p = 0.01). In the third trimester, LDL-C and HDL-C levels in the GDM group were significantly lower than those in the normal group (p < 0.05 for both). After adjusting for confounders, body mass index (BMI) was independently associated with GDM (odds ratio [OR] = 1.129, 95% confidence interval [CI]: 1.007-1.266) and HDCP (OR = 1.170, 95% CI: 1.031-1.329). The variation amplitude of HDL-C in the third trimester was related to the occurrence of GDM and HDCP (GDM: OR = 0.271, 95% CI: 0.095-0.778; HDCP: OR = 0.249, 95% CI: 0.075-0.823). TG and TC levels in dichorionic diamniotic (DCDA) twins were significantly higher than those in monochorionic diamniotic (MCDA) twins in the first trimester (TG: p < 0.05; TC: p < 0.05). In the decision tree model for GDM, first-trimester fasting blood glucose (FBG), TC, and pre-pregnancy BMI were identified as important nodes, while in the HDCP model, pre-pregnancy BMI, TC, and TG were key nodes. Conclusion: Serum lipid levels in twin pregnancies increase gradually during pregnancy. BMI is independently associated with the occurrence of GDM and HDCP. HDL-C may serve as a protective factor for GDM and HDCP. The predictive effect of early blood lipid levels on GDM and HDCP in twin pregnancy needs further study.
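The reported odds ratios and confidence intervals can be sanity-checked with a short calculation. The sketch assumes the intervals are Wald intervals, symmetric on the log-odds scale, which the abstract does not state; under that assumption the geometric midpoint of the interval should reproduce the reported OR. The function name is hypothetical.

```python
import math

def log_odds_from_ci(lower, upper, z=1.959964):
    """Recover the log-odds coefficient and its standard error from a 95% CI on an OR.

    Assumes a Wald interval: exp(beta - z * se) to exp(beta + z * se).
    """
    beta = (math.log(lower) + math.log(upper)) / 2
    std_error = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, std_error

# BMI and GDM, using the interval quoted in the Results (1.007-1.266).
beta_gdm, se_gdm = log_odds_from_ci(1.007, 1.266)
or_gdm = math.exp(beta_gdm)

# BMI and HDCP, using the interval quoted in the Results (1.031-1.329).
beta_hdcp, se_hdcp = log_odds_from_ci(1.031, 1.329)
or_hdcp = math.exp(beta_hdcp)
```

Both recovered odds ratios agree with the reported values of 1.129 and 1.170 to within rounding, which is consistent with the Wald assumption.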
Collapse
Affiliation(s)
- Jianxia Wei
- Department of Obstetrics, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing, China
- Beijing Maternal and Child Health Care Hospital, Beijing, China
- Nan Shen
- Department of Obstetrics, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing, China
- Beijing Maternal and Child Health Care Hospital, Beijing, China
- Cuixia Shi
- Department of Obstetrics, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing, China
- Beijing Maternal and Child Health Care Hospital, Beijing, China
- Na Li
- Department of Obstetrics, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing, China
- Beijing Maternal and Child Health Care Hospital, Beijing, China
- Chunnan Yin
- Department of Obstetrics, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing, China
- Beijing Maternal and Child Health Care Hospital, Beijing, China
- Yi Feng
- Department of Obstetrics, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing, China
- Beijing Maternal and Child Health Care Hospital, Beijing, China
- Hongyan Lu
- Department of Obstetrics, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing, China
- Beijing Maternal and Child Health Care Hospital, Beijing, China
- Xiaokui Yang
- Beijing Maternal and Child Health Care Hospital, Beijing, China
- Department of Human Reproductive Medicine, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing, China
- Li Zhou
- Department of Obstetrics, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing, China
- Beijing Maternal and Child Health Care Hospital, Beijing, China
11
Lu Y, Ren C, Wu C. In-Hospital Mortality Prediction Model for Critically Ill Older Adult Patients Transferred from the Emergency Department to the Intensive Care Unit. Risk Manag Healthc Policy 2023; 16:2555-2563. [PMID: 38024492 PMCID: PMC10676667 DOI: 10.2147/rmhp.s442138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Accepted: 11/15/2023] [Indexed: 12/01/2023]
Abstract
Purpose Studies on the prognosis of critically ill older adult patients admitted to the emergency department (ED) but requiring immediate admission to the intensive care unit (ICU) remain limited. This study aimed to develop an in-hospital mortality prediction model for critically ill older adult patients transferred from the ED to the ICU. Patients and Methods The training cohort was taken from the Medical Information Mart for Intensive Care IV (version 2.2) database, and the external validation cohort was taken from the Affiliated Dongyang Hospital of Wenzhou Medical University. In the training cohort, class balance was addressed using Random Over Sampling Examples (ROSE). Univariate and multivariate Cox regression analyses were performed to identify independent risk factors. These were then integrated into the predictive nomogram. In the validation cohort, the predictive performance of the nomogram was evaluated using the area under the curve (AUC) of the receiver operating characteristic curve, calibration curve, clinical utility decision curve analysis (DCA), and clinical impact curve (CIC). Results In the ROSE-balanced training cohort, univariate and multivariate Cox regression analysis identified that age, sex, Glasgow coma scale score, malignant cancer, sepsis, use of mechanical ventilation, use of vasoactive agents, white blood cells, potassium, and creatinine were independent predictors of in-hospital mortality in critically ill older adult patients, and were included in the nomogram. The nomogram showed good predictive performance in the ROSE-balanced training cohort (AUC [95% confidence interval]: 0.792 [0.783-0.801]) and validation cohort (AUC [95% confidence interval]: 0.780 [0.727-0.834]). The calibration curves were well-fitted. DCA and CIC demonstrated that the nomogram has good clinical application value. 
Conclusion This study developed a predictive model for early prediction of in-hospital mortality in critically ill older adult patients transferred from the ED to the ICU, which was validated by external data and has good predictive performance.
Affiliation(s)
- Yan Lu
- Clinical Laboratory, Affiliated Dongyang Hospital of Wenzhou Medical University, Dongyang, Zhejiang, 322100, People’s Republic of China
- Chaoxiang Ren
- Clinical Laboratory, Affiliated Dongyang Hospital of Wenzhou Medical University, Dongyang, Zhejiang, 322100, People’s Republic of China
- Chaolong Wu
- Clinical Laboratory, Affiliated Dongyang Hospital of Wenzhou Medical University, Dongyang, Zhejiang, 322100, People’s Republic of China
12
Oei CW, Ng EYK, Ng MHS, Tan RS, Chan YM, Chan LG, Acharya UR. Explainable Risk Prediction of Post-Stroke Adverse Mental Outcomes Using Machine Learning Techniques in a Population of 1780 Patients. SENSORS (BASEL, SWITZERLAND) 2023; 23:7946. [PMID: 37766004 PMCID: PMC10538068 DOI: 10.3390/s23187946] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 09/13/2023] [Accepted: 09/15/2023] [Indexed: 09/29/2023]
Abstract
Post-stroke depression and anxiety, collectively known as post-stroke adverse mental outcome (PSAMO), are common sequelae of stroke. About 30% of stroke survivors develop depression and about 20% develop anxiety. Stroke survivors with PSAMO have poorer health outcomes, with higher mortality and greater functional disability. In this study, we aimed to develop a machine learning (ML) model to predict the risk of PSAMO. We retrospectively studied 1780 patients with stroke who were divided into PSAMO vs. no PSAMO groups based on the results of validated depression and anxiety questionnaires. The features collected included demographic and sociological data, quality of life scores, stroke-related information, medical and medication history, and comorbidities. Recursive feature elimination was used to select the features, which were then fed in parallel to eight ML algorithms for training and testing. Bayesian optimization was used for hyperparameter tuning. Shapley additive explanations (SHAP), an explainable AI (XAI) method, was applied to interpret the model. The best-performing ML algorithm was the gradient-boosted tree, which attained 74.7% binary classification accuracy. Feature importance calculated by SHAP produced a ranked list of the features that contributed most to the prediction, and these were consistent with the findings of prior clinical studies. Some of these factors were modifiable and potentially amenable to intervention at early stages of stroke to reduce the incidence of PSAMO.
Affiliation(s)
- Chien Wei Oei
- Management Information Department, Office of Clinical Epidemiology, Analytics and kNowledge (OCEAN), Tan Tock Seng Hospital, Singapore 308433, Singapore
- School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore 639798, Singapore
- Eddie Yin Kwee Ng
- School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore 639798, Singapore
- Matthew Hok Shan Ng
- Rehabilitation Research Institute of Singapore, Nanyang Technological University, Singapore 308232, Singapore
- Ru-San Tan
- National Heart Centre Singapore, Singapore 169609, Singapore
- Duke-NUS Medical School, Singapore 169857, Singapore
- Yam Meng Chan
- Department of General Surgery, Vascular Surgery Service, Tan Tock Seng Hospital, Singapore 308433, Singapore
- Lai Gwen Chan
- Department of Psychiatry, Tan Tock Seng Hospital, Singapore 308433, Singapore
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore 308232, Singapore
- Udyavara Rajendra Acharya
- School of Mathematics, Physics and Computing, University of Southern Queensland, Springfield, QLD 4305, Australia
13
Li J, Huang Y, Hutton GJ, Aparasu RR. Assessing treatment switch among patients with multiple sclerosis: A machine learning approach. EXPLORATORY RESEARCH IN CLINICAL AND SOCIAL PHARMACY 2023; 11:100307. [PMID: 37554927 PMCID: PMC10405092 DOI: 10.1016/j.rcsop.2023.100307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Revised: 07/08/2023] [Accepted: 07/09/2023] [Indexed: 08/10/2023]
Abstract
BACKGROUND Patients with multiple sclerosis (MS) frequently switch their Disease-Modifying Agents (DMA) because of effectiveness and safety concerns. This study aimed to develop and compare the random forest (RF) machine learning (ML) model with the logistic regression (LR) model for predicting DMA switching among MS patients. METHODS This retrospective longitudinal study used the TriNetX data from a federated electronic medical records (EMR) network. Between September 2010 and May 2017, adult (aged ≥18) MS patients with ≥1 DMA prescription were identified, and the earliest DMA date was assigned as the index date. Patients prescribed any DMA different from their index DMA were considered to have switched treatment. The RF and LR models were built with 72 baseline characteristics and trained with 70% of the randomly split data after up-sampling. Area under the curve (AUC), accuracy, recall, G-measure, and F-1 score were used to evaluate model performance. RESULTS In this study, 7258 MS patients with ≥1 DMA were identified. Within two years, 16% of MS patients switched to a different DMA. The RF model obtained significantly better discrimination than the LR model (AUC = 0.65 vs. 0.63, p < 0.0001); however, the RF model had a similar predictive performance to the LR model with respect to F- and G-measures (RF: 72% and 73% vs. LR: 72% and 73%, respectively). The most influential features identified from the RF model were age, type of index medication, and year of index. CONCLUSIONS Compared to the LR model, RF performed better in predicting DMA switch in MS patients based on AUC measures; however, judged by F- and G-measures, the RF model performed similarly to LR. Further research is needed to understand the role of ML techniques in predicting treatment outcomes for the decision-making process to achieve optimal treatment goals.
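For readers unfamiliar with the F- and G-measures mentioned in this abstract, the following is a minimal sketch of how such scores can be computed from confusion-matrix counts. It assumes the G-measure is the geometric mean of precision and recall; the abstract does not state which definition the authors used, and the function name and counts below are hypothetical.

```python
import math

def f1_and_g_measure(tp, fp, fn):
    """Return (precision, recall, F1, G-measure) from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0 (no zero-division handling).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    # Assumption: G-measure taken as the geometric mean of precision and recall.
    g_measure = math.sqrt(precision * recall)
    return precision, recall, f1, g_measure

# Hypothetical counts for illustration only (not taken from the study).
precision, recall, f1, g = f1_and_g_measure(tp=60, fp=20, fn=40)
```

Both scores ignore true negatives, which may be one reason two models can differ in AUC while scoring alike on these measures.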
Affiliation(s)
- Jieni Li
- Department of Pharmaceutical Health Outcomes and Policy, College of Pharmacy, University of Houston, TX, USA
- Yinan Huang
- Department of Pharmacy Administration, College of Pharmacy, University of Mississippi, Oxford, MS, USA
- Rajender R Aparasu
- Department of Pharmaceutical Health Outcomes and Policy, College of Pharmacy, University of Houston, TX, USA
14
Machine Learning-Based Integration of Metabolomics Characterisation Predicts Progression of Myopic Retinopathy in Children and Adolescents. Metabolites 2023; 13:metabo13020301. [PMID: 36837920 PMCID: PMC9965721 DOI: 10.3390/metabo13020301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2023] [Revised: 02/11/2023] [Accepted: 02/16/2023] [Indexed: 02/22/2023]
Abstract
Myopic retinopathy is an important cause of irreversible vision loss and blindness. As metabolomics has recently been successfully applied in myopia research, this study sought to characterize the serum metabolic profile of myopic retinopathy in children and adolescents (4-18 years) and to develop a diagnostic model that combines clinical and metabolic features. We selected clinical and serum metabolic data from children and adolescents at different time points as the training set (n = 516) and the validation set (n = 60). All participants underwent an ophthalmologic examination. Untargeted metabolomics analysis of serum was performed. Three machine learning (ML) models were trained by combining metabolic features and conventional clinical factors that were screened for significance in discrimination. The better-performing model was validated in an independent point-in-time cohort and risk nomograms were developed. Retinopathy was present in 34.2% of participants (n = 185) in the training set, including 109 (28.61%) with mild to moderate myopia. A total of 27 metabolites showed significant variation between groups. After combining Lasso and random forest (RF), 12 modelled metabolites (mainly those involved in energy metabolism) were screened. Both the logistic regression and extreme Gradient Boosting (XGBoost) algorithms showed good discriminatory ability. In the time-validation cohort, logistic regression (AUC 0.842, 95% CI 0.724-0.96) and XGBoost (AUC 0.897, 95% CI 0.807-0.986) also showed good prediction accuracy and had well-fitted calibration curves. Three clinical characteristic coefficients remained significant in the multivariate joint model (p < 0.05), as did 8/12 metabolic characteristic coefficients. Myopic retinopathy may have abnormal energy metabolism. 
Machine learning models based on metabolic profiles and clinical data demonstrate good predictive performance and facilitate the development of individual interventions for myopia in children and adolescents.
15
Ha CSR, Müller-Nurasyid M, Petrera A, Hauck SM, Marini F, Bartsch DK, Slater EP, Strauch K. Proteomics biomarker discovery for individualized prevention of familial pancreatic cancer using statistical learning. PLoS One 2023; 18:e0280399. [PMID: 36701413 PMCID: PMC9879447 DOI: 10.1371/journal.pone.0280399] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/23/2022] [Accepted: 12/28/2022] [Indexed: 01/27/2023]
Abstract
BACKGROUND The low five-year survival rate of pancreatic ductal adenocarcinoma (PDAC) and the low diagnostic rate of early-stage PDAC via imaging highlight the need to discover novel biomarkers and improve the current screening procedures for early diagnosis. Familial pancreatic cancer (FPC) describes the cases of PDAC that are present in two or more individuals within a circle of first-degree relatives. Using innovative high-throughput proteomics, we were able to quantify the protein profiles of individuals at risk from FPC families in different potential pre-cancer stages. However, the high-dimensional proteomics data structure challenges the use of traditional statistical analysis tools. Hence, we applied advanced statistical learning methods to enhance the analysis and improve the results' interpretability. METHODS We applied model-based gradient boosting and adaptive lasso to deal with the small, unbalanced study design via simultaneous variable selection and model fitting. In addition, we used stability selection to identify a stable subset of selected biomarkers and, as a result, obtain even more interpretable results. In each step, we compared the performance of the different analytical pipelines and validated our approaches via simulation scenarios. RESULTS In the simulation study, model-based gradient boosting showed a more accurate prediction performance in the small, unbalanced, and high-dimensional datasets than adaptive lasso and could identify more relevant variables. Furthermore, using model-based gradient boosting, we discovered a subset of promising serum biomarkers that may potentially improve the current screening procedure of FPC. CONCLUSION Advanced statistical learning methods helped us overcome the shortcomings of an unbalanced study design in a valuable clinical dataset. 
The discovered serum biomarkers provide us with a clear direction for further investigations and more precise clinical hypotheses regarding the development of FPC and optimal strategies for its early detection.
Affiliation(s)
- Chung Shing Rex Ha
- Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
- Institute of Genetic Epidemiology, Helmholtz Zentrum München—German Research Center for Environmental Health, Neuherberg, Germany
- Faculty of Medicine, Institute for Medical Information Processing, Chair of Genetic Epidemiology, Biometry, and Epidemiology (IBE), LMU Munich, Munich, Germany
- Martina Müller-Nurasyid
- Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
- Institute of Genetic Epidemiology, Helmholtz Zentrum München—German Research Center for Environmental Health, Neuherberg, Germany
- Faculty of Medicine, Institute for Medical Information Processing, Biometry, and Epidemiology (IBE), LMU Munich, Munich, Germany
- Faculty of Medicine, Institute for Medical Information Processing, Pettenkofer School of Public Health Munich, Biometry, and Epidemiology (IBE), LMU Munich, Munich, Germany
- Agnese Petrera
- Research Unit Protein Science and Metabolomics and Proteomics Core Facility, Helmholtz Zentrum München—German Research Center for Environmental Health, Neuherberg, Germany
- Stefanie M. Hauck
- Research Unit Protein Science and Metabolomics and Proteomics Core Facility, Helmholtz Zentrum München—German Research Center for Environmental Health, Neuherberg, Germany
- Federico Marini
- Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
- Research Center for Immunotherapy (FZI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
- Detlef K. Bartsch
- Department of Visceral-, Thoracic- and Vascular Surgery, Philipps University, Marburg, Germany
- Emily P. Slater
- Department of Visceral-, Thoracic- and Vascular Surgery, Philipps University, Marburg, Germany
- Konstantin Strauch
- Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
- Institute of Genetic Epidemiology, Helmholtz Zentrum München—German Research Center for Environmental Health, Neuherberg, Germany
- Faculty of Medicine, Institute for Medical Information Processing, Chair of Genetic Epidemiology, Biometry, and Epidemiology (IBE), LMU Munich, Munich, Germany
16
Blood Plasma Metabolome Profiling at Different Stages of Renal Cell Carcinoma. Cancers (Basel) 2022; 15:cancers15010140. [PMID: 36612136 PMCID: PMC9818272 DOI: 10.3390/cancers15010140] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/29/2022] [Revised: 12/23/2022] [Accepted: 12/24/2022] [Indexed: 12/28/2022]
Abstract
Early diagnosis significantly improves the survival of patients with renal cell carcinoma (RCC), the most common type of adult kidney cancer. However, the absence of clinically obvious symptoms and of effective screening strategies at the early stages leads to disease progression and reduced survival. This study aimed to identify potential low-molecular-weight biomarkers for early-stage RCC. Untargeted direct injection mass spectrometry-based metabolite profiling was performed on blood plasma samples from 51 non-cancer volunteers (controls) and 78 patients with different RCC subtypes and stages (early stages of clear cell RCC (ccRCC), papillary RCC (pRCC), and chromophobe RCC (chrRCC), and advanced stages of ccRCC). Comparative analysis of blood plasma metabolites between the control and cancer groups identified metabolites associated with different tumor stages. A model built on these metabolites demonstrated high diagnostic power and accuracy. Overall, the metabolomics approach revealed metabolites of high value for designing a plasma-based test to improve early ccRCC diagnosis.
17
A Proposed Framework for Early Prediction of Schistosomiasis. Diagnostics (Basel) 2022; 12:diagnostics12123138. [PMID: 36553145 PMCID: PMC9777618 DOI: 10.3390/diagnostics12123138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 12/08/2022] [Accepted: 12/08/2022] [Indexed: 12/15/2022]
Abstract
Schistosomiasis is a neglected tropical disease that continues to be a leading cause of illness and mortality around the globe. The causative parasites attach to the skin through contact with contaminated water and then enter the human body. Failure to diagnose Schistosomiasis can result in various medical complications, such as ascites, portal hypertension, esophageal varices, splenomegaly, and growth retardation. Early prediction and identification of risk factors may aid in treating the disease before it becomes incurable. We aimed to create a framework that incorporates the most significant features to predict Schistosomiasis using machine learning techniques. A dataset of advanced Schistosomiasis comprising 4316 individuals, including both recovery and death cases, was used in this research. The dataset contains demographic, socioeconomic, and clinical factors along with lab reports. Data preprocessing techniques (missing value imputation, outlier removal, data normalisation, and data transformation) were employed for better results. Feature selection techniques, including correlation-based feature selection, Information gain, gain ratio, ReliefF, and OneR, were used to reduce the large number of features. Data resampling algorithms, including Random undersampling, Random oversampling, Cluster Centroid, Near miss, and SMOTE, were applied to address the data imbalance problem. We applied four machine learning algorithms to construct the model: Gradient Boosting, Light Gradient Boosting, Extreme Gradient Boosting, and CatBoost. The performance of the proposed framework was evaluated based on Accuracy, Precision, Recall, and F1-Score. The CatBoost model showed the best performance, with the highest accuracy (87.1%) compared with Gradient Boosting (86%), Light Gradient Boosting (86.7%), and Extreme Gradient Boosting (86.9%).
Our proposed framework will assist doctors and healthcare professionals in the early diagnosis of Schistosomiasis.
18
Sharma M, Kumar N. Improved hepatocellular carcinoma fatality prognosis using ensemble learning approach. JOURNAL OF AMBIENT INTELLIGENCE AND HUMANIZED COMPUTING 2022; 13:5763-5777. [DOI: 10.1007/s12652-021-03256-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/19/2020] [Accepted: 03/29/2021] [Indexed: 01/04/2025]
19
Retrieval and Assessment of Significant Wave Height from CYGNSS Mission Using Neural Network. REMOTE SENSING 2022. [DOI: 10.3390/rs14153666] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/06/2023]
Abstract
In this study, we investigate sea state estimation from spaceborne GNSS-R. Because electromagnetic waves scatter in a complex way on the rough sea surface, a neural network approach is adopted to develop an algorithm that derives significant wave height (SWH) from CYGNSS data. Eighty-nine million CYGNSS data records from September to November 2020 and the co-located ECMWF data are used to train a three-hidden-layer neural network. Ten variables are considered as the input parameters of the neural network. Without wind speed as an auxiliary input, the SWH retrieved using the trained neural network exhibits a bias of −0.13 m and an RMSE of 0.59 m with respect to ECMWF data. When wind speed is included as an input, the bias and RMSE are reduced to −0.09 m and 0.49 m, respectively. When the incidence angle ranges from 35° to 65° and the SNR is above 7 dB, the retrieval performance is better than that obtained using other values. The measurements derived from the “Block III” satellite offer worse results than those derived from other satellites. When the distance is considered as an input parameter, the retrieval performance for areas near the coast is significantly improved. A soft data filter is used to improve the precision while retaining the desired number of samples. The RMSEs of the retrieved SWH are reduced from 0.59 m and 0.49 m to 0.45 m and 0.41 m, with only 16.0% and 14.9% of the samples removed. The retrieved SWH also shows clear agreement with the co-located buoy and Jason-3 altimeter data.
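The bias and RMSE figures quoted in this abstract are standard error statistics. As a minimal sketch, assuming bias is defined as the mean of (retrieved minus reference) and RMSE as the root of the mean squared difference, they could be computed as follows. The function name and the sample values are hypothetical and are not taken from the study.

```python
import math

def bias_and_rmse(retrieved, reference):
    """Return (bias, RMSE) of retrieved values against reference values."""
    if len(retrieved) != len(reference) or not retrieved:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [r - t for r, t in zip(retrieved, reference)]
    # Bias: mean signed error; negative means the retrieval underestimates.
    bias = sum(errors) / len(errors)
    # RMSE: square root of the mean squared error.
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse

# Hypothetical SWH values in metres, for illustration only.
retrieved_swh = [1.8, 2.4, 0.9, 3.1]
reference_swh = [2.0, 2.5, 1.0, 3.0]
bias, rmse = bias_and_rmse(retrieved_swh, reference_swh)
```

A negative bias, as reported in the abstract, would indicate that the retrieved SWH is on average lower than the reference under this sign convention.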
20
Salvetat N, Checa-Robles FJ, Patel V, Cayzac C, Dubuc B, Chimienti F, Abraham JD, Dupré P, Vetter D, Méreuze S, Lang JP, Kupfer DJ, Courtet P, Weissmann D. A game changer for bipolar disorder diagnosis using RNA editing-based biomarkers. Transl Psychiatry 2022; 12:182. [PMID: 35504874 PMCID: PMC9064541 DOI: 10.1038/s41398-022-01938-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 02/03/2021] [Revised: 03/30/2022] [Accepted: 04/19/2022] [Indexed: 11/08/2022]
Abstract
In clinical practice, differentiating Bipolar Disorder (BD) from unipolar depression is a challenge because depressive symptoms are the core presentation of both disorders. Misdiagnosis during depressive episodes delays proper treatment and leads to poor management of the patient's condition. In a first step, using A-to-I RNA editome analysis, we discovered 646 variants (366 genes) differentially edited between depressed patients and healthy volunteers in a discovery cohort of 57 participants. After applying stringent criteria and biological pathway analysis, candidate biomarkers from 8 genes were singled out and tested in a validation cohort of 410 participants. Combining the selected biomarkers with a machine learning approach discriminated depressed patients (n = 267) from controls (n = 143) with an AUC of 0.930 (CI 95% [0.879-0.982]), a sensitivity of 84.0%, and a specificity of 87.1%. In a second step, by selecting among the depressed patients those with unipolar depression (n = 160) or BD (n = 95), we identified a combination of 6 biomarkers that allowed a differential diagnosis of bipolar disorder with an AUC of 0.935 and high specificity (Sp = 84.6%) and sensitivity (Se = 90.9%). The association of RNA editing variant modifications with depression subtypes, together with the use of artificial intelligence, enabled the development of a new tool to identify, among depressed patients, those suffering from BD. This test will help to reduce the delay caused by misdiagnosis of bipolar patients, leading to earlier implementation of proper treatment.
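Sensitivity and specificity, as reported in this abstract, are proportions derived from a confusion matrix. The sketch below shows the standard definitions; the function name and the counts are hypothetical and do not reproduce the study's data.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    Assumes at least one actual positive and one actual negative.
    """
    # Sensitivity: fraction of actual positives that test positive.
    sensitivity = tp / (tp + fn)
    # Specificity: fraction of actual negatives that test negative.
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical counts for illustration only.
se, sp = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
```

Unlike the AUC, both values depend on the decision threshold chosen for the classifier, so they describe a single operating point.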
Affiliation(s)
- Nicolas Salvetat
- ALCEDIAG/Sys2Diag, CNRS UMR 9005, Parc Euromédecine, Montpellier, France
- Vipul Patel
- ALCEDIAG/Sys2Diag, CNRS UMR 9005, Parc Euromédecine, Montpellier, France
- Christopher Cayzac
- ALCEDIAG/Sys2Diag, CNRS UMR 9005, Parc Euromédecine, Montpellier, France
- Benjamin Dubuc
- ALCEDIAG/Sys2Diag, CNRS UMR 9005, Parc Euromédecine, Montpellier, France
- Fabrice Chimienti
- ALCEDIAG/Sys2Diag, CNRS UMR 9005, Parc Euromédecine, Montpellier, France
- Pierrick Dupré
- ALCEDIAG/Sys2Diag, CNRS UMR 9005, Parc Euromédecine, Montpellier, France
- Diana Vetter
- ALCEDIAG/Sys2Diag, CNRS UMR 9005, Parc Euromédecine, Montpellier, France
- Sandie Méreuze
- ALCEDIAG/Sys2Diag, CNRS UMR 9005, Parc Euromédecine, Montpellier, France
- Jean-Philippe Lang
- ALCEDIAG/Sys2Diag, CNRS UMR 9005, Parc Euromédecine, Montpellier, France
- Les Toises. Center for Psychiatry and Psychotherapy, Lausanne, Switzerland
- David J Kupfer
- Department of Psychiatry, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Philippe Courtet
- Department of Psychiatric Emergency & Acute Care, Lapeyronie Hospital, CHU Montpellier, Montpellier, France
- Dinah Weissmann
- ALCEDIAG/Sys2Diag, CNRS UMR 9005, Parc Euromédecine, Montpellier, France
21
Chou EP, Yang SP. A virtual multi-label approach to imbalanced data classification. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2022.2049820] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Elizabeth P. Chou
- Department of Statistics, National Chengchi University, Taipei, Taiwan
- Shan-Ping Yang
- Department of Statistics, National Chengchi University, Taipei, Taiwan
22
Mojiri A, Khalili A, Zeinal Hamadani A. New hard-thresholding rules based on data splitting in high-dimensional imbalanced classification. Electron J Stat 2022. [DOI: 10.1214/21-ejs1939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Arezou Mojiri
- Department of Mathematical Sciences, Isfahan University of Technology, Isfahan, 19395-5746, Iran
- Abbas Khalili
- Department of Mathematics and Statistics, McGill University, Montreal, H3A 0B9, Canada
- Ali Zeinal Hamadani
- Department of Industrial and Systems Engineering, Isfahan University of Technology, Isfahan, 19395-5746, Iran
23
Pes B, Lai G. Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study. PeerJ Comput Sci 2021; 7:e832. [PMID: 35036539 PMCID: PMC8725666 DOI: 10.7717/peerj-cs.832] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/15/2021] [Accepted: 12/06/2021] [Indexed: 05/28/2023]
Abstract
High dimensionality and class imbalance have been largely recognized as important issues in machine learning. A vast amount of literature has indeed investigated suitable approaches to address the multiple challenges that arise when dealing with high-dimensional feature spaces (where each problem instance is described by a large number of features). As well, several learning strategies have been devised to cope with the adverse effects of imbalanced class distributions, which may severely impact on the generalization ability of the induced models. Nevertheless, although both the issues have been largely studied for several years, they have mostly been addressed separately, and their combined effects are yet to be fully understood. Indeed, little research has been so far conducted to investigate which approaches might be best suited to deal with datasets that are, at the same time, high-dimensional and class-imbalanced. To make a contribution in this direction, our work presents a comparative study among different learning strategies that leverage both feature selection, to cope with high dimensionality, as well as cost-sensitive learning methods, to cope with class imbalance. Specifically, different ways of incorporating misclassification costs into the learning process have been explored. Also different feature selection heuristics have been considered, both univariate and multivariate, to comparatively evaluate their effectiveness on imbalanced data. The experiments have been conducted on three challenging benchmarks from the genomic domain, gaining interesting insight into the beneficial impact of combining feature selection and cost-sensitive learning, especially in the presence of highly skewed data distributions.
Affiliation(s)
- Barbara Pes
- Dipartimento di Matematica e Informatica, Università degli Studi di Cagliari, Cagliari, Italy
- Giuseppina Lai
- Dipartimento di Matematica e Informatica, Università degli Studi di Cagliari, Cagliari, Italy
Collapse
|
24
|
Šinkovec H, Heinze G, Blagus R, Geroldinger A. To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets. BMC Med Res Methodol 2021; 21:199. [PMID: 34592945 PMCID: PMC8482588 DOI: 10.1186/s12874-021-01374-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/13/2021] [Accepted: 08/19/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND For finite samples with binary outcomes, penalized logistic regression such as ridge logistic regression has the potential to achieve smaller mean squared errors (MSE) of coefficients and predictions than maximum likelihood estimation. There is evidence, however, that ridge logistic regression can result in highly variable calibration slopes in small or sparse data situations. METHODS In this paper, we elaborate on this issue by performing a comprehensive simulation study, investigating the performance of ridge logistic regression in terms of coefficients and predictions and comparing it to Firth's correction, which has been shown to perform well in low-dimensional settings. In addition to tuned ridge regression, where the penalty strength is estimated from the data by minimizing some measure of the out-of-sample prediction error or an information criterion, we also considered ridge regression with a pre-specified degree of shrinkage. We included 'oracle' models in the simulation study in which the complexity parameter was chosen based on the true event probabilities (prediction oracle) or regression coefficients (explanation oracle) to demonstrate the capability of ridge regression if the truth were known. RESULTS Performance of ridge regression strongly depends on the choice of complexity parameter. As shown in our simulation and illustrated by a data example, values optimized in small or sparse datasets are negatively correlated with optimal values and suffer from substantial variability, which translates into large MSE of coefficients and large variability of calibration slopes. In contrast, in our simulations pre-specifying the degree of shrinkage prior to fitting led to accurate coefficients and predictions even in non-ideal settings such as those encountered in the context of rare outcomes or sparse predictors. CONCLUSIONS Applying tuned ridge regression in small or sparse datasets is problematic as it results in unstable coefficients and predictions. In contrast, determining the degree of shrinkage according to some meaningful prior assumptions about true effects has the potential to reduce bias and stabilize the estimates.
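To make the idea of a pre-specified degree of shrinkage concrete, the following is a minimal sketch of ridge logistic regression with a fixed penalty, fitted by plain gradient descent. It is not the authors' simulation code: the toy data, the penalty value of 2.0, and the function name are assumptions chosen for illustration, and only the slope (not the intercept) is penalized.

```python
import math


def fit_ridge_logistic(xs, ys, penalty, steps=2000, lr=0.5):
    """Fit intercept b0 and slope b1 by gradient descent.

    Minimizes the negative log-likelihood plus (penalty / 2) * b1**2.
    The penalty is fixed in advance rather than tuned from the data.
    """
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0, g1 = 0.0, 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y
            g1 += (p - y) * x
        g1 += penalty * b1  # gradient of the ridge term
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Hypothetical small dataset in which the classes overlap (not separable).
xs = [-2, -1, -1, 0, 0, 1, 1, 2]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

_, slope_ml = fit_ridge_logistic(xs, ys, penalty=0.0)      # maximum likelihood
_, slope_ridge = fit_ridge_logistic(xs, ys, penalty=2.0)   # pre-specified shrinkage
```

The penalized slope is pulled toward zero relative to the maximum likelihood slope. In tuned ridge regression the penalty would instead be chosen from the same small dataset, which is the step the abstract identifies as a source of instability.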
Affiliation(s)
- Hana Šinkovec
  - Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Spitalgasse 23, 1090, Vienna, Austria
- Georg Heinze
  - Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Spitalgasse 23, 1090, Vienna, Austria
- Rok Blagus
  - Institute for Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
- Angelika Geroldinger
  - Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Spitalgasse 23, 1090, Vienna, Austria
25
Lee SY, Lee ST, Suh S, Ko BJ, Oh HB. Revealing Unknown Controlled Substances and New Psychoactive Substances Using High-Resolution LC-MS/MS Machine Learning Models and the Hybrid Similarity Search Algorithm. J Anal Toxicol 2021; 46:732-742. [PMID: 34498039 DOI: 10.1093/jat/bkab098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2021] [Revised: 08/11/2021] [Accepted: 09/08/2021] [Indexed: 11/12/2022]
Abstract
Machine learning models based on high-resolution LC-MS/MS tandem mass spectra are constructed to address the analytical challenge of identifying unknown controlled substances and new psychoactive substances (NPSs). Using a training set of 770 LC-MS/MS barcode spectra (with binary entries 0 or 1) obtained mostly with high-resolution mass spectrometers, three classification machine learning models were generated and evaluated: artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbor (k-NN) models. In these models, controlled substances and NPSs were classified into 13 subgroups (benzylpiperazine, opiate, benzodiazepine, amphetamine, cocaine, methcathinone, classical cannabinoid, fentanyl, 2C series, indazole carbonyl compound, indole carbonyl compound, phencyclidine, and others). Using 193 LC-MS/MS barcode spectra as an external test set, the accuracies of the ANN, SVM, and k-NN models were 72.5%, 90.0%, and 94.3%, respectively. The hybrid similarity search (HSS) algorithm was also evaluated to examine whether it can successfully identify unknown controlled substances and NPSs whose data are unavailable in the database. When only 24 representative LC-MS/MS spectra of controlled substances and NPSs were selectively included in the database, HSS was found to identify compounds with high reliability. The machine learning models and the HSS algorithm are incorporated into our home-coded AI-SNPS (artificial intelligence screener for narcotic drugs and psychotropic substances) standalone software, which is equipped with a graphical user interface. This software allows unknown controlled substances and NPSs to be identified in a convenient manner.
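As a rough illustration of how a k-NN classifier can operate on binary barcode spectra, the sketch below votes among the nearest training barcodes. The abstract does not state the distance measure, the value of k, or the barcode length, so the Hamming distance, k = 3, and the six-bit toy barcodes are all assumptions; only the subgroup names come from the abstract.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length binary barcodes differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Majority vote among the k training barcodes closest to the query.

    train is a list of (barcode, subgroup_label) pairs.
    """
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical six-bit barcodes; real barcode spectra would be far longer.
train = [
    ((1, 1, 0, 0, 0, 0), "opiate"),
    ((1, 1, 1, 0, 0, 0), "opiate"),
    ((1, 0, 0, 0, 0, 0), "opiate"),
    ((0, 0, 0, 1, 1, 1), "fentanyl"),
    ((0, 0, 1, 1, 1, 0), "fentanyl"),
    ((0, 0, 0, 0, 1, 1), "fentanyl"),
]

first = knn_predict(train, (0, 1, 0, 0, 0, 0))
second = knn_predict(train, (0, 0, 0, 1, 1, 0))
```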
Affiliation(s)
- So Yeon Lee
  - Department of Chemistry, Sogang University, Seoul 04107, Republic of Korea
- Sang Tak Lee
  - Department of Chemistry, Sogang University, Seoul 04107, Republic of Korea
- Sungill Suh
  - Forensic Genetics & Chemistry Division, Supreme Prosecutors' Office, Seoul 06590, Republic of Korea
- Bum Jun Ko
  - Forensic Genetics & Chemistry Division, Supreme Prosecutors' Office, Seoul 06590, Republic of Korea
- Han Bin Oh
  - Department of Chemistry, Sogang University, Seoul 04107, Republic of Korea
26
Mohammed M, Mwambi H, Mboya IB, Elbashir MK, Omolo B. A stacking ensemble deep learning approach to cancer type classification based on TCGA data. Sci Rep 2021; 11:15626. [PMID: 34341396 PMCID: PMC8329290 DOI: 10.1038/s41598-021-95128-x] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 11/25/2020] [Accepted: 07/19/2021] [Indexed: 12/13/2022]
Abstract
Cancer tumor classification based on morphological characteristics alone has been shown to have serious limitations. Breast, lung, colorectal, thyroid, and ovarian cancers are the most commonly diagnosed cancers among women. Precise classification of cancers into their types is considered a vital problem for cancer diagnosis and therapy. In this paper, we proposed a stacking ensemble deep learning model based on a one-dimensional convolutional neural network (1D-CNN) to perform multi-class classification of the five common cancers among women based on RNASeq data. The RNASeq gene expression data were downloaded from the Pan-Cancer Atlas using the GDCquery function of the TCGAbiolinks package in R. We used the least absolute shrinkage and selection operator (LASSO) as the feature selection method. We compared the results of the proposed model, with and without LASSO, with the results of the single 1D-CNN and of machine learning methods that include support vector machines with radial basis function, linear, and polynomial kernels; artificial neural networks; k-nearest neighbors; and bagging trees. The results show that the proposed model, with and without LASSO, performs better than the other classifiers. The results also show that the machine learning methods (SVM-R, SVM-L, SVM-P, ANN, KNN, and bagging trees) perform better with under-sampling than with over-sampling techniques. This is supported by the statistical significance test of accuracy, where the p-values for differences between SVM-R and SVM-P, SVM-R and ANN, and SVM-R and KNN are p = 0.003, p < 0.001, and p < 0.001, respectively. SVM-L also differed significantly from ANN (p = 0.009). Moreover, SVM-P and ANN, and SVM-P and KNN, are significantly different, with p < 0.001 and p < 0.001, respectively. In addition, ANN and bagging trees, and ANN and KNN, were significantly different, with p < 0.001 and p = 0.004, respectively. Thus, the proposed model can help in the early detection and diagnosis of cancer in women, and hence aid in designing early treatment strategies to improve survival.
Affiliation(s)
- Mohanad Mohammed
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa
- Henry Mwambi
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa
- Innocent B Mboya
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa
  - Department of Epidemiology and Biostatistics, Kilimanjaro Christian Medical University College (KCMUCo), P. O. Box 2240, Moshi, Tanzania
- Murtada K Elbashir
  - College of Computer and Information Sciences, Jouf University, Sakaka, 72441, Saudi Arabia
  - Faculty of Mathematical and Computer Sciences, University of Gezira, Wad Madani, 11123, Sudan
- Bernard Omolo
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa
  - Division of Mathematics and Computer Science, University of South Carolina-Upstate, 800 University Way, Spartanburg, USA
  - School of Public Health, Faculty of Health Sciences, University of Witwatersrand, Johannesburg, South Africa
27
Abstract
Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has thus far been conducted on which approaches might be best suited to datasets that are class-imbalanced and high-dimensional at the same time (i.e., with a large number of features). This work aims to contribute to this challenging research area by studying the effectiveness of hybrid learning strategies that integrate feature selection techniques, to reduce the data dimensionality, with methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven to be effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach compared with using feature selection or imbalance learning methods alone.
28
Xu J, Chen Z, Zhang J, Lu Y, Yang X, Pumir A. Realistic preterm prediction based on optimized synthetic sampling of EHG signal. Comput Biol Med 2021; 136:104644. [PMID: 34271407 DOI: 10.1016/j.compbiomed.2021.104644] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/18/2021] [Revised: 07/07/2021] [Accepted: 07/07/2021] [Indexed: 01/28/2023]
Abstract
Preterm labor is the leading cause of neonatal morbidity and mortality and has attracted significant research attention from many scientific areas. The relationship between uterine contraction and the underlying electrical activities makes the uterine electrohysterogram (EHG) a promising direction for detecting and predicting preterm births. However, due to the scarcity of EHG signals, especially those leading to preterm births, synthetic algorithms have been used to generate artificial samples of the preterm birth type in order to eliminate bias in the prediction towards normal delivery, at the expense of reducing the feature effectiveness in automatic preterm detection based on machine learning. To address this problem, we quantify the effect of synthetic samples (balance coefficient) on the effectiveness of features and form a general performance metric by using several feature scores with relevant weights that describe their contributions to class segregation. In combination with the activation/inactivation functions that characterize the effect of the abundance of training samples on the accuracy of the prediction of preterm and normal birth delivery, we obtained an optimal sample balance coefficient that strikes a compromise between the effect of synthetic samples in removing bias toward the majority group (i.e., normal delivery) and the side effect of reducing the importance of features. A more realistic predictive accuracy was achieved through a series of numerical tests on the publicly available TPEHG database, demonstrating the effectiveness of the proposed method.
Affiliation(s)
- Jinshan Xu
  - College of Computer Science, Zhejiang University of Technology, Hangzhou, 310023, China
  - Research Center for AI Social Experiment, Zhejiang Lab, Hangzhou, 311321, China
- Zhenqin Chen
  - College of Computer Science, Zhejiang University of Technology, Hangzhou, 310023, China
- Jinpeng Zhang
  - College of Computer Science, Zhejiang University of Technology, Hangzhou, 310023, China
- Yanpei Lu
  - College of Computer Science, Zhejiang University of Technology, Hangzhou, 310023, China
- Xi Yang
  - College of Computer Science, Zhejiang University of Technology, Hangzhou, 310023, China
- Alain Pumir
  - Laboratoire de Physique, ENS-Lyon, Lyon, 69007, France
29
Improving Imbalanced Land Cover Classification with K-Means SMOTE: Detecting and Oversampling Distinctive Minority Spectral Signatures. INFORMATION 2021. [DOI: 10.3390/info12070266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Land cover maps are a critical tool to support informed policy development, planning, and resource management decisions. Because of its significant upsides, the automatic production of Land Use/Land Cover (LULC) maps has been a topic of interest for the remote sensing community for several years, but it is still fraught with technical challenges. One such challenge is the imbalanced nature of most remotely sensed data. The asymmetric class distribution negatively impacts the performance of classifiers and adds a new source of error to the production of these maps. In this paper, we address the imbalanced learning problem by combining K-means with the Synthetic Minority Oversampling Technique (SMOTE) as an improved oversampling algorithm. K-means SMOTE improves the quality of newly created artificial data by addressing not only the between-class imbalance, as traditional oversamplers do, but also the within-class imbalance, avoiding the generation of noisy data while effectively overcoming data imbalance. The performance of K-means SMOTE is compared to that of three popular oversampling methods (Random Oversampling, SMOTE and Borderline-SMOTE) using seven remote sensing benchmark datasets, three classifiers (Logistic Regression, K-Nearest Neighbors and Random Forest Classifier) and three evaluation metrics, with five-fold cross-validation and three different initialization seeds. The statistical analysis of the results shows that the proposed method consistently outperforms the remaining oversamplers, producing higher-quality land cover classifications. These results suggest that LULC data can benefit significantly from the use of more sophisticated oversamplers, as spectral signatures for the same class can vary according to geographical distribution.
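The interpolation step at the core of plain SMOTE can be sketched as follows. This is only the basic oversampling step: the K-means clustering that K-means SMOTE applies beforehand, to decide where synthetic samples should be generated, is deliberately left out. The function name, the parameter defaults, and the toy two-band "spectral signatures" are hypothetical.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic minority points by interpolating toward neighbors.

    Each new point lies on the segment between a randomly chosen minority
    point and one of its k nearest minority neighbors (Euclidean distance).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class pixels described by two spectral bands.
minority = [(0.10, 0.80), (0.15, 0.75), (0.20, 0.90), (0.12, 0.85)]
new_points = smote_sample(minority)
```

Because every synthetic point is a convex combination of two existing minority points, it stays inside the range already covered by the minority class in each band.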
30
Ullah Z, Saleem F, Jamjoom M, Fakieh B. Reliable Prediction Models Based on Enriched Data for Identifying the Mode of Childbirth by Using Machine Learning Methods: Development Study. J Med Internet Res 2021; 23:e28856. [PMID: 34085938 PMCID: PMC8214183 DOI: 10.2196/28856] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 03/16/2021] [Revised: 03/30/2021] [Accepted: 04/30/2021] [Indexed: 11/30/2022]
Abstract
Background The use of artificial intelligence has revolutionized every area of life such as business and trade, social and electronic media, education and learning, manufacturing industries, medicine and sciences, and every other sector. The new reforms and advanced technologies of artificial intelligence have enabled data analysts to transmute raw data generated by these sectors into meaningful insights for an effective decision-making process. Health care is one of the integral sectors where a large amount of data is generated daily, and making effective decisions based on these data is therefore a challenge. In this study, cases related to childbirth either by the traditional method of vaginal delivery or cesarean delivery were investigated. Cesarean delivery is performed to save both the mother and the fetus when complications related to vaginal birth arise. Objective The aim of this study was to develop reliable prediction models for a maternity care decision support system to predict the mode of delivery before childbirth. Methods This study was conducted in 2 parts for identifying the mode of childbirth: first, the existing data set was enriched and second, previous medical records about the mode of delivery were investigated using machine learning algorithms and by extracting meaningful insights from unseen cases. Several prediction models were trained to achieve this objective, such as decision tree, random forest, AdaBoostM1, bagging, and k-nearest neighbor, based on original and enriched data sets. Results The prediction models based on enriched data performed well in terms of accuracy, sensitivity, specificity, F-measure, and receiver operating characteristic curves in the outcomes. Specifically, the accuracy of k-nearest neighbor was 84.38%, that of bagging was 83.75%, that of random forest was 83.13%, that of decision tree was 81.25%, and that of AdaBoostM1 was 80.63%. Enrichment of the data set had a good impact on improving the accuracy of the prediction process, which supports maternity care practitioners in making decisions in critical cases. Conclusions Our study shows that enriching the data set improves the accuracy of the prediction process, thereby supporting maternity care practitioners in making informed decisions in critical cases. The enriched data set used in this study yields good results, but this data set can become even better if the records are increased with real clinical data.
Affiliation(s)
- Zahid Ullah
  - Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Farrukh Saleem
  - Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Mona Jamjoom
  - Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia
- Bahjat Fakieh
  - Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
31
Prediction model of random forest for the risk of hyperuricemia in a Chinese basic health checkup test. Biosci Rep 2021; 41:228123. [PMID: 33749777 PMCID: PMC8026814 DOI: 10.1042/bsr20203859] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/07/2020] [Revised: 03/08/2021] [Accepted: 03/18/2021] [Indexed: 02/04/2023]
Abstract
Objectives: The present study aimed to develop a random forest (RF)-based prediction model for hyperuricemia (HUA) and compare its performance with that of a conventional logistic regression (LR) model. Methods: This cross-sectional study recruited 91,690 participants (14,032 with HUA, 77,658 without HUA). We constructed an RF-based prediction model in the training sets and evaluated it in the validation sets. Performance of the RF model was compared with that of the LR model by receiver operating characteristic (ROC) curve analysis. Results: The sensitivity and specificity of the RF models were 0.702 and 0.650 in males and 0.767 and 0.721 in females. The positive predictive value (PPV) and negative predictive value (NPV) were 0.372 and 0.881 in males and 0.159 and 0.978 in females. The AUC of the RF models was 0.739 (0.728–0.750) in males and 0.818 (0.799–0.837) in females. The AUC of the LR models was 0.730 (0.718–0.741) in males and 0.815 (0.795–0.835) in females. The predictive power of RF was slightly higher than that of LR, but the difference was not statistically significant in females (DeLong tests, P=0.0015 for males, P=0.5415 for females). Conclusion: Compared with LR, RF showed good performance in predicting HUA status and tolerated associations and interactions among features, indicating great potential for further application. A prospective cohort is necessary to predict the development of HUA. People with high-risk factors should be encouraged to actively control them to reduce the probability of developing HUA.
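The relationship between the reported sensitivity, specificity, PPV, and NPV follows from Bayes' rule and can be sketched as below. The abstract does not report sex-specific prevalence, so the value 0.228 is an assumption chosen for illustration; with it, the formula gives values close to the PPV of 0.372 and NPV of 0.881 reported for males.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity and specificity are the male RF values from the abstract;
# the prevalence of 0.228 is an assumed, illustrative value.
ppv, npv = predictive_values(0.702, 0.650, prevalence=0.228)
```

The same function shows why the PPV is much lower in females (0.159) despite higher sensitivity and specificity: at lower prevalence, false positives outnumber true positives even for a more accurate model.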
32
Seifert S, Gundlach S, Junge O, Szymczak S. Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study. Bioinformatics 2021; 36:4301-4308. [PMID: 32399562 PMCID: PMC7520048 DOI: 10.1093/bioinformatics/btaa483] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 10/09/2019] [Revised: 03/13/2020] [Accepted: 05/05/2020] [Indexed: 12/12/2022]
Abstract
MOTIVATION High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets. RESULTS The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate. AVAILABILITY AND IMPLEMENTATION An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality controlled experimental datasets downloaded from Gene Expression Omnibus (GEO). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Stephan Seifert
  - Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
- Sven Gundlach
  - Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
- Olaf Junge
  - Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
- Silke Szymczak
  - Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
33
Kern F, Krammes L, Danz K, Diener C, Kehl T, Küchler O, Fehlmann T, Kahraman M, Rheinheimer S, Aparicio-Puerta E, Wagner S, Ludwig N, Backes C, Lenhof HP, von Briesen H, Hart M, Keller A, Meese E. Validation of human microRNA target pathways enables evaluation of target prediction tools. Nucleic Acids Res 2021; 49:127-144. [PMID: 33305319 PMCID: PMC7797041 DOI: 10.1093/nar/gkaa1161] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 04/25/2020] [Revised: 10/20/2020] [Accepted: 11/13/2020] [Indexed: 12/17/2022]
Abstract
MicroRNAs are regulators of gene expression. A wide-spread, yet not validated, assumption is that the targetome of miRNAs is non-randomly distributed across the transcriptome and that targets share functional pathways. We developed a computational and experimental strategy termed high-throughput miRNA interaction reporter assay (HiTmIR) to facilitate the validation of target pathways. First, targets and target pathways are predicted and prioritized by computational means to increase the specificity and positive predictive value. Second, the novel webtool miRTaH facilitates guided designs of reporter assay constructs at scale. Third, automated and standardized reporter assays are performed. We evaluated HiTmIR using miR-34a-5p, for which TNF- and TGFB-signaling, and Parkinson's Disease (PD)-related categories were identified and repeated the pipeline for miR-7-5p. HiTmIR validated 58.9% of the target genes for miR-34a-5p and 46.7% for miR-7-5p. We confirmed the targeting by measuring the endogenous protein levels of targets in a neuronal cell model. The standardized positive and negative targets are collected in the new miRATBase database, representing a resource for training, or benchmarking new target predictors. Applied to 88 target predictors with different confidence scores, TargetScan 7.2 and miRanda outperformed other tools. Our experiments demonstrate the efficiency of HiTmIR and provide evidence for an orchestrated miRNA-gene targeting.
Affiliation(s)
- Fabian Kern
  - Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
- Lena Krammes
  - Institute of Human Genetics, Saarland University, 66421 Homburg, Germany
- Karin Danz
  - Department of Bioprocessing & Bioanalytics, Fraunhofer Institute for Biomedical Engineering, 66280 Sulzbach, Germany
- Caroline Diener
  - Institute of Human Genetics, Saarland University, 66421 Homburg, Germany
- Tim Kehl
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Oliver Küchler
  - Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
- Tobias Fehlmann
  - Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
- Mustafa Kahraman
  - Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
- Ernesto Aparicio-Puerta
  - Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
  - Department of Genetics, Faculty of Science, University of Granada, 18071 Granada, Spain
  - Instituto de Investigación Biosanitaria ibs. Granada, University of Granada, 18071 Granada, Spain
- Sylvia Wagner
  - Department of Bioprocessing & Bioanalytics, Fraunhofer Institute for Biomedical Engineering, 66280 Sulzbach, Germany
- Nicole Ludwig
  - Institute of Human Genetics, Saarland University, 66421 Homburg, Germany
  - Center of Human and Molecular Biology, Saarland University, 66123 Saarbrücken, Germany
- Christina Backes
  - Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
- Hans-Peter Lenhof
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Hagen von Briesen
  - Department of Bioprocessing & Bioanalytics, Fraunhofer Institute for Biomedical Engineering, 66280 Sulzbach, Germany
- Martin Hart
  - Institute of Human Genetics, Saarland University, 66421 Homburg, Germany
- Andreas Keller
  - Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Germany
  - Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
  - Department of Neurology and Neurological Sciences, Stanford University School of Medicine, Stanford, CA, USA
- Eckart Meese
  - Institute of Human Genetics, Saarland University, 66421 Homburg, Germany
34
Nezu N, Usui Y, Saito A, Shimizu H, Asakage M, Yamakawa N, Tsubota K, Wakabayashi Y, Narimatsu A, Umazume K, Maruyama K, Sugimoto M, Kuroda M, Goto H. Machine Learning Approach for Intraocular Disease Prediction Based on Aqueous Humor Immune Mediator Profiles. Ophthalmology 2021; 128:1197-1208. [PMID: 33484732 DOI: 10.1016/j.ophtha.2021.01.019] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 06/02/2020] [Revised: 12/20/2020] [Accepted: 01/14/2021] [Indexed: 01/02/2023]
Abstract
PURPOSE Various immune mediators have crucial roles in the pathogenesis of intraocular diseases. Machine learning can be used to automatically select and weigh various predictors to develop models maximizing predictive power. However, these techniques have not yet been applied extensively in studies focused on intraocular diseases. We evaluated whether 5 machine learning algorithms applied to the data of immune-mediator levels in aqueous humor can predict the actual diagnoses of 17 selected intraocular diseases and identified which immune mediators drive the predictive power of a machine learning model. DESIGN Cross-sectional study. PARTICIPANTS Five hundred twelve eyes with diagnoses from among 17 intraocular diseases. METHODS Aqueous humor samples were collected, and the concentrations of 28 immune mediators were determined using a cytometric bead array. Each immune mediator was ranked according to its importance using 5 machine learning algorithms. Stratified k-fold cross-validation was used in evaluation of algorithms with the dataset divided into training and test datasets. MAIN OUTCOME MEASURES The algorithms were evaluated in terms of precision, recall, accuracy, F-score, area under the receiver operating characteristic curve, area under the precision-recall curve, and mean decrease in Gini index. RESULTS Among the 5 machine learning models, random forest (RF) yielded the highest classification accuracy in multiclass differentiation of 17 intraocular diseases. The RF prediction models for vitreoretinal lymphoma, acute retinal necrosis, endophthalmitis, rhegmatogenous retinal detachment, and primary open-angle glaucoma achieved the highest classification accuracy, precision, and recall. Random forest recognized vitreoretinal lymphoma, acute retinal necrosis, endophthalmitis, rhegmatogenous retinal detachment, and primary open-angle glaucoma with the top 5 F-scores. 
The 3 highest-ranking relevant immune mediators were interleukin (IL)-10, interferon-γ-inducible protein (IP)-10, and angiogenin for prediction of vitreoretinal lymphoma; monokine induced by interferon γ, interferon γ, and IP-10 for acute retinal necrosis; and IL-6, granulocyte colony-stimulating factor, and IL-8 for endophthalmitis. CONCLUSIONS Random forest algorithms based on 28 immune mediators in aqueous humor successfully predicted the diagnosis of vitreoretinal lymphoma, acute retinal necrosis, and endophthalmitis. Overall, the findings of the present study contribute to increased knowledge on new biomarkers that potentially can facilitate diagnosis of intraocular diseases in the future.
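As a rough illustration of the stratified k-fold cross-validation mentioned in the abstract above, the sketch below assigns sample indices to folds so that each fold keeps roughly the same class proportions. It uses only the Python standard library. The function name `stratified_folds`, the toy labels, and the round-robin assignment are assumptions made for illustration and are not taken from the study.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Return k lists of sample indices with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical toy labels: 10 samples of class "A" and 5 of class "B".
labels = ["A"] * 10 + ["B"] * 5
folds = stratified_folds(labels, k=5)
for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(len(labels)) if i not in held_out]
    # A classifier would be fitted on `train` and scored on `test_fold` here.
    print(len(train), sorted(labels[i] for i in test_fold))
```

Each fold serves once as the test set while the remaining folds form the training set, so every sample is scored exactly once.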
Affiliation(s)
- Naoya Nezu
- Department of Ophthalmology, Tokyo Medical University Hospital, Tokyo, Japan
| | - Yoshihiko Usui
- Department of Ophthalmology, Tokyo Medical University Hospital, Tokyo, Japan.
| | - Akira Saito
- Department of AI Applied Quantitative Clinical Science, Tokyo Medical University, Tokyo, Japan
| | - Hiroyuki Shimizu
- Department of Ophthalmology, Tokyo Medical University Hospital, Tokyo, Japan
| | - Masaki Asakage
- Department of Ophthalmology, Tokyo Medical University Hospital, Tokyo, Japan
| | - Naoyuki Yamakawa
- Department of Ophthalmology, Tokyo Medical University Hospital, Tokyo, Japan
| | - Kinya Tsubota
- Department of Ophthalmology, Tokyo Medical University Hospital, Tokyo, Japan
| | | | - Akitomo Narimatsu
- Department of Ophthalmology, Tokyo Medical University Hospital, Tokyo, Japan
| | - Kazuhiko Umazume
- Department of Ophthalmology, Tokyo Medical University Hospital, Tokyo, Japan
| | - Katsuhiko Maruyama
- Department of Ophthalmology, Tokyo Medical University Hospital, Tokyo, Japan
| | - Masahiro Sugimoto
- Research and Development Center for Minimally Invasive Therapies, Institute of Medical Science, Tokyo Medical University, Tokyo, Japan
| | - Masahiko Kuroda
- Department of Molecular Pathology, Tokyo Medical University, Tokyo, Japan
| | - Hiroshi Goto
- Department of Ophthalmology, Tokyo Medical University Hospital, Tokyo, Japan
| |
|
35
|
Abd Rahman HA, Wah YB, Huat OS. Predictive Performance of Logistic Regression for Imbalanced Data with Categorical Covariate. PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY 2021; 29. [DOI: 10.47836/pjst.29.1.10] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 09/01/2023]
Abstract
Logistic regression is often used for the classification of a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data lead to biased parameter estimates and affect the classification performance of the logistic regression model. Imbalanced data occur when the number of cases in one category of the binary dependent variable is much smaller than in the other category. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio, on the parameter estimate of the binary logistic regression with a categorical covariate. Datasets were simulated with controlled imbalance ratios (IR) from 1% to 50% and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using the mean square error (MSE). The simulation results provided evidence that the effect of the imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depended on sample size: for sample sizes of 100, 500, 1000–2000, and 2500–3500, the estimates were biased for IR below 30%, 10%, 5%, and 2%, respectively. Results also showed that parameter estimates were all biased at IR 1% for all sample sizes. An application using a real dataset supported the simulation results.
|
36
|
Rahman HAA, Wah YB, Huat OS. Predictive Performance of Logistic Regression for Imbalanced Data with Categorical Covariate. PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY 2020; 28. [DOI: 10.47836/pjst.28.4.02] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Abstract
Logistic regression is often used for the classification of a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data lead to biased parameter estimates and affect the classification performance of the logistic regression model. Imbalanced data occur when the number of cases in one category of the binary dependent variable is much smaller than in the other category. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio, on the parameter estimate of the binary logistic regression with a categorical covariate. Datasets were simulated with controlled imbalance ratios (IR) from 1% to 50% and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using the mean square error (MSE). The simulation results provided evidence that the effect of the imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depends on sample size: for sample sizes of 100, 500, 1000–2000, and 2500–3500, the estimates were biased for IR below 30%, 10%, 5%, and 2%, respectively. Results also showed that parameter estimates were all biased at IR 1% for all sample sizes. An application using a real dataset supported the simulation results.
|
37
|
Assessing and mitigating the effects of class imbalance in machine learning with application to X-ray imaging. Int J Comput Assist Radiol Surg 2020; 15:2041-2048. [PMID: 32965624 DOI: 10.1007/s11548-020-02260-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 05/08/2020] [Accepted: 09/04/2020] [Indexed: 10/23/2022]
Abstract
PURPOSE Machine learning (ML) algorithms are well known to exhibit variations in prediction accuracy when provided with imbalanced training sets typically seen in medical imaging (MI) due to the imbalanced ratio of pathological and normal cases. This paper presents a thorough investigation of the effects of class imbalance and methods for mitigating class imbalance in ML algorithms applied to MI. METHODS We first selected five classes from the Image Retrieval in Medical Applications (IRMA) dataset, performed multiclass classification using the random forest model (RFM), and then performed binary classification using convolutional neural network (CNN) on a chest X-ray dataset. An imbalanced class was created in the training set by varying the number of images in that class. Methods tested to mitigate class imbalance included oversampling, undersampling, and changing class weights of the RFM. Model performance was assessed by overall classification accuracy, overall F1 score, and specificity, recall, and precision of the imbalanced class. RESULTS A close-to-balanced training set resulted in the best model performance, and a large imbalance with overrepresentation was more detrimental to model performance than underrepresentation. Oversampling and undersampling methods were both effective in mitigating class imbalance, and efficacy of oversampling techniques was class specific. CONCLUSION This study systematically demonstrates the effect of class imbalance on two public X-ray datasets on RFM and CNN, making these findings widely applicable as a reference. Furthermore, the methods employed here can guide researchers in assessing and addressing the effects of class imbalance, while considering the data-specific characteristics to optimize imbalance mitigating methods.
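The oversampling and undersampling strategies named in the abstract above can be sketched in a few lines. This is a minimal illustration of random resampling only, using the Python standard library; the function names, the toy labels, and the choice of purely random duplication and removal are assumptions, and the study may have used different variants.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate random samples of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_samples, out_labels = [], []
    for cls in counts:
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for s in rng.sample(pool, target):
            out_samples.append(s)
            out_labels.append(cls)
    return out_samples, out_labels

# Hypothetical training set: 10 normal images and 2 pathological ones (IDs only).
samples = list(range(12))
labels = ["normal"] * 10 + ["pathological"] * 2
_, over_labels = random_oversample(samples, labels)
_, under_labels = random_undersample(samples, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

Resampling is normally applied to the training split only, because duplicating samples before splitting would place copies of the same sample in both the training and test sets.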
|
38
|
Diaz M, Panangadan A. Natural Language-based Integration of Online Review Datasets for Identification of Sex Trafficking Businesses. 2020 IEEE 21ST INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION FOR DATA SCIENCE : IRI 2020 : PROCEEDINGS : VIRTUAL CONFERENCE, 11-13 AUGUST 2020. IEEE INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (21ST : 2... 2020; 2020:259-264. [PMID: 34853666 DOI: 10.1109/iri49571.2020.00044] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022]
Abstract
There is increasing interest in automatically identifying advertisements related to sex trafficking in online review sites. The main challenge is to identify the changing patterns in text reviews that are used to indicate illegal businesses. This work describes a novel means of identifying illegal business advertisements using natural language processing and machine learning. The method relies on building a training set of reviews of known illegal businesses. This training data is created by integrating a small high-precision set of known illegal businesses (Rubmaps) with a large collection of online reviews from a general-purpose review site (Yelp). Standard natural language pre-processing techniques are then applied to the text reviews, which are converted into a bag-of-words model with term frequency-inverse document frequency (TF-IDF) weighting. The resulting document-term matrix is used to train a classifier and then to identify suspicious activity from the remaining reviews. This approach therefore leverages a high-precision, low-recall dataset to identify relevant instances from the large low-precision, high-recall dataset. The approach was evaluated on a collection of 456,050 reviews from the Yelp online forum with a variety of machine learning algorithms and different numbers of text features. The method achieved an F1-score of 0.77 with a random forest classifier. The number of text features could also be reduced from 1,473 to 447 for a compact classifier with only a small drop in accuracy.
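To make the bag-of-words weighting step concrete, the following sketch computes one common TF-IDF variant, tf × log(N / df), with the Python standard library. The whitespace tokenizer, the exact formula, and the toy reviews are assumptions for illustration; the paper does not state which variant or pre-processing it used.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy reviews, not taken from the datasets in the paper.
reviews = ["great food great service", "great staff", "slow service"]
vectors = tfidf(reviews)
for vector in vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

Terms that occur in few documents receive higher weights than terms spread across many documents, which is why "food" outweighs "service" in the first toy review.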
Affiliation(s)
- Maria Diaz
- Department of Computer Science, California State University, Fullerton, Fullerton, California 92831, USA
| | - Anand Panangadan
- Department of Computer Science, California State University, Fullerton, Fullerton, California 92831, USA
| |
|
39
|
Franco J, Rajwa B, Ferreira CR, Sundberg JP, HogenEsch H. Lipidomic Profiling of the Epidermis in a Mouse Model of Dermatitis Reveals Sexual Dimorphism and Changes in Lipid Composition before the Onset of Clinical Disease. Metabolites 2020; 10:metabo10070299. [PMID: 32708296 PMCID: PMC7408197 DOI: 10.3390/metabo10070299] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/16/2020] [Revised: 07/17/2020] [Accepted: 07/18/2020] [Indexed: 02/07/2023]
Abstract
Atopic dermatitis (AD) is a multifactorial disease associated with alterations in lipid composition and organization in the epidermis. Multiple variants of AD exist with different outcomes in response to therapies. The evaluation of disease progression and response to treatment are observational assessments with poor inter-observer agreement highlighting the need for molecular markers. SHARPIN-deficient mice (Sharpincpdm) spontaneously develop chronic proliferative dermatitis with features similar to AD in humans. To study the changes in the epidermal lipid-content during disease progression, we tested 72 epidermis samples from three groups (5-, 7-, and 10-weeks old) of cpdm mice and their WT littermates. An agnostic mass-spectrometry strategy for biomarker discovery termed multiple-reaction monitoring (MRM)-profiling was used to detect and monitor 1,030 lipid ions present in the epidermis samples. In order to select the most relevant ions, we utilized a two-tiered filter/wrapper feature-selection strategy. Lipid categories were compressed, and an elastic-net classifier was used to rank and identify the most predictive lipid categories for sex, phenotype, and disease stages of cpdm mice. The model accurately classified the samples based on phospholipids, cholesteryl esters, acylcarnitines, and sphingolipids, demonstrating that disease progression cannot be defined by one single lipid or lipid category.
Affiliation(s)
- Jackeline Franco
- Department of Comparative Pathobiology, Purdue University, West Lafayette, IN 47907, USA
| | - Bartek Rajwa
- Bindley Bioscience Center, Purdue University, West Lafayette, IN 47907, USA
- Correspondence: (B.R.); (H.H.)
| | - Christina R. Ferreira
- Metabolite Profiling Facility, Bindley Bioscience Center, Purdue University, West Lafayette, IN 47907, USA
| | | | - Harm HogenEsch
- Department of Comparative Pathobiology, Purdue University, West Lafayette, IN 47907, USA
- Purdue Institute of Inflammation, Immunology and Infectious Diseases, Purdue University, West Lafayette, IN 47907, USA
- Correspondence: (B.R.); (H.H.)
| |
|
40
|
Chiang CH, Lee J, Wang C, Williams AJ, Lucas TH, Cohen YE, Viventi J. A modular high-density μECoG system on macaque vlPFC for auditory cognitive decoding. J Neural Eng 2020; 17:046008. [PMID: 32498058 DOI: 10.1088/1741-2552/ab9986] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 12/23/2022]
Abstract
OBJECTIVE A fundamental goal of the auditory system is to parse the auditory environment into distinct perceptual representations. Auditory perception is mediated by the ventral auditory pathway, which includes the ventrolateral prefrontal cortex (vlPFC). Because large-scale recordings of auditory signals are quite rare, the spatiotemporal resolution of the neuronal code that underlies vlPFC's contribution to auditory perception has not been fully elucidated. Therefore, we developed a modular, chronic, high-resolution, multi-electrode array system with long-term viability in order to identify the information that could be decoded from μECoG vlPFC signals. APPROACH We molded three separate μECoG arrays into one and implanted this system in a non-human primate. A custom 3D-printed titanium chamber was mounted on the left hemisphere. The molded 294-contact μECoG array was implanted subdurally over the vlPFC. μECoG activity was recorded while the monkey participated in a 'hearing-in-noise' task in which it reported hearing a 'target' vocalization from a background 'chorus' of vocalizations. We titrated task difficulty by varying the sound level of the target vocalization, relative to the chorus (target-to-chorus ratio, TCr). MAIN RESULTS We decoded the TCr and the monkey's behavioral choices from the μECoG signal. We analyzed decoding accuracy as a function of number of electrodes, spatial resolution, and time from implantation. Over a one-year period, we found significant decoding with individual electrodes that increased significantly as we decoded from more electrodes simultaneously. Further, we found that the decoding for behavioral choice was better than the decoding of TCr. Finally, because the decoding accuracy of individual electrodes varied on a day-by-day basis, electrode arrays with high channel counts ensure robust decoding in the long term. SIGNIFICANCE Our results demonstrate the utility of high-resolution and high-channel-count, chronic μECoG recording. 
We developed a surface electrode array that can be scaled to cover larger cortical areas without increasing the chamber footprint.
Affiliation(s)
- Chia-Han Chiang
- Department of Biomedical Engineering, Duke University, Durham, NC, United States of America. These authors contributed equally to this work
| |
|
41
|
Abdulrauf Sharifai G, Zainol Z. Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm. Genes (Basel) 2020; 11:genes11070717. [PMID: 32605144 PMCID: PMC7397300 DOI: 10.3390/genes11070717] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/26/2019] [Revised: 12/19/2019] [Accepted: 01/07/2020] [Indexed: 11/16/2022]
Abstract
Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but a massive number of features (high dimensionality). High-dimensional and imbalanced data sets have posed severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers have investigated either imbalanced-class or high-dimensional data sets and come up with various methods. Nonetheless, few approaches reported in the literature have addressed the intersection of the high-dimensional and imbalanced-class problems due to their complicated interactions. Lately, feature selection has become a well-known technique for overcoming this problem by selecting discriminative features that represent the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to construct the feature selection process as an optimisation problem to select the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by the proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high-dimensional and imbalanced datasets in terms of the G-mean and the Area Under the Curve (AUC) performance metrics.
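The G-mean cited in the abstract above is commonly defined as the geometric mean of sensitivity and specificity. The sketch below assumes that definition and uses hypothetical confusion-matrix counts to show why the metric penalizes a classifier that ignores the minority class.

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (recall on positives) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# Hypothetical data set: 5 minority positives and 95 majority negatives.
# Predicting "negative" for everything is 95% accurate but has zero sensitivity.
always_negative = g_mean(tp=0, fn=5, tn=95, fp=0)
# A classifier with 80% sensitivity and 80% specificity on the same data.
balanced = g_mean(tp=4, fn=1, tn=76, fp=19)
print(always_negative, round(balanced, 3))
```

Because the two rates are multiplied, a high score requires acceptable performance on both classes rather than on the majority class alone.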
Affiliation(s)
- Garba Abdulrauf Sharifai
- Department of Computer Sciences, Yusuf Maitama Sule University, 700222 Kofar Nassarawa, Kano, Nigeria
- School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Malaysia
- Correspondence: Tel.: +60-111-317-0481 or +60-194-004-327
| | - Zurinahni Zainol
- School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Malaysia
| |
|
42
|
Fu GH, Wu YJ, Zong MJ, Pan J. Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data. BMC Bioinformatics 2020; 21:121. [PMID: 32293252 PMCID: PMC7092448 DOI: 10.1186/s12859-020-3411-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/05/2019] [Accepted: 02/12/2020] [Indexed: 11/11/2022]
Abstract
Background Feature selection in class-imbalance learning has gained increasing attention in recent years due to the massive growth of high-dimensional class-imbalanced data across many scientific fields. In addition to reducing model complexity and discovering key biomarkers, feature selection is also an effective method of combating the class overlap that may arise in such data and become a crucial factor in determining classification performance. However, ordinary feature selection techniques for classification cannot simply be used for class-imbalanced data without any adjustment. Thus, more efficient feature selection techniques must be developed for complicated class-imbalanced data, especially in the context of high dimensionality. Results We proposed an algorithm called sssHD to achieve stable sparse feature selection and applied it to complicated class-imbalanced data. sssHD is based on the Hellinger distance (HD) coupled with sparse regularization techniques. We stated that the Hellinger distance is not only class-insensitive but also translation-invariant. Simulation results indicate that the HD-based selection algorithm is effective in recognizing key features and controlling false discoveries for class-imbalance learning. Five gene expression datasets are also employed to test the performance of the sssHD algorithm, and a comparison with several existing selection procedures is performed. The results show that sssHD is highly competitive in terms of five assessment metrics. In addition, sssHD presents limited differences between performing and not performing re-balance preprocessing. Conclusions sssHD is a practical feature selection method for high-dimensional class-imbalanced data, which is simple and can be an alternative for performing feature selection in class-imbalanced data. sssHD can be easily extended by connecting it with different re-balance preprocessing, different sparse regularization structures as well as different classifiers. 
As such, the algorithm is extremely general and has a wide range of applicability.
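The class-insensitivity claim in the abstract above can be illustrated with a small sketch. It computes the Hellinger distance between the class-conditional histograms of a single binned feature and shows that replicating the majority class leaves the distance unchanged. This is not the sssHD algorithm itself: the function names, the toy data, and the 1/√2 normalization (some texts omit it) are assumptions made for illustration.

```python
import math
from collections import Counter

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same bins."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)

def class_histogram(values, labels, target, bins):
    """Relative frequency of each bin among samples whose label equals `target`."""
    selected = [v for v, y in zip(values, labels) if y == target]
    counts = Counter(selected)
    return [counts[b] / len(selected) for b in bins]

def feature_hd(values, labels, bins):
    """Hellinger distance between the class-1 and class-0 histograms of a feature."""
    p = class_histogram(values, labels, 1, bins)
    q = class_histogram(values, labels, 0, bins)
    return hellinger(p, q)

# Hypothetical binned feature values for a minority and a majority class.
bins = [0, 1, 2]
minority = [0, 0, 1, 2]
majority = [1, 2, 2, 2]

hd_balanced = feature_hd(minority + majority, [1] * 4 + [0] * 4, bins)
# Make the majority class ten times larger without changing its distribution.
hd_imbalanced = feature_hd(minority + majority * 10, [1] * 4 + [0] * 40, bins)
print(round(hd_balanced, 4), round(hd_imbalanced, 4))
```

The distance depends only on the shape of each class-conditional distribution, not on how many samples each class contributes, so it can rank features without first re-balancing the data. With real data, drawing more majority samples would change the empirical histogram slightly, so the two values would agree only approximately.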
Affiliation(s)
- Guang-Hui Fu
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China.
| | - Yuan-Jiao Wu
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China
| | - Min-Jie Zong
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China
| | - Jianxin Pan
- School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK
| |
|
43
|
Hemmerich J, Asilar E, Ecker GF. COVER: conformational oversampling as data augmentation for molecules. J Cheminform 2020; 12:18. [PMID: 33430975 PMCID: PMC7080709 DOI: 10.1186/s13321-020-00420-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 08/26/2019] [Accepted: 02/18/2020] [Indexed: 01/09/2023]
Abstract
Training neural networks with small and imbalanced datasets often leads to overfitting and disregard of the minority class. For predictive toxicology, however, models with a good balance between sensitivity and specificity are needed. In this paper we introduce conformational oversampling as a means to balance and oversample datasets for prediction of toxicity. Conformational oversampling enhances a dataset by generation of multiple conformations of a molecule. These conformations can be used to balance, as well as oversample a dataset, thereby increasing the dataset size without the need of artificial samples. We show that conformational oversampling facilitates training of neural networks and provides state-of-the-art results on the Tox21 dataset.
Affiliation(s)
- Jennifer Hemmerich
- Department of Pharmaceutical Chemistry, University of Vienna, Althanstr 14, Vienna, Austria
| | - Ece Asilar
- Department of Pharmaceutical Chemistry, University of Vienna, Althanstr 14, Vienna, Austria
| | - Gerhard F. Ecker
- Department of Pharmaceutical Chemistry, University of Vienna, Althanstr 14, Vienna, Austria
| |
|
44
|
Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 2020; 21:6. [PMID: 31898477 PMCID: PMC6941312 DOI: 10.1186/s12864-019-6413-7] [Citation(s) in RCA: 1451] [Impact Index Per Article: 290.2] [Received: 05/24/2019] [Accepted: 12/18/2019] [Indexed: 02/06/2023]
Abstract
BACKGROUND To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has been reached on a single measure of choice yet. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular adopted metrics in binary classification tasks. However, these statistical measures can dangerously show overoptimistic inflated results, especially on imbalanced datasets. RESULTS The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate which produces a high score only if the prediction obtained good results in all of the four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset. CONCLUSIONS In this article, we show how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining the mathematical properties, and then the asset of MCC in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score in evaluating binary classification tasks by all scientific communities.
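A short worked example may help show the effect described in the abstract above. The sketch computes accuracy, F1 score, and MCC from the four confusion-matrix counts using their standard definitions. The counts are hypothetical (91 positives, 9 negatives) and are not presented as one of the paper's own use cases.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Return (accuracy, F1, MCC) computed from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a marginal total is zero; returning 0.0 is one convention.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, f1, mcc

# Hypothetical classifier that labels almost every sample as positive.
accuracy, f1, mcc = binary_metrics(tp=90, fn=1, tn=1, fp=8)
print(f"accuracy={accuracy:.3f} F1={f1:.3f} MCC={mcc:.3f}")
```

Here accuracy is 0.91 and F1 is about 0.95, yet MCC is only about 0.20, because the classifier gets just one of the nine negatives right and MCC accounts for all four cells of the confusion matrix.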
Affiliation(s)
- Davide Chicco
- Krembil Research Institute, Toronto, Ontario, Canada
- Peter Munk Cardiac Centre, Toronto, Ontario, Canada
| | | |
|
45
|
A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2019.07.070] [Citation(s) in RCA: 100] [Impact Index Per Article: 16.7] [Indexed: 11/20/2022]
|
46
|
Sarkar D, Saha S. Machine-learning techniques for the prediction of protein-protein interactions. J Biosci 2019; 44:104. [PMID: 31502581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Protein-protein interactions (PPIs) are important for the study of protein functions and pathways involved in different biological processes, as well as for understanding the cause and progression of diseases. Several high-throughput experimental techniques have been employed for the identification of PPIs in a few model organisms, but still, there is a huge gap in identifying all possible binary PPIs in an organism. Therefore, PPI prediction using machine-learning algorithms has been used in conjunction with experimental methods for discovery of novel protein interactions. The two most popular supervised machine-learning techniques used in the prediction of PPIs are support vector machines and random forest classifiers. Bayesian-probabilistic inference has also been used but mainly for the scoring of high-throughput PPI dataset confidence measures. Recently, deep-learning algorithms have been used for sequence-based prediction of PPIs. Several clustering methods such as hierarchical and k-means are useful as unsupervised machine-learning algorithms for the prediction of interacting protein pairs without explicit data labelling. In summary, machine-learning techniques have been widely used for the prediction of PPIs thus allowing experimental researchers to study cellular PPI networks.
|
48
|
Sarkar D, Saha S. Machine-learning techniques for the prediction of protein–protein interactions. J Biosci 2019. [DOI: 10.1007/s12038-019-9909-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 02/06/2023]
|
49
|
Koh HWL, Fermin D, Vogel C, Choi KP, Ewing RM, Choi H. iOmicsPASS: network-based integration of multiomics data for predictive subnetwork discovery. NPJ Syst Biol Appl 2019; 5:22. [PMID: 31312515 PMCID: PMC6616462 DOI: 10.1038/s41540-019-0099-y] [Citation(s) in RCA: 69] [Impact Index Per Article: 11.5] [Received: 11/01/2018] [Accepted: 06/14/2019] [Indexed: 12/15/2022]
Abstract
Computational tools for multiomics data integration have usually been designed for unsupervised detection of multiomics features explaining large phenotypic variations. To achieve this, some approaches extract latent signals in heterogeneous data sets from a joint statistical error model, while others use biological networks to propagate differential expression signals and find consensus signatures. However, few approaches directly consider molecular interaction as a data feature, the essential linker between different omics data sets. The increasing availability of genome-scale interactome data connecting different molecular levels motivates a new class of methods to extract interactive signals from multiomics data. Here we developed iOmicsPASS, a tool to search for predictive subnetworks consisting of molecular interactions within and between related omics data types in a supervised analysis setting. Based on user-provided network data and relevant omics data sets, iOmicsPASS computes a score for each molecular interaction, and applies a modified nearest shrunken centroid algorithm to the scores to select densely connected subnetworks that can accurately predict each phenotypic group. iOmicsPASS detects a sparse set of predictive molecular interactions without loss of prediction accuracy compared to alternative methods, and the selected network signature immediately provides mechanistic interpretation of the multiomics profile representing each sample group. Extensive simulation studies demonstrate clear benefit of interaction-level modeling. iOmicsPASS analysis of TCGA/CPTAC breast cancer data also highlights new transcriptional regulatory network underlying the basal-like subtype as positive protein markers, a result not seen through analysis of individual omics data.
Affiliation(s)
- Hiromi W. L. Koh
- Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Saw Swee Hock School of Public Health, National University of Singapore, Singapore, Singapore
| | - Damian Fermin
- University of Michigan Medical School, Ann Arbor, MI USA
| | - Christine Vogel
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY 10003 USA
| | - Kwok Pui Choi
- Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
| | - Rob M. Ewing
- School of Biological Sciences, University of Southampton, Southampton, UK
| | - Hyungwon Choi
- Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Saw Swee Hock School of Public Health, National University of Singapore, Singapore, Singapore
- Institute of Molecular and Cell Biology, Agency for Science, Technology and Research, Singapore, Singapore
| |
|
50
|
Large-Area, High Spatial Resolution Land Cover Mapping Using Random Forests, GEOBIA, and NAIP Orthophotography: Findings and Recommendations. REMOTE SENSING 2019. [DOI: 10.3390/rs11121409] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 01/08/2023]
Abstract
Despite the need for quality land cover information, large-area, high spatial resolution land cover mapping has proven to be a difficult task for a variety of reasons including large data volumes, complexity of developing training and validation datasets, data availability, and heterogeneity in data and landscape conditions. We investigate the use of geographic object-based image analysis (GEOBIA), random forest (RF) machine learning, and National Agriculture Imagery Program (NAIP) orthophotography for mapping general land cover across the entire state of West Virginia, USA, an area of roughly 62,000 km². We obtained an overall accuracy of 96.7% and a Kappa statistic of 0.886 using a combination of NAIP orthophotography and ancillary data. Despite the high overall classification accuracy, some classes were difficult to differentiate, as highlighted by the low user’s and producer’s accuracies for the barren, impervious, and mixed developed classes. In contrast, forest, low vegetation, and water were generally mapped with accuracy. The inclusion of ancillary data and first- and second-order textural measures generally improved classification accuracy whereas band indices and object geometric measures were less valuable. Including super-object attributes improved the classification slightly; however, this increased the computational time and complexity. From the findings of this research and previous studies, recommendations are provided for mapping large spatial extents.
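For readers unfamiliar with the two summary statistics reported in the abstract above, the sketch below computes overall accuracy and Cohen's kappa from a confusion matrix using their standard definitions. The three-class matrix is hypothetical and does not reproduce the study's data or its reported values.

```python
def overall_accuracy_and_kappa(matrix):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the products of the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical confusion matrix for three land cover classes (200 samples).
confusion = [
    [90, 5, 5],
    [4, 40, 6],
    [1, 2, 47],
]
accuracy, kappa = overall_accuracy_and_kappa(confusion)
print(round(accuracy, 3), round(kappa, 3))
```

Kappa discounts the agreement expected by chance, so it is lower than overall accuracy whenever some of the correct assignments could arise from the class frequencies alone.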
|