1
Wang DC, Xu WD, Qin Z, Fu L, Lan YY, Liu XY, Huang AF. Systemic lupus erythematosus with high disease activity identification based on machine learning. Inflamm Res 2023; 72:1909-1918. [PMID: 37725103 DOI: 10.1007/s00011-023-01793-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2023] [Revised: 08/22/2023] [Accepted: 08/28/2023] [Indexed: 09/21/2023]
Abstract
OBJECTIVE Clinical evaluation of systemic lupus erythematosus (SLE) disease activity is limited and inconsistent, and high disease activity seriously affects SLE patients. This study aims to develop a machine learning model to identify SLE patients with high disease activity. METHOD A total of 1014 SLE patients with low disease activity and 453 SLE patients with high disease activity were included. A total of 94 clinical and laboratory variables and 17 meteorological indicators were collected. After data preprocessing, mutual information and MultiSURF were used to evaluate feature importance and select features. The selected features were used for machine learning modeling. Model performance was evaluated and verified with a series of binary classification metrics. RESULTS Integrated feature selection screened out hematuria, proteinuria, pyuria, low complement, precipitation, sunlight and other features for model construction. After hyperparameter optimization, the light gradient boosting (LGB) model had the best performance (ROC: AUC = 0.930; PRC: AUC = 0.911, APS = 0.913; balanced accuracy: 0.856), and naive Bayes had the worst (ROC: AUC = 0.849; PRC: AUC = 0.719, APS = 0.714; balanced accuracy: 0.705). Finally, the selected features showed good consistency in the composite feature importance bar plot. CONCLUSION A simple machine learning pipeline can identify SLE patients with high disease activity, especially the LGB model built on proteinuria, hematuria, pyuria and other features screened out by collective feature selection.
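The mutual-information half of the feature scoring described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy data, the feature names in the dictionary, and the `mutual_information` helper are all hypothetical, and only discrete (binary) features are handled.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data: 1 = high disease activity, 0 = low disease activity.
high_activity = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "proteinuria": [1, 1, 1, 0, 0, 0, 0, 1],    # mostly tracks the label
    "uninformative": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}

scores = {name: mutual_information(col, high_activity) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A feature that is statistically independent of the label scores zero, so ranking by this score pushes it to the bottom of the list.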
Affiliation(s)
- Da-Cheng Wang
  - Department of Evidence-Based Medicine, Southwest Medical University, 1 Xianglin Road, Luzhou, 646000, Sichuan, China
- Wang-Dong Xu
  - Department of Evidence-Based Medicine, Southwest Medical University, 1 Xianglin Road, Luzhou, 646000, Sichuan, China
- Zhen Qin
  - Department of Rheumatology and Immunology, Affiliated Hospital of Southwest Medical University, 25 Taiping Road, Luzhou, 646000, Sichuan, China
- Lu Fu
  - Laboratory Animal Center, Southwest Medical University, 1 Xianglin Road, Luzhou, 646000, Sichuan, China
- You-Yu Lan
  - Department of Rheumatology and Immunology, Affiliated Hospital of Southwest Medical University, 25 Taiping Road, Luzhou, 646000, Sichuan, China
- Xiao-Yan Liu
  - Department of Evidence-Based Medicine, Southwest Medical University, 1 Xianglin Road, Luzhou, 646000, Sichuan, China
- An-Fang Huang
  - Department of Rheumatology and Immunology, Affiliated Hospital of Southwest Medical University, 25 Taiping Road, Luzhou, 646000, Sichuan, China
2
Wang DC, Xu WD, Wang SN, Wang X, Leng W, Fu L, Liu XY, Qin Z, Huang AF. Lupus nephritis or not? A simple and clinically friendly machine learning pipeline to help diagnosis of lupus nephritis. Inflamm Res 2023:10.1007/s00011-023-01755-7. [PMID: 37300586 DOI: 10.1007/s00011-023-01755-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Revised: 05/17/2023] [Accepted: 05/30/2023] [Indexed: 06/12/2023]
Abstract
OBJECTIVE Diagnosis of lupus nephritis (LN) is a complex process, which usually requires renal biopsy. We aim to establish a machine learning pipeline to help diagnose LN. METHODS A cohort of 681 systemic lupus erythematosus (SLE) patients without LN and 786 SLE patients with LN was established, and a total of 95 clinical and laboratory variables and 17 meteorological indicators were collected. After tenfold cross-validation, the patients were divided into a training set and a test set. The features selected by a collective feature selection method combining mutual information (MI) and MultiSURF were used to construct logistic regression, decision tree, random forest, naive Bayes, support vector machine (SVM), light gradient boosting (LGB), extreme gradient boosting (XGB), and artificial neural network (ANN) models; the models were compared and verified in post-analysis. RESULTS The collective feature selection method screened out antistreptolysin (ASO), retinol binding protein (RBP), lupus anticoagulant 1 (LA1), LA2, proteinuria and other features, and the hyperparameter-optimized XGB model (ROC: AUC = 0.995; PRC: AUC = 1.000, APS = 1.000; balanced accuracy: 0.990) had the best performance, followed by LGB (ROC: AUC = 0.992; PRC: AUC = 0.997, APS = 0.977; balanced accuracy: 0.957). The naive Bayes model had the worst performance (ROC: AUC = 0.799; PRC: AUC = 0.822, APS = 0.823; balanced accuracy: 0.693). In the composite feature importance bar plots, ASO, RF, Up/Ucr, and other features play important roles in LN. CONCLUSION We developed and validated a new and simple machine learning pathway for diagnosis of LN, especially the XGB model based on ASO, LA1, LA2, proteinuria, and other features screened out by collective feature selection.
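Balanced accuracy, one of the metrics reported above, is the mean of sensitivity and specificity. The sketch below is illustrative only: the labels are hypothetical toy data, the function names are assumptions, and both classes are assumed to be present so that neither denominator is zero.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Hypothetical imbalanced toy cohort: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10  # a trivial classifier that never predicts the positive class

print(accuracy(y_true, always_negative))           # 0.8
print(balanced_accuracy(y_true, always_negative))  # 0.5
```

The trivial classifier looks acceptable on plain accuracy but scores no better than chance on balanced accuracy, which is why the latter is the more informative figure when class sizes differ.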
Affiliation(s)
- Da-Cheng Wang
  - Department of Evidence-Based Medicine, Southwest Medical University, 1 Xianglin Road, Luzhou, Sichuan, China
- Wang-Dong Xu
  - Department of Evidence-Based Medicine, Southwest Medical University, 1 Xianglin Road, Luzhou, Sichuan, China
- Shen-Nan Wang
  - Luzhou Meteorological Bureau, 3 Songshan Road, Luzhou, Sichuan, China
- Xiang Wang
  - Luzhou Meteorological Bureau, 3 Songshan Road, Luzhou, Sichuan, China
- Wei Leng
  - Luzhou Meteorological Bureau, 3 Songshan Road, Luzhou, Sichuan, China
- Lu Fu
  - Laboratory Animal Center, Southwest Medical University, 1 Xianglin Road, Luzhou, Sichuan, China
- Xiao-Yan Liu
  - Department of Evidence-Based Medicine, Southwest Medical University, 1 Xianglin Road, Luzhou, Sichuan, China
- Zhen Qin
  - Department of Rheumatology and Immunology, Affiliated Hospital of Southwest Medical University, 25 Taiping Road, Luzhou, Sichuan, China
- An-Fang Huang
  - Department of Rheumatology and Immunology, Affiliated Hospital of Southwest Medical University, 25 Taiping Road, Luzhou, Sichuan, China
3
Evidence for Epistatic Interaction between HLA-G and LILRB1 in the Pathogenesis of Nonsegmental Vitiligo. Cells 2023; 12:cells12040630. [PMID: 36831297 PMCID: PMC9954564 DOI: 10.3390/cells12040630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2022] [Revised: 12/31/2022] [Accepted: 01/29/2023] [Indexed: 02/18/2023]
Abstract
Vitiligo is the most frequent cause of depigmentation worldwide. Genetic association studies have discovered about 50 loci associated with disease, many with immunological functions. Among them is HLA-G, which modulates immunity by interacting with specific inhibitory receptors, mainly LILRB1 and LILRB2. Here we investigated the LILRB1 and LILRB2 association with vitiligo risk and evaluated the possible role of interactions between HLA-G and its receptors in this pathogenesis. We tested the association of the polymorphisms of HLA-G, LILRB1, and LILRB2 with vitiligo using logistic regression along with adjustment by ancestry. Further, methods based on the multifactor dimensionality reduction (MDR) approach (MDR v.3.0.2, GMDR v.0.9, and MB-MDR) were used to detect potential epistatic interactions between polymorphisms from the three genes. An interaction involving rs9380142 and rs2114511 polymorphisms was identified by all methods used. The polymorphism rs9380142 is an HLA-G 3'UTR variant (+3187) with a well-established role in mRNA stability. The polymorphism rs2114511 is located in the exonic region of LILRB1. Although no association involving this SNP has been reported, ChIP-Seq experiments have identified this position as an EBF1 binding site. These results highlight the role of an epistatic interaction between HLA-G and LILRB1 in vitiligo pathogenesis.
4
Hwang S, Urbanowicz R, Lynch S, Vernon T, Bresz K, Giraldo C, Kennedy E, Leabhart M, Bleacher T, Ripchinski MR, Mowery DL, Oyer RA. Toward Predicting 30-Day Readmission Among Oncology Patients: Identifying Timely and Actionable Risk Factors. JCO Clin Cancer Inform 2023; 7:e2200097. [PMID: 36809006 PMCID: PMC10476733 DOI: 10.1200/cci.22.00097] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/01/2022] [Revised: 09/05/2022] [Accepted: 01/13/2023] [Indexed: 02/23/2023]
Abstract
PURPOSE Predicting 30-day readmission risk is paramount to improving the quality of patient care. In this study, we compare sets of patient-, provider-, and community-level variables that are available at two different points of a patient's inpatient encounter (first 48 hours and the full encounter) to train readmission prediction models and identify possible targets for appropriate interventions that can potentially reduce avoidable readmissions. METHODS Using electronic health record data from a retrospective cohort of 2,460 oncology patients and a comprehensive machine learning analysis pipeline, we trained and tested models predicting 30-day readmission on the basis of data available within the first 48 hours of admission and from the entire hospital encounter. RESULTS Leveraging all features, the light gradient boosting model produced higher but comparable performance (area under receiver operating characteristic curve [AUROC]: 0.711) relative to the Epic model (AUROC: 0.697). Given features from the first 48 hours, the random forest model produced a higher AUROC (0.684) than the Epic model (AUROC: 0.676). Both models flagged patients with a similar distribution of race and sex; however, our light gradient boosting and random forest models were more inclusive, flagging more patients among younger age groups. The Epic models were more sensitive in identifying patients with a lower average zip income. Our 48-hour models were powered by novel features at various levels: patient (weight change over 365 days, depression symptoms, laboratory values, and cancer type), hospital (winter discharge and hospital admission type), and community (zip income and marital status of partner). CONCLUSION We developed and validated models comparable with the existing Epic 30-day readmission models with several novel actionable insights that could create service interventions deployed by the case management or discharge planning teams that may decrease readmission rates over time.
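For readers unfamiliar with the AUROC figures compared above: the metric equals the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison on hypothetical toy scores; the function name and data are assumptions, and the quadratic loop is meant for illustration rather than large cohorts.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data: 1 = readmitted within 30 days, 0 = not readmitted.
readmitted = [1, 0, 1, 0, 0]
risk_score = [0.9, 0.4, 0.35, 0.2, 0.1]

# 5 of the 6 positive/negative pairs are ordered correctly.
print(auroc(readmitted, risk_score))
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect ranking.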
Affiliation(s)
- Sy Hwang
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA
- Ryan Urbanowicz
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA
  - Department of Biostatistics, Epidemiology, & Informatics, University of Pennsylvania, Philadelphia, PA
- Selah Lynch
  - Department of Biostatistics, Epidemiology, & Informatics, University of Pennsylvania, Philadelphia, PA
- Tawnya Vernon
  - Ann B. Barshinger Cancer Institute (ABBCI), University of Pennsylvania, Philadelphia, PA
- Kellie Bresz
  - Ann B. Barshinger Cancer Institute (ABBCI), University of Pennsylvania, Philadelphia, PA
- Carolina Giraldo
  - Ann B. Barshinger Cancer Institute (ABBCI), University of Pennsylvania, Philadelphia, PA
  - Osteopathic Medicine, Philadelphia College of Osteopathic Medicine, Philadelphia, PA
- Erin Kennedy
  - Department of Nursing, University of Pennsylvania, Philadelphia, PA
- Max Leabhart
  - Ann B. Barshinger Cancer Institute (ABBCI), University of Pennsylvania, Philadelphia, PA
- Troy Bleacher
  - Ann B. Barshinger Cancer Institute (ABBCI), University of Pennsylvania, Philadelphia, PA
- Michael R. Ripchinski
  - Ann B. Barshinger Cancer Institute (ABBCI), University of Pennsylvania, Philadelphia, PA
- Danielle L. Mowery
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA
  - Department of Biostatistics, Epidemiology, & Informatics, University of Pennsylvania, Philadelphia, PA
  - Abramson Cancer Center, University of Pennsylvania, Philadelphia, PA
- Randall A. Oyer
  - Ann B. Barshinger Cancer Institute (ABBCI), University of Pennsylvania, Philadelphia, PA
5
Chushig-Muzo D, Soguero-Ruiz C, Miguel Bohoyo PD, Mora-Jiménez I. Learning and visualizing chronic latent representations using electronic health records. BioData Min 2022; 15:18. [PMID: 36064616 PMCID: PMC9446539 DOI: 10.1186/s13040-022-00303-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/14/2021] [Accepted: 07/27/2022] [Indexed: 12/03/2022]
Abstract
Background Nowadays, patients with chronic diseases such as diabetes and hypertension have reached alarming numbers worldwide. These diseases increase the risk of developing acute complications and involve a substantial economic burden and demand for health resources. The widespread adoption of Electronic Health Records (EHRs) is opening great opportunities for supporting decision-making. Nevertheless, data extracted from EHRs are complex (heterogeneous, high-dimensional and usually noisy), hampering knowledge extraction with conventional approaches. Methods We propose the use of the Denoising Autoencoder (DAE), a Machine Learning (ML) technique that transforms high-dimensional data into latent representations (LRs), thus addressing the main challenges with clinical data. We explore in this work how the combination of LRs with a visualization method can be used to map the patient data in a two-dimensional space, gaining knowledge about the distribution of patients with different chronic conditions. Furthermore, this representation can also be used to characterize the patient’s health status evolution, which is of paramount importance in the clinical setting. Results To obtain clinical LRs, we considered real-world data extracted from EHRs linked to the University Hospital of Fuenlabrada in Spain. Experimental results showed the great potential of DAEs to identify patients with clinical patterns linked to hypertension, diabetes and multimorbidity. The procedure allowed us to find patients with the same main chronic disease but different clinical characteristics. Thus, we identified two kinds of diabetic patients with differences in their drug therapy (insulin and non-insulin dependent), and also a group of women affected by hypertension and gestational diabetes. We also present a proof of concept for mapping the health status evolution of synthetic patients when considering the most significant diagnoses and drugs associated with chronic patients.
Conclusion Our results highlighted the value of ML techniques to extract clinical knowledge, supporting the identification of patients with certain chronic conditions. Furthermore, the patient’s health status progression on the two-dimensional space might be used as a tool for clinicians aiming to characterize health conditions and identify their more relevant clinical codes. Supplementary Information The online version contains supplementary material available at (10.1186/s13040-022-00303-z).
Affiliation(s)
- David Chushig-Muzo
  - Department of Signal Theory and Communications, Telematics and Computing Systems, Rey Juan Carlos University, Madrid, Spain
- Cristina Soguero-Ruiz
  - Department of Signal Theory and Communications, Telematics and Computing Systems, Rey Juan Carlos University, Madrid, Spain
- Inmaculada Mora-Jiménez
  - Department of Signal Theory and Communications, Telematics and Computing Systems, Rey Juan Carlos University, Madrid, Spain
6
Pudjihartono N, Fadason T, Kempa-Liehr AW, O'Sullivan JM. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Frontiers in Bioinformatics 2022; 2:927312. [PMID: 36304293 PMCID: PMC9580915 DOI: 10.3389/fbinf.2022.927312] [Citation(s) in RCA: 75] [Impact Index Per Article: 37.5] [Received: 04/24/2022] [Accepted: 06/03/2022] [Indexed: 01/14/2023]
Abstract
Machine learning has shown utility in detecting patterns within large, unstructured, and complex datasets. One of the promising applications of machine learning is in precision medicine, where disease risk is predicted using patient genetic data. However, creating an accurate prediction model based on genotype data remains challenging due to the so-called “curse of dimensionality” (i.e., a far larger number of features than samples). Therefore, the generalizability of machine learning models benefits from feature selection, which aims to extract only the most “informative” features and remove noisy, “non-informative,” irrelevant and redundant features. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.
Affiliation(s)
- Tayaza Fadason
  - Liggins Institute, University of Auckland, Auckland, New Zealand
  - Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- Andreas W. Kempa-Liehr
  - Department of Engineering Science, The University of Auckland, Auckland, New Zealand
  - *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
- Justin M. O'Sullivan
  - Liggins Institute, University of Auckland, Auckland, New Zealand
  - Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
  - MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
  - Singapore Institute for Clinical Sciences, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  - Australian Parkinson’s Mission, Garvan Institute of Medical Research, Sydney, NSW, Australia
  - *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
7
Chicco D, Faultless T. Brief Survey on Machine Learning in Epistasis. Methods Mol Biol 2021; 2212:169-179. [PMID: 33733356 DOI: 10.1007/978-1-0716-0947-7_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/09/2023]
Abstract
In biology, the term "epistasis" indicates the effect of the interaction of a gene with another gene. A gene can interact with an independently sorted gene, located far away on the chromosome or on an entirely different chromosome, and this interaction can have a strong effect on the function of the two genes. These changes then can alter the consequences of the biological processes, influencing the organism's phenotype. Machine learning is an area of computer science that develops statistical methods able to recognize patterns from data. A typical machine learning algorithm consists of a training phase, where the model learns to recognize specific trends in the data, and a test phase, where the trained model applies its learned intelligence to recognize trends in external data. Scientists have applied machine learning to epistasis problems multiple times, especially to identify gene-gene interactions from genome-wide association study (GWAS) data. In this brief survey, we report and describe the main scientific articles published in data mining and epistasis. Our article confirms the effectiveness of machine learning in this genetics subfield.
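A toy illustration of why gene-gene interactions call for methods beyond single-gene tests: in the hypothetical population below, the phenotype is the exclusive-or of two binary loci, so each locus on its own shows no association while the pair determines the phenotype completely. The data and names are assumptions made for the sketch, not material from the survey.

```python
from itertools import product

# Hypothetical, fully balanced toy population: two binary loci, phenotype = locus_a XOR locus_b.
# Each row is (locus_a, locus_b, phenotype), with 25 individuals per genotype combination.
samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2) for _ in range(25)]

def case_rate(rows):
    """Fraction of rows whose phenotype is 1."""
    return sum(r[2] for r in rows) / len(rows)

# Phenotype rate conditioned on each locus alone.
marginal_a = {g: case_rate([r for r in samples if r[0] == g]) for g in (0, 1)}
marginal_b = {g: case_rate([r for r in samples if r[1] == g]) for g in (0, 1)}

# Phenotype rate conditioned on both loci together.
joint = {
    (a, b): case_rate([r for r in samples if r[0] == a and r[1] == b])
    for a, b in product([0, 1], repeat=2)
}

print(marginal_a, marginal_b)  # every value is 0.5: no single-locus signal
print(joint)                   # every value is 0.0 or 1.0: the pair is fully predictive
```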
Affiliation(s)
- Davide Chicco
  - Krembil Research Institute, Toronto, Ontario, Canada
8
A supervised machine learning-based methodology for analyzing dysregulation in splicing machinery: An application in cancer diagnosis. Artif Intell Med 2020; 108:101950. [PMID: 32972670 DOI: 10.1016/j.artmed.2020.101950] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/15/2019] [Revised: 08/15/2020] [Accepted: 08/18/2020] [Indexed: 02/06/2023]
Abstract
Deregulated splicing machinery components have been shown to be associated with the development of several types of cancer and, therefore, the determination of such alterations can help the development of tumor-specific molecular targets for early prognosis and therapy. Determining such splicing components, however, is not a straightforward task, mainly due to the heterogeneity of tumors, the variability across samples, and the fat-short characteristic of genomic datasets. In this work, a supervised machine learning-based methodology is proposed, allowing the determination of subsets of relevant splicing components that best discriminate samples. The methodology comprises three main phases: first, a ranking of features is determined by applying feature weighting algorithms that compute the importance of each splicing component; second, the best subset of features that allows the induction of an accurate classifier is determined by conducting an effective heuristic search; third, the confidence in the induced classifier is assessed by explaining its individual predictions and its global behavior. Finally, an extensive experimental study was conducted on a large collection of transcript-based datasets, illustrating the utility and benefit of the proposed methodology for analyzing dysregulation in splicing machinery.
9
Sharif Bidabadi S, Murray I, Lee GYF, Morris S, Tan T. Classification of foot drop gait characteristic due to lumbar radiculopathy using machine learning algorithms. Gait Posture 2019; 71:234-240. [PMID: 31082655 DOI: 10.1016/j.gaitpost.2019.05.010] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 03/06/2019] [Revised: 04/13/2019] [Accepted: 05/03/2019] [Indexed: 02/02/2023]
Abstract
BACKGROUND Recently, the study of walking gait has received significant attention due to the importance of identifying disorders relating to gait patterns. Characterisation and classification of different common gait disorders such as foot drop in an effective and accurate manner can lead to improved diagnosis, prognosis assessment, and treatment. However, visual inspection is currently the main clinical method to evaluate gait disorders; it relies on the subjective judgement of the observer, leading to inaccuracies. RESEARCH QUESTION This study examines whether it is feasible to use commercial off-the-shelf inertial measurement unit sensors and supervised learning methods to distinguish foot drop gait disorder from the normal walking gait pattern. METHOD Gait data were collected from 56 adults diagnosed with foot drop due to L5 lumbar radiculopathy (with MRI-verified compressive pathology) and 30 adults with normal gait during multiple walking trials on a flat surface. Machine learning algorithms were applied to the inertial sensor data to investigate the feasibility of classifying foot drop disorder. RESULTS The three best-performing results were 88.45%, 86.87% and 86.08% accuracy, derived from the Random Forest, SVM, and Naive Bayes classifiers respectively. After applying the wrapper feature selection technique, the top performance was from the Random Forest classifier with an overall accuracy of 93.18%. SIGNIFICANCE The combination of inertial sensors and machine learning algorithms provides a promising and feasible solution for differentiating L5 radiculopathy-related foot drop from normal walking gait patterns. The implication of this finding is to provide an objective method to help clinical decision making.
Affiliation(s)
- Shiva Sharif Bidabadi
  - School of Civil and Mechanical Engineering, Curtin University of Technology, Perth, Australia
- Iain Murray
  - School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University of Technology, Perth, Australia
- Gabriel Yin Foo Lee
  - St John of God Subiaco Hospital Perth, Australia; School of Surgery of University of Western Australia, Australia
- Susan Morris
  - School of Physiotherapy and Exercise Science, Curtin University of Technology, Perth, Australia
- Tele Tan
  - School of Civil and Mechanical Engineering, Curtin University of Technology, Perth, Australia
10
Kafaie S, Chen Y, Hu T. A network approach to prioritizing susceptibility genes for genome-wide association studies. Genet Epidemiol 2019; 43:477-491. [PMID: 30859622 DOI: 10.1002/gepi.22198] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/29/2018] [Revised: 01/31/2019] [Accepted: 02/25/2019] [Indexed: 12/22/2022]
Abstract
The heritability of complex diseases including cancer is often attributed to multiple interacting genetic alterations. Such a non-linear, non-additive gene-gene interaction effect, that is, epistasis, renders univariable analysis methods ineffective for genome-wide association studies. In recent years, network science has seen increasing applications in modeling epistasis to characterize the complex relationships between a large number of genetic variations and the phenotypic outcome. In this study, by constructing a statistical epistasis network of colorectal cancer (CRC), we proposed to use multiple network measures to prioritize genes that influence the disease risk of CRC through synergistic interaction effects. We computed and analyzed several global and local properties of the large CRC epistasis network. We utilized topological properties of network vertices such as the edge strength, vertex centrality, and occurrence at different graphlets to identify genes that may be of potential biological relevance to CRC. We found 512 top-ranked single-nucleotide polymorphisms, among which COL22A1, RGS7, WWOX, and CELF2 were the four susceptibility genes prioritized by all described metrics as the most influential on CRC.
Affiliation(s)
- Somayeh Kafaie
  - Department of Computer Science, Memorial University, St. John's, NL, Canada
- Yuanzhu Chen
  - Department of Computer Science, Memorial University, St. John's, NL, Canada
- Ting Hu
  - Department of Computer Science, Memorial University, St. John's, NL, Canada
11
Bobak CA, Titus AJ, Hill JE. Comparison of common machine learning models for classification of tuberculosis using transcriptional biomarkers from integrated datasets. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2018.10.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 10/28/2022]
12
Urbanowicz RJ, Meeker M, La Cava W, Olson RS, Moore JH. Relief-based feature selection: Introduction and review. J Biomed Inform 2018; 85:189-203. [PMID: 30031057 PMCID: PMC6299836 DOI: 10.1016/j.jbi.2018.07.014] [Citation(s) in RCA: 314] [Impact Index Per Article: 52.3] [Received: 01/15/2018] [Revised: 06/29/2018] [Accepted: 07/14/2018] [Indexed: 01/25/2023]
Abstract
Feature selection plays a critical role in biomedical data mining, driven by increasing feature dimensionality in target problems and growing interest in advanced but computationally expensive methodologies able to model complex associations. Specifically, there is a need for feature selection methods that are computationally efficient, yet sensitive to complex patterns of association, e.g. interactions, so that informative features are not mistakenly eliminated prior to downstream modeling. This paper focuses on Relief-based algorithms (RBAs), a unique family of filter-style feature selection algorithms that have gained appeal by striking an effective balance between these objectives while flexibly adapting to various data characteristics, e.g. classification vs. regression. First, this work broadly examines types of feature selection and defines RBAs within that context. Next, we introduce the original Relief algorithm and associated concepts, emphasizing the intuition behind how it works, how feature weights generated by the algorithm can be interpreted, and why it is sensitive to feature interactions without evaluating combinations of features. Lastly, we include an expansive review of RBA methodological research beyond Relief and its popular descendant, ReliefF. In particular, we characterize branches of RBA research, and provide comparative summaries of RBA algorithms including contributions, strategies, functionality, time complexity, adaptation to key data characteristics, and software availability.
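The idea behind the original Relief algorithm mentioned above can be sketched in a few lines: for each target instance, find its nearest neighbour of the same class (the hit) and of the other class (the miss), then lower the weight of features that differ from the hit and raise the weight of features that differ from the miss. This is a simplified sketch under stated assumptions, not the ReBATE implementation: features are binary, every instance is used once as the target instead of a random sample, distance ties go to the first candidate, and the toy data are hypothetical.

```python
from itertools import product

def relief(X, y):
    """Relief-style feature weights for binary features and binary classes."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def distance(u, v):
        # Manhattan distance, which equals Hamming distance for 0/1 features.
        return sum(abs(a - b) for a, b in zip(u, v))

    for i in range(n):
        others = [j for j in range(n) if j != i]
        hit = min((j for j in others if y[j] == y[i]), key=lambda j: distance(X[i], X[j]))
        miss = min((j for j in others if y[j] != y[i]), key=lambda j: distance(X[i], X[j]))
        for f in range(d):
            weights[f] -= abs(X[i][f] - X[hit][f]) / n   # differing from the hit is penalised
            weights[f] += abs(X[i][f] - X[miss][f]) / n  # differing from the miss is rewarded
    return weights

# Hypothetical toy data: features 0 and 1 interact (class = XOR), feature 2 is irrelevant.
X = [list(row) for row in product([0, 1], repeat=3)]
y = [row[0] ^ row[1] for row in X]

weights = relief(X, y)
print(weights)
```

Neither interacting feature is associated with the class on its own, yet both end up with higher weights than the irrelevant feature, because the neighbour search compares whole instances without ever enumerating feature pairs.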
Affiliation(s)
- Ryan J Urbanowicz
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- William La Cava
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Randal S Olson
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jason H Moore
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
13
Urbanowicz RJ, Olson RS, Schmitt P, Meeker M, Moore JH. Benchmarking relief-based feature selection methods for bioinformatics data mining. J Biomed Inform 2018; 85:168-188. [PMID: 30030120 PMCID: PMC6299838 DOI: 10.1016/j.jbi.2018.07.015] [Citation(s) in RCA: 80] [Impact Index Per Article: 13.3] [Received: 01/15/2018] [Revised: 06/30/2018] [Accepted: 07/14/2018] [Indexed: 11/23/2022]
Abstract
Modern biomedical data mining requires feature selection methods that can (1) be applied to large scale feature spaces (e.g. 'omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) be flexibly adapted to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data) and (5) be computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms inspired by the 'Relief' algorithm, i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an open source framework called ReBATE (Relief-Based Algorithm Training Environment). We apply a comprehensive genetic simulation study comparing existing RBAs, a proposed RBA called MultiSURF, and other established feature selection methods, over a variety of problems. The results of this study (1) support the assertion that RBAs are particularly flexible, efficient, and powerful feature selection methods that differentiate relevant features having univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm the efficacy of expansions for classification vs. regression, discrete vs. continuous features, missing data, multiple classes, or class imbalance, (3) identify previously unknown limitations of specific RBAs, and (4) suggest that while MultiSURF* performs best for explicitly identifying pure 2-way interactions, MultiSURF yields the most reliable feature selection performance across a wide range of problem types.
Affiliation(s)
- Ryan J Urbanowicz
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Randal S Olson
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Peter Schmitt
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jason H Moore
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA