51
|
Piccolo SR, Lee TJ, Suh E, Hill K. ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data. Gigascience 2020; 9:giaa026. [PMID: 32249316 PMCID: PMC7131989 DOI: 10.1093/gigascience/giaa026] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/18/2019] [Revised: 12/05/2019] [Accepted: 02/28/2020] [Indexed: 11/27/2022] Open
Abstract
BACKGROUND Classification algorithms assign observations to groups based on patterns in data. The machine-learning community have developed myriad classification algorithms, which are used in diverse life science research domains. Algorithm choice can affect classification accuracy dramatically, so it is crucial that researchers optimize the choice of which algorithm(s) to apply in a given research domain on the basis of empirical evidence. In benchmark studies, multiple algorithms are applied to multiple datasets, and the researcher examines overall trends. In addition, the researcher may evaluate multiple hyperparameter combinations for each algorithm and use feature selection to reduce data dimensionality. Although software implementations of classification algorithms are widely available, robust benchmark comparisons are difficult to perform when researchers wish to compare algorithms that span multiple software packages. Programming interfaces, data formats, and evaluation procedures differ across software packages; and dependency conflicts may arise during installation. FINDINGS To address these challenges, we created ShinyLearner, an open-source project for integrating machine-learning packages into software containers. ShinyLearner provides a uniform interface for performing classification, irrespective of the library that implements each algorithm, thus facilitating benchmark comparisons. In addition, ShinyLearner enables researchers to optimize hyperparameters and select features via nested cross-validation; it tracks all nested operations and generates output files that make these steps transparent. ShinyLearner includes a Web interface to help users more easily construct the commands necessary to perform benchmark comparisons. ShinyLearner is freely available at https://github.com/srp33/ShinyLearner. CONCLUSIONS This software is a resource to researchers who wish to benchmark multiple classification or feature-selection algorithms on a given dataset. 
We hope it will serve as an example of combining the benefits of software containerization with a user-friendly approach.
Affiliation(s)
- Stephen R Piccolo
- Department of Biology, Brigham Young University, 4102 Life Sciences Building, Provo, UT, 84602, USA
| | - Terry J Lee
- Department of Biology, Brigham Young University, 4102 Life Sciences Building, Provo, UT, 84602, USA
| | - Erica Suh
- Department of Biology, Brigham Young University, 4102 Life Sciences Building, Provo, UT, 84602, USA
| | - Kimball Hill
- Department of Biology, Brigham Young University, 4102 Life Sciences Building, Provo, UT, 84602, USA
|
52
|
Hammami M, Bechikh S, Louati A, Makhlouf M, Said LB. Feature construction as a bi-level optimization problem. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-04784-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/25/2022]
|
53
|
Yoo TK, Ryu IH, Choi H, Kim JK, Lee IS, Kim JS, Lee G, Rim TH. Explainable Machine Learning Approach as a Tool to Understand Factors Used to Select the Refractive Surgery Technique on the Expert Level. Transl Vis Sci Technol 2020; 9:8. [PMID: 32704414 PMCID: PMC7346876 DOI: 10.1167/tvst.9.2.8] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 07/25/2019] [Accepted: 11/18/2019] [Indexed: 12/23/2022] Open
Abstract
Purpose Recently, laser refractive surgery options, including laser epithelial keratomileusis, laser in situ keratomileusis, and small incision lenticule extraction, have successfully improved patients' quality of life. Evidence-based recommendation of an optimal surgery technique is valuable in increasing patient satisfaction. We developed an interpretable multiclass machine learning model that selects the laser surgery option at the expert level. Methods A multiclass XGBoost model was constructed to classify patients into four categories: laser epithelial keratomileusis, laser in situ keratomileusis, small incision lenticule extraction, and contraindication groups. The analysis included 18,480 subjects who intended to undergo refractive surgery at the B&VIIT Eye center. Training (n = 10,561) and internal validation (n = 2640) were performed using subjects who visited between 2016 and 2017. The model was trained based on clinical decisions of highly experienced experts and ophthalmic measurements. External validation (n = 5279) was conducted using subjects who visited in 2018. The SHapley Additive exPlanations technique was adopted to explain the output of the XGBoost model. Results The multiclass XGBoost model exhibited an accuracy of 81.0% and 78.9% when tested on the internal and external validation datasets, respectively. The SHapley Additive exPlanations explanations for the results were consistent with prior knowledge from ophthalmologists. The explanations from one-versus-one and one-versus-rest XGBoost classifiers were effective in helping users easily understand the multicategorical classification problem. Conclusions This study suggests an expert-level multiclass machine learning model for selecting the refractive surgery for patients. It also provided a clinical understanding of a multiclass problem based on an explainable artificial intelligence technique. 
Translational Relevance Explainable machine learning exhibits a promising future for increasing the practical use of artificial intelligence in ophthalmic clinics.
Affiliation(s)
- Tae Keun Yoo
- Department of Ophthalmology, Aerospace Medical Center, Republic of Korea Air Force, Cheongju, South Korea
| | - Tyler Hyungtaek Rim
- Singapore Eye Research Institute, Singapore National Eye Centre, Duke-NUS Medical School, Singapore, Singapore
|
54
|
Xu Z, Chou J, Zhang XS, Luo Y, Isakova T, Adekkanattu P, Ancker JS, Jiang G, Kiefer RC, Pacheco JA, Rasmussen LV, Pathak J, Wang F. Identifying sub-phenotypes of acute kidney injury using structured and unstructured electronic health record data with memory networks. J Biomed Inform 2020; 102:103361. [PMID: 31911172 DOI: 10.1016/j.jbi.2019.103361] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 04/02/2019] [Revised: 11/18/2019] [Accepted: 12/16/2019] [Indexed: 01/08/2023]
Abstract
Acute Kidney Injury (AKI) is a common clinical syndrome characterized by the rapid loss of kidney excretory function, which aggravates the clinical severity of other diseases in a large number of hospitalized patients. Accurate early prediction of AKI can enable timely interventions and treatments. However, AKI is highly heterogeneous; thus, identification of AKI sub-phenotypes can lead to an improved understanding of the disease pathophysiology and development of more targeted clinical interventions. This study used a memory network-based deep learning approach to discover AKI sub-phenotypes using structured and unstructured electronic health record (EHR) data of patients before AKI diagnosis. We leveraged a real-world critical care EHR corpus including 37,486 ICU stays. Our approach identified three distinct sub-phenotypes. Sub-phenotype I had an average age of 63.03±17.25 years and was characterized by mild loss of kidney excretory function (Serum Creatinine (SCr) 1.55±0.34 mg/dL, estimated Glomerular Filtration Rate Test (eGFR) 107.65±54.98 mL/min/1.73 m²). These patients are more likely to develop stage I AKI. Sub-phenotype II had an average age of 66.81±10.43 years and was characterized by severe loss of kidney excretory function (SCr 1.96±0.49 mg/dL, eGFR 82.19±55.92 mL/min/1.73 m²). These patients are more likely to develop stage III AKI. Sub-phenotype III had an average age of 65.07±11.32 years and was characterized by moderate loss of kidney excretory function, and these patients are thus more likely to develop stage II AKI (SCr 1.69±0.32 mg/dL, eGFR 93.97±56.53 mL/min/1.73 m²). Both SCr and eGFR are significantly different across the three sub-phenotypes according to statistical testing plus post hoc analysis, and the conclusion still holds after age adjustment.
Affiliation(s)
- Yuan Luo
- Northwestern University, Chicago, IL, USA
| | - Fei Wang
- Weill Cornell Medicine, New York, NY, USA.
|
55
|
Dragonfly Algorithm: Theory, Literature Review, and Application in Feature Selection. NATURE-INSPIRED OPTIMIZERS 2020. [DOI: 10.1007/978-3-030-12127-3_4] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Indexed: 01/23/2023]
|
56
|
Wnt/β-Catenin, Carbohydrate Metabolism, and PI3K-Akt Signaling Pathway-Related Genes as Potential Cancer Predictors. JOURNAL OF HEALTHCARE ENGINEERING 2019; 2019:9724589. [PMID: 31781361 PMCID: PMC6855054 DOI: 10.1155/2019/9724589] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/17/2019] [Accepted: 09/17/2019] [Indexed: 01/07/2023]
Abstract
Predicting the outcome after a cancer diagnosis is critical. Advances in high-throughput sequencing technologies provide physicians with vast amounts of data, yet prognostication remains challenging because the data are high-dimensional and complex. We evaluated Wnt/β-catenin, carbohydrate metabolism, and PI3K-Akt signaling pathway-related genes as predictive features for classifying tumors and normal samples. Using differentially expressed genes as controls, these pathway-related genes were assessed for accuracy using support-vector machines and three other recommended machine learning models, namely, the random forest, decision tree, and k-nearest neighbor algorithms. The first two outperformed the others. All candidate pathway-related genes yielded areas under the curve exceeding 95.00% for cancer outcomes, and they were most accurate in predicting colorectal cancer. These results suggest that these pathway-related genes are useful and accurate biomarkers for understanding the mechanisms behind cancer development.
|
57
|
Bir-Jmel A, Douiri SM, Elbernoussi S. Gene Selection via a New Hybrid Ant Colony Optimization Algorithm for Cancer Classification in High-Dimensional Data. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2019; 2019:7828590. [PMID: 31737086 PMCID: PMC6815598 DOI: 10.1155/2019/7828590] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 02/04/2019] [Revised: 08/14/2019] [Accepted: 09/09/2019] [Indexed: 11/18/2022]
Abstract
The recent advance in the microarray data analysis makes it easy to simultaneously measure the expression levels of several thousand genes. These levels can be used to distinguish cancerous tissues from normal ones. In this work, we are interested in gene expression data dimension reduction for cancer classification, which is a common task in most microarray data analysis studies. This reduction has an essential role in enhancing the accuracy of the classification task and helping biologists accurately predict cancer in the body; this is carried out by selecting a small subset of relevant genes and eliminating the redundant or noisy genes. In this context, we propose a hybrid approach (MWIS-ACO-LS) for the gene selection problem, based on the combination of a new graph-based approach for gene selection (MWIS), in which we seek to minimize the redundancy between genes by considering the correlation between the latter and maximize gene-ranking (Fisher) scores, and a modified ACO coupled with a local search (LS) algorithm using the classifier 1NN for measuring the quality of the candidate subsets. In order to evaluate the proposed method, we tested MWIS-ACO-LS on ten well-replicated microarray datasets of high dimensions varying from 2308 to 12600 genes. The experimental results based on ten high-dimensional microarray classification problems demonstrated the effectiveness of our proposed method.
Affiliation(s)
- Ahmed Bir-Jmel
- Laboratory of Mathematics, Computer Science & Applications-Security of Information, Department of Mathematics, Faculty of Sciences, Mohammed V University, Rabat, Morocco
| | - Sidi Mohamed Douiri
- Laboratory of Mathematics, Computer Science & Applications-Security of Information, Department of Mathematics, Faculty of Sciences, Mohammed V University, Rabat, Morocco
| | - Souad Elbernoussi
- Laboratory of Mathematics, Computer Science & Applications-Security of Information, Department of Mathematics, Faculty of Sciences, Mohammed V University, Rabat, Morocco
|
58
|
Heitz F, Kommoss S, Tourani R, Grandelis A, Uppendahl L, Aliferis C, Burges A, Wang C, Canzler U, Wang J, Belau A, Prader S, Hanker L, Ma S, Ataseven B, Hilpert F, Schneider S, Sehouli J, Kimmig R, Kurzeder C, Schmalfeldt B, Braicu EI, Harter P, Dowdy SC, Winterhoff BJ, Pfisterer J, du Bois A. Dilution of Molecular-Pathologic Gene Signatures by Medically Associated Factors Might Prevent Prediction of Resection Status After Debulking Surgery in Patients With Advanced Ovarian Cancer. Clin Cancer Res 2019; 26:213-219. [PMID: 31527166 DOI: 10.1158/1078-0432.ccr-19-1741] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 05/30/2019] [Revised: 08/08/2019] [Accepted: 09/11/2019] [Indexed: 11/16/2022]
Abstract
PURPOSE Predicting surgical outcome could improve individualizing treatment strategies for patients with advanced ovarian cancer. It has been suggested earlier that gene expression signatures (GES) might harbor the potential to predict surgical outcome. EXPERIMENTAL DESIGN Data derived from high-grade serous tumor tissue of FIGO stage IIIC/IV patients of the AGO-OVAR11 trial were used to generate a transcriptome profile. Previously identified molecular signatures were tested. A theoretical model was implemented to evaluate the impact of medically associated factors for residual disease (RD) on the performance of GES that predict RD status. RESULTS A total of 266 patients met inclusion criteria; of those, 39.1% underwent complete resection. Previously reported GES did not predict RD in this cohort. Similarly, The Cancer Genome Atlas molecular subtypes, an independent de novo signature, and the total gene expression dataset using all 21,000 genes were not able to predict RD status. Medical reasons for RD were identified as potential limiting factors that impact the ability to use GES to predict RD. In a center with high complete resection rates, a GES that perfectly predicted tumor biological RD would have a performance of only AUC 0.83, due to reasons other than tumor biology. CONCLUSIONS Previously identified GES cannot be generalized. Medically associated factors for RD may be the main obstacle to predicting surgical outcome in an all-comer population of patients with advanced ovarian cancer. If biomarkers derived from tumor tissue are used to predict outcome of patients with cancer, selection bias should be focused on to prevent overestimation of the power of such a biomarker. See related commentary by Handley and Sood, p. 9.
Affiliation(s)
- Florian Heitz
- Department of Gynecology and Gynecologic Oncology, Kliniken-Essen-Mitte, Germany; Charité - Universitätsmedizin Berlin, Humboldt-Universität zu Berlin, Berlin Institute of Health, Department of Gynecology, Berlin, Germany; AGO Study Group
| | - Stefan Kommoss
- AGO Study Group; Department of Women's Health, Tuebingen University Hospital, Tuebingen, Germany
| | - Roshan Tourani
- Institute for Health Informatics (IHI), Academic Health Center, University of Minnesota, Minneapolis, Minnesota
| | - Anthony Grandelis
- Department of Gynecology, Obstetrics and Women's Health, Division of Gynecologic Oncology, University of Minnesota, Minneapolis, Minnesota
| | - Locke Uppendahl
- Department of Gynecology, Obstetrics and Women's Health, Division of Gynecologic Oncology, University of Minnesota, Minneapolis, Minnesota
| | - Constantin Aliferis
- Institute for Health Informatics (IHI), Academic Health Center, University of Minnesota, Minneapolis, Minnesota
| | - Alexander Burges
- AGO Study Group; Department of Obstetrics and Gynecology, University Hospital, LMU Munich, Germany
| | - Chen Wang
- Division of Gynecologic Surgery, Department of Obstetrics and Gynecology; Mayo Clinic, Rochester, Minnesota
| | - Ulrich Canzler
- AGO Study Group; Department of Gynecology and Obstetrics, Technische Universität Dresden, Dresden, Germany
| | - Jinhua Wang
- Institute for Health Informatics (IHI), Academic Health Center, University of Minnesota, Minneapolis, Minnesota
| | - Antje Belau
- AGO Study Group; Ernst Moritz Arndt Universität Greifswald - Klinik und Poliklinik für Frauenheilkunde und Geburtshilfe, Greifswald, Germany
| | - Sonia Prader
- Department of Gynecology and Gynecologic Oncology, Kliniken-Essen-Mitte, Germany
| | - Lars Hanker
- AGO Study Group; Klinik für Frauenheilkunde und Geburtshilfe, University of Schleswig-Holstein, Lübeck, Germany
| | - Sisi Ma
- Institute for Health Informatics (IHI), Academic Health Center, University of Minnesota, Minneapolis, Minnesota
| | - Beyhan Ataseven
- Department of Gynecology and Gynecologic Oncology, Kliniken-Essen-Mitte, Germany; Department of Obstetrics and Gynecology, University Hospital, LMU Munich, Germany
| | - Felix Hilpert
- AGO Study Group; Krankenhaus Jerusalem Hamburg, Hamburg, Germany
| | - Stephanie Schneider
- Department of Gynecology and Gynecologic Oncology, Kliniken-Essen-Mitte, Germany
| | - Jalid Sehouli
- Charité - Universitätsmedizin Berlin, Humboldt-Universität zu Berlin, Berlin Institute of Health, Department of Gynecology, Berlin, Germany
| | - Rainer Kimmig
- AGO Study Group; Department of Gynecology and Obstetrics, University of Duisburg-Essen, Essen, Germany
| | - Christian Kurzeder
- AGO Study Group; Universitätsspital Basel, Basel, Switzerland; Department of Obstetrics and Gynecology, University of Ulm, Ulm, Germany
| | - Barbara Schmalfeldt
- AGO Study Group; Technical University of Munich - Klinikum rechts der Isar, Munich, Germany; Department of Gynecology and Obstetrics, Technical University of Munich, Munich, Germany
| | - Elena I Braicu
- Charité - Universitätsmedizin Berlin, Humboldt-Universität zu Berlin, Berlin Institute of Health, Department of Gynecology, Berlin, Germany
| | - Philipp Harter
- Department of Gynecology and Gynecologic Oncology, Kliniken-Essen-Mitte, Germany; AGO Study Group
| | - Sean C Dowdy
- Division of Gynecologic Surgery, Department of Obstetrics and Gynecology; Mayo Clinic, Rochester, Minnesota
| | - Boris J Winterhoff
- Department of Gynecology, Obstetrics and Women's Health, Division of Gynecologic Oncology, University of Minnesota, Minneapolis, Minnesota
| | - Andreas du Bois
- Department of Gynecology and Gynecologic Oncology, Kliniken-Essen-Mitte, Germany; AGO Study Group
|
59
|
Qayyum A, Saeed Malik A, Saad NM, Iqbal M, Abdullah MF, Rasheed W, Abdullah TABR, Bin Jafaar MY. Image classification based on sparse-coded features using sparse coding technique for aerial imagery: a hybrid dictionary approach. Neural Comput Appl 2019. [DOI: 10.1007/s00521-017-3300-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
|
60
|
Bang S, Yoo D, Kim SJ, Jhang S, Cho S, Kim H. Establishment and evaluation of prediction model for multiple disease classification based on gut microbial data. Sci Rep 2019; 9:10189. [PMID: 31308384 PMCID: PMC6629854 DOI: 10.1038/s41598-019-46249-x] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 10/05/2018] [Accepted: 04/12/2019] [Indexed: 12/17/2022] Open
Abstract
Disease prediction has been performed by machine learning approaches with various biological data. One representative data type is the gut microbial community, which interacts with the host's immune system. The abundance of a few microorganisms has been used as a marker to predict diverse diseases. In this study, we hypothesized that multi-classification using a machine learning approach could distinguish the gut microbiomes of the following six diseases: multiple sclerosis, juvenile idiopathic arthritis, myalgic encephalomyelitis/chronic fatigue syndrome, acquired immune deficiency syndrome, stroke, and colorectal cancer. We used the abundance of microorganisms at five taxonomy levels as features in 696 samples collected from different studies to establish the best prediction model. We built classification models based on four multi-class classifiers and two feature selection methods: forward selection and backward elimination. We found that classification performance improved as we used lower taxonomy levels of features; the highest performance was observed at the genus level. Among the four classifiers, the LogitBoost-based prediction model outperformed the others. We also suggested the optimal feature subsets at the genus level obtained by backward elimination. We believe the selected feature subsets could be used as markers to distinguish various diseases simultaneously. The findings of this study suggest the potential use of the selected features for the diagnosis of several diseases.
Affiliation(s)
- Sohyun Bang
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-742, Republic of Korea
- C&K genomics, Seoul National University Research Park, Seoul, 151-919, Republic of Korea
| | - DongAhn Yoo
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-742, Republic of Korea
| | - Soo-Jin Kim
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
| | - Soyun Jhang
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-742, Republic of Korea
- C&K genomics, Seoul National University Research Park, Seoul, 151-919, Republic of Korea
| | - Seoae Cho
- C&K genomics, Seoul National University Research Park, Seoul, 151-919, Republic of Korea
| | - Heebal Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-742, Republic of Korea.
- C&K genomics, Seoul National University Research Park, Seoul, 151-919, Republic of Korea.
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea.
|
61
|
Lin X, Huang X, Zhou L, Ren W, Zeng J, Yao W, Wang X. The Robust Classification Model Based on Combinatorial Features. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:650-657. [PMID: 29990202 DOI: 10.1109/tcbb.2017.2779512] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 06/08/2023]
Abstract
Analyzing the disease data from the view of combinatorial features may better characterize the disease phenotype. In this study, a novel method is proposed to construct feature combinations and a classification model (CFC-CM) by mining key feature relationships. CFC-CM iteratively tests for differences in the feature relationship between different groups. To do this, it uses a modified $k$-top-scoring pair (M-$k$-TSP) algorithm and then selects the most discriminative feature pairs in the current feature set to infer the combinatorial features and build the classification model. Compared with support vector machines, random forests, least absolute shrinkage and selection operator, elastic net, and M-$k$-TSP, the superior performance of CFC-CM on nine public gene expression datasets validates its potential for more precise identification of complex diseases. Subsequently, CFC-CM was applied to two metabolomics datasets; it obtained accuracy rates of $88.73\pm 2.06\%$ and $79.11\pm 2.70\%$ in distinguishing between hepatocellular carcinoma and hepatic cirrhosis groups and between acute kidney injury (AKI) and non-AKI samples, results superior to those of the other five methods. In summary, the better results of CFC-CM show that, in contrast to molecules and combinations constituted by just two features, the combinations inferred from an appropriate number of features could better identify the complex diseases.
|
62
|
Sumsion GR, Bradshaw MS, Hill KT, Pinto LD, Piccolo SR. Remote sensing tree classification with a multilayer perceptron. PeerJ 2019; 7:e6101. [PMID: 30842894 PMCID: PMC6397751 DOI: 10.7717/peerj.6101] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 06/01/2018] [Accepted: 11/12/2018] [Indexed: 11/20/2022] Open
Abstract
To accelerate scientific progress on remote tree classification (as well as biodiversity and ecology sampling), The National Institute of Science and Technology created a community-based competition where scientists were invited to contribute informatics methods for classifying tree species and genus using crown-level images of trees. We classified tree species and genus at the pixel level using hyperspectral and LiDAR observations. We compared three algorithms that have been implemented extensively across a broad range of research applications: support vector machines, random forests, and multilayer perceptron. At the pixel level, the multilayer perceptron algorithm classified species or genus with high accuracy (92.7% and 95.9%, respectively) on the training data and performed better than the other two algorithms (85.8-93.5%). This indicates promise for the use of the multilayer perceptron (MLP) algorithm for tree-species classification based on hyperspectral and LiDAR observations and coincides with a growing body of research in which neural network-based algorithms outperform other types of classification algorithm for machine vision. To aggregate patterns across the images, we used an ensemble approach that averages the pixel-level outputs of the MLP algorithm to classify species at the crown level. The average accuracy of these classifications on the test set was 68.8% for the nine species.
Affiliation(s)
- G Rex Sumsion
- Department of Biology, Brigham Young University, Provo, UT, United States of America
| | - Michael S. Bradshaw
- Department of Biology, Brigham Young University, Provo, UT, United States of America
| | - Kimball T. Hill
- Department of Biology, Brigham Young University, Provo, UT, United States of America
| | - Lucas D.G. Pinto
- Department of Biology, Brigham Young University, Provo, UT, United States of America
| | - Stephen R. Piccolo
- Department of Biology, Brigham Young University, Provo, UT, United States of America
|
63
|
Kang C, Huo Y, Xin L, Tian B, Yu B. Feature selection and tumor classification for microarray data using relaxed Lasso and generalized multi-class support vector machine. J Theor Biol 2019; 463:77-91. [DOI: 10.1016/j.jtbi.2018.12.010] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Received: 06/02/2018] [Revised: 11/03/2018] [Accepted: 12/06/2018] [Indexed: 02/08/2023]
|
64
|
Abstract
Recently, many neural network models have been successfully applied to histopathological analysis, including cancer classification. While some of them reach human-expert-level accuracy in classifying cancers, most have to be treated as black boxes that offer no explanation of how they arrived at their decisions. This lack of transparency may hinder further applications of neural networks in realistic clinical settings, where not only the decision but also its explainability is important. This study proposes a transparent neural network that complements its classification decisions with visual information about the given problem. The auxiliary visual information allows the user to understand, to some extent, how the neural network arrives at its decision. The transparency potentially increases the usability of neural networks in realistic histopathological analysis. In the experiment, the accuracy of the proposed neural network is compared against some existing classifiers, and the visual information is compared against some dimensionality reduction methods.
|
65
|
Mafarja M, Aljarah I, Heidari AA, Faris H, Fournier-Viger P, Li X, Mirjalili S. Binary dragonfly optimization for feature selection using time-varying transfer functions. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.08.003] [Citation(s) in RCA: 246] [Impact Index Per Article: 35.1] [Indexed: 11/25/2022]
|
66
|
Caglar MU, Hockenberry AJ, Wilke CO. Predicting bacterial growth conditions from mRNA and protein abundances. PLoS One 2018; 13:e0206634. [PMID: 30388153 PMCID: PMC6214550 DOI: 10.1371/journal.pone.0206634] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/23/2018] [Accepted: 10/16/2018] [Indexed: 01/30/2023] Open
Abstract
Cells respond to changing nutrient availability and external stresses by altering the expression of individual genes. Condition-specific gene expression patterns may thus provide a promising and low-cost route to quantifying the presence of various small molecules, toxins, or species-interactions in natural environments. However, whether gene expression signatures alone can predict individual environmental growth conditions remains an open question. Here, we used machine learning to predict 16 closely-related growth conditions using 155 datasets of E. coli transcript and protein abundances. We show that models are able to discriminate between different environmental features with a relatively high degree of accuracy. We observed a small but significant increase in model accuracy by combining transcriptome and proteome-level data, and we show that measurements from stationary phase cells typically provide less useful information for discriminating between conditions as compared to exponentially growing populations. Nevertheless, with sufficient training data, gene expression measurements from a single species are capable of distinguishing between environmental conditions that are separated by a single environmental variable.
Affiliation(s)
- M. Umut Caglar
- Department of Integrative Biology, The University of Texas at Austin, Austin, Texas, United States of America
| | - Adam J. Hockenberry
- Department of Integrative Biology, The University of Texas at Austin, Austin, Texas, United States of America
| | - Claus O. Wilke
- Department of Integrative Biology, The University of Texas at Austin, Austin, Texas, United States of America
| |
|
67
|
Yoo TK, Choi JY, Seo JG, Ramasubramanian B, Selvaperumal S, Kim DW. The possibility of the combination of OCT and fundus images for improving the diagnostic accuracy of deep learning for age-related macular degeneration: a preliminary experiment. Med Biol Eng Comput 2018; 57:677-687. [DOI: 10.1007/s11517-018-1915-z] [Citation(s) in RCA: 71] [Impact Index Per Article: 10.1] [Received: 12/11/2017] [Accepted: 10/09/2018] [Indexed: 12/23/2022]
|
68
|
Armañanzas R. Revealing post-transcriptional microRNA-mRNA regulations in Alzheimer's disease through ensemble graphs. BMC Genomics 2018; 19:668. [PMID: 30255799 PMCID: PMC6157163 DOI: 10.1186/s12864-018-5025-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND In silico investigations on the integration of multiple datasets are in need of higher statistical power methods to unveil secondary findings that were hidden from the initial analyses. We present here a novel method for the network analysis of messenger RNA post-translational regulation by microRNA molecules. The method integrates expression data and sequence binding predictions through a set of sound machine learning techniques, forwarding all results to an ensemble graph of regulations. RESULTS Bayesian network classifiers are induced based on a pool of ensemble graphs with ascending order of complexity. Individual goodness-of-fit and classification performances are evaluated for each learned model. As a testbed, four Alzheimer's disease datasets are integrated using the new approach, achieving top values of 0.9794 ± 0.01 for the area under the receiver operating characteristic curve and 0.9439 ± 0.0234 for the prediction accuracy. CONCLUSIONS Post-transcriptional regulations found by the optimal network classifier concur with previous literature findings. Furthermore, additional network structures suggest previously unreported regulations in the state of the art of Alzheimer's research. The quantitative performance as well as sound biological findings provide confidence in the ensemble approach and encourage similar integrative analyses for other conditions.
Affiliation(s)
- Rubén Armañanzas
- Department of Bioengineering, Krasnow Institute for Advanced Study, George Mason University, 4400 University Dr, MS2A1, Fairfax, 22030, VA, USA.
| |
|
69
|
Jadhav S, He H, Jenkins K. Information gain directed genetic algorithm wrapper feature selection for credit rating. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2018.04.033] [Citation(s) in RCA: 140] [Impact Index Per Article: 20.0] [Indexed: 11/28/2022]
|
70
|
Tsamardinos I, Greasidou E, Borboudakis G. Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. Mach Learn 2018; 107:1895-1922. [PMID: 30393425 PMCID: PMC6191021 DOI: 10.1007/s10994-018-5714-4] [Citation(s) in RCA: 84] [Impact Index Per Article: 12.0] [Received: 08/03/2017] [Accepted: 04/21/2018] [Indexed: 12/26/2022]
Abstract
Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822-829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we employ again the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop training of models on new folds of inferior (with high probability) configurations. We name the method Bootstrap Bias Corrected with Dropping CV (BBCD-CV) that is both efficient and provides accurate performance estimates.
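The bias-correction idea in this abstract can be illustrated with a minimal sketch. It assumes the pooled out-of-sample predictions of every configuration are already available as a matrix, and it uses accuracy as the metric. The function names `bbc_cv` and `naive_cv`, the number of bootstrap rounds, and the random toy data are illustrative assumptions, not taken from the paper or its software.

```python
import numpy as np

def naive_cv(oos_pred, y):
    """Accuracy of the best configuration on all samples (optimistically biased)."""
    correct = np.asarray(oos_pred) == np.asarray(y)[:, None]
    return float(correct.mean(axis=0).max())

def bbc_cv(oos_pred, y, n_boot=200, seed=0):
    """Bootstrap bias-corrected accuracy of the configuration-selection process.

    oos_pred: (n_samples, n_configs) pooled out-of-sample predicted labels
    y:        (n_samples,) true labels
    No model is retrained; only the stored predictions are resampled.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    correct = (np.asarray(oos_pred) == y[:, None]).astype(float)
    n = len(y)
    scores = []
    for _ in range(n_boot):
        in_bag = rng.integers(0, n, size=n)               # sample rows with replacement
        out_bag = np.setdiff1d(np.arange(n), in_bag)      # rows never drawn
        if out_bag.size == 0:
            continue
        best = np.argmax(correct[in_bag].mean(axis=0))    # select on in-bag rows
        scores.append(correct[out_bag, best].mean())      # evaluate on held-out rows
    return float(np.mean(scores))

# Toy data (assumption): 100 configurations that all guess at chance level.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
oos_pred = rng.integers(0, 2, size=(200, 100))
naive = naive_cv(oos_pred, y)
corrected = bbc_cv(oos_pred, y)
print(f"naive best-configuration accuracy: {naive:.3f}")
print(f"bootstrap bias-corrected accuracy: {corrected:.3f}")
```

On chance-level predictions the naive estimate should sit noticeably above 0.5 simply because the maximum over many configurations is reported, while the corrected estimate should fall closer to 0.5.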
Affiliation(s)
- Ioannis Tsamardinos
- Computer Science Department, University of Crete and Gnosis Data Analysis PC, Heraklion, Greece
| | - Elissavet Greasidou
- Computer Science Department, University of Crete and Gnosis Data Analysis PC, Heraklion, Greece
| | - Giorgos Borboudakis
- Computer Science Department, University of Crete and Gnosis Data Analysis PC, Heraklion, Greece
| |
|
71
|
Affiliation(s)
- Meng Pan
- Department of Optoelectronic Engineering, College of Science and Engineering, Jinan University, Guangzhou, Guangdong, PR China
| | - Jie Zhang
- Department of Physics, College of Science and Engineering, Jinan University, Guangzhou, Guangdong, PR China
| |
|
72
|
Wang A, An N, Chen G, Liu L, Alterovitz G. Subtype dependent biomarker identification and tumor classification from gene expression profiles. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.01.025] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 10/18/2022]
|
73
|
|
74
|
Peeken JC, Goldberg T, Knie C, Komboz B, Bernhofer M, Pasa F, Kessel KA, Tafti PD, Rost B, Nüsslin F, Braun AE, Combs SE. Treatment-related features improve machine learning prediction of prognosis in soft tissue sarcoma patients. Strahlenther Onkol 2018; 194:824-834. [PMID: 29557486 DOI: 10.1007/s00066-018-1294-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 10/05/2017] [Accepted: 03/05/2018] [Indexed: 12/01/2022]
Abstract
BACKGROUND AND PURPOSE Current prognostic models for soft tissue sarcoma (STS) patients are solely based on staging information. Treatment-related data have not been included to date. Including such information, however, could help to improve these models. MATERIALS AND METHODS A single-center retrospective cohort of 136 STS patients treated with radiotherapy (RT) was analyzed for patients' characteristics, staging information, and treatment-related data. Therapeutic imaging studies and pathology reports of neoadjuvantly treated patients were analyzed for signs of response. Random forest machine learning-based models were used to predict patients' death and disease progression at 2 years. Pre-treatment and treatment models were compared. RESULTS The prognostic models achieved high performances. Using treatment features improved the overall performance for all three classification types: prediction of death, and of local and systemic progression (area under the receiver operating characteristic curve (AUC) of 0.87, 0.88, and 0.84, respectively). Overall, RT-related features, such as the planning target volume and total dose, had preeminent importance for prognostic performance. Therapy response features were selected for prediction of disease progression. CONCLUSIONS A machine learning-based prognostic model combining known prognostic factors with treatment- and response-related information showed high accuracy for individualized risk assessment. This model could be used for adjustments of follow-up procedures.
Affiliation(s)
- Jan C Peeken
- Department of Radiation Oncology, Klinikum rechts der Isar, Technical University of Munich (TUM), Ismaninger Straße 22, 81675, Munich, Germany.,Partner Site Munich, Deutsches Konsortium für Translationale Krebsforschung (DKTK), Munich, Germany.
| | | | - Christoph Knie
- Department of Radiation Oncology, Klinikum rechts der Isar, Technical University of Munich (TUM), Ismaninger Straße 22, 81675, Munich, Germany
| | - Basil Komboz
- Allianz SE, Königinstraße 28, 80802, Munich, Germany
| | - Michael Bernhofer
- Department for Bioinformatics and Computational Biology, Informatik 12, Technical University of Munich (TUM), Boltzmannstraße 3, 85748, Garching, Germany
| | - Francesco Pasa
- Department of Computer Science, Informatik 9, Technical University of Munich (TUM), Boltzmannstraße 3, 85748, Garching, Germany.,Chair of Biomedical Physics, Department of Physics, Technical University of Munich (TUM), James-Franck-Straße 1, 85748, Garching, Germany
| | - Kerstin A Kessel
- Department of Radiation Oncology, Klinikum rechts der Isar, Technical University of Munich (TUM), Ismaninger Straße 22, 81675, Munich, Germany.,Institute of Innovative Radiotherapy (iRT), Department of Radiation Sciences (DRS), Helmholtz Zentrum München, Ingolstaedter Landstraße 1, 85764, Neuherberg, Germany.,Partner Site Munich, Deutsches Konsortium für Translationale Krebsforschung (DKTK), Munich, Germany
| | - Pouya D Tafti
- Allianz SE, Königinstraße 28, 80802, Munich, Germany
| | - Burkhard Rost
- Department for Bioinformatics and Computational Biology, Informatik 12, Technical University of Munich (TUM), Boltzmannstraße 3, 85748, Garching, Germany
| | - Fridtjof Nüsslin
- Department of Radiation Oncology, Klinikum rechts der Isar, Technical University of Munich (TUM), Ismaninger Straße 22, 81675, Munich, Germany
| | | | - Stephanie E Combs
- Department of Radiation Oncology, Klinikum rechts der Isar, Technical University of Munich (TUM), Ismaninger Straße 22, 81675, Munich, Germany.,Institute of Innovative Radiotherapy (iRT), Department of Radiation Sciences (DRS), Helmholtz Zentrum München, Ingolstaedter Landstraße 1, 85764, Neuherberg, Germany.,Partner Site Munich, Deutsches Konsortium für Translationale Krebsforschung (DKTK), Munich, Germany
| |
|
75
|
Gabryś HS, Buettner F, Sterzing F, Hauswald H, Bangert M. Design and Selection of Machine Learning Methods Using Radiomics and Dosiomics for Normal Tissue Complication Probability Modeling of Xerostomia. Front Oncol 2018; 8:35. [PMID: 29556480 PMCID: PMC5844945 DOI: 10.3389/fonc.2018.00035] [Citation(s) in RCA: 112] [Impact Index Per Article: 16.0] [Received: 11/21/2017] [Accepted: 02/01/2018] [Indexed: 01/13/2023] Open
Abstract
Purpose The purpose of this study is to investigate whether machine learning with dosiomic, radiomic, and demographic features allows for xerostomia risk assessment more precise than normal tissue complication probability (NTCP) models based on the mean radiation dose to parotid glands. Material and methods A cohort of 153 head-and-neck cancer patients was used to model xerostomia at 0–6 months (early), 6–15 months (late), 15–24 months (long-term), and at any time (a longitudinal model) after radiotherapy. Predictive power of the features was evaluated by the area under the receiver operating characteristic curve (AUC) of univariate logistic regression models. The multivariate NTCP models were tuned and tested with single and nested cross-validation, respectively. We compared predictive performance of seven classification algorithms, six feature selection methods, and ten data cleaning/class balancing techniques using the Friedman test and the Nemenyi post hoc analysis. Results NTCP models based on the parotid mean dose failed to predict xerostomia (AUCs < 0.60). The most informative predictors were found for late and long-term xerostomia. Late xerostomia correlated with the contralateral dose gradient in the anterior–posterior (AUC = 0.72) and the right–left (AUC = 0.68) direction, whereas long-term xerostomia was associated with parotid volumes (AUCs > 0.85), dose gradients in the right–left (AUCs > 0.78), and the anterior–posterior (AUCs > 0.72) direction. Multivariate models of long-term xerostomia were typically based on the parotid volume, the parotid eccentricity, and the dose–volume histogram (DVH) spread with the generalization AUCs ranging from 0.74 to 0.88. On average, support vector machines and extra-trees were the top performing classifiers, whereas the algorithms based on logistic regression were the best choice for feature selection. We found no advantage in using data cleaning or class balancing methods. 
Conclusion We demonstrated that incorporation of organ- and dose-shape descriptors is beneficial for xerostomia prediction in highly conformal radiotherapy treatments. Due to strong reliance on patient-specific, dose-independent factors, our results underscore the need for development of personalized data-driven risk profiles for NTCP models of xerostomia. The facilitated machine learning pipeline is described in detail and can serve as a valuable reference for future work in radiomic and dosiomic NTCP modeling.
Affiliation(s)
- Hubert S Gabryś
- Department of Medical Physics in Radiation Oncology, German Cancer Research Center (DKFZ), Heidelberg, Germany.,Medical Faculty of Heidelberg, Heidelberg University, Heidelberg, Germany.,Heidelberg Institute for Radiation Oncology (HIRO), Heidelberg, Germany
| | - Florian Buettner
- Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
| | - Florian Sterzing
- Heidelberg Institute for Radiation Oncology (HIRO), Heidelberg, Germany.,Clinical Cooperation Unit Radiation Oncology, German Cancer Research Center (DKFZ), Heidelberg, Germany.,Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany
| | - Henrik Hauswald
- Heidelberg Institute for Radiation Oncology (HIRO), Heidelberg, Germany.,Clinical Cooperation Unit Radiation Oncology, German Cancer Research Center (DKFZ), Heidelberg, Germany.,Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany
| | - Mark Bangert
- Department of Medical Physics in Radiation Oncology, German Cancer Research Center (DKFZ), Heidelberg, Germany.,Heidelberg Institute for Radiation Oncology (HIRO), Heidelberg, Germany
| |
|
76
|
Mazumdar H, Kim TH, Lee JM, Ha JH, Ahrberg CD, Chung BG. Prediction analysis and quality assessment of microwell array images. Electrophoresis 2018; 39:948-956. [PMID: 29323408 DOI: 10.1002/elps.201700460] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/08/2017] [Revised: 12/29/2017] [Accepted: 12/29/2017] [Indexed: 11/11/2022]
Abstract
Microwell arrays are widely used for the analysis of fluorescent-labelled biomaterials. For rapid detection and automated analysis of microwell arrays, the computational image analysis is required. Support Vector Machines (SVM) can be used for this task. Here, we present a SVM-based approach for the analysis of microwell arrays consisting of three distinct steps: labeling, training for feature selection, and classification into three classes. The three classes are filled, partially filled, and unfilled microwells. Next, the partially filled wells are analyzed by SVM and their tendency towards filled or unfilled tested through applying a Gaussian filter. Through this, all microwells can be categorized as either filled or unfilled by our algorithm. Therefore, this SVM-based computational image analysis allows for an accurate and simple classification of microwell arrays.
Affiliation(s)
- Hirak Mazumdar
- Department of Biomedical Engineering, Sogang University, Seoul, Republic of Korea
| | - Tae Hyeon Kim
- Department of Mechanical Engineering, Sogang University, Seoul, Republic of Korea
| | - Jong Min Lee
- Department of Mechanical Engineering, Sogang University, Seoul, Republic of Korea
| | - Jang Ho Ha
- Department of Mechanical Engineering, Sogang University, Seoul, Republic of Korea
| | - Christian D Ahrberg
- Department of Mechanical Engineering, Sogang University, Seoul, Republic of Korea
| | - Bong Geun Chung
- Department of Mechanical Engineering, Sogang University, Seoul, Republic of Korea
| |
|
77
|
Bi-stage hierarchical selection of pathway genes for cancer progression using a swarm based computational approach. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.10.024] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 02/05/2023]
|
78
|
Gene selection from large-scale gene expression data based on fuzzy interactive multi-objective binary optimization for medical diagnosis. Biocybern Biomed Eng 2018. [DOI: 10.1016/j.bbe.2018.02.002] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 11/20/2022]
|
79
|
Mohammed A, Biegert G, Adamec J, Helikar T. CancerDiscover: an integrative pipeline for cancer biomarker and cancer class prediction from high-throughput sequencing data. Oncotarget 2017; 9:2565-2573. [PMID: 29416792 PMCID: PMC5788660 DOI: 10.18632/oncotarget.23511] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 09/28/2017] [Accepted: 12/09/2017] [Indexed: 11/25/2022] Open
Abstract
Accurate identification of cancer biomarkers and classification of cancer type and subtype from High Throughput Sequencing (HTS) data is a challenging problem because it requires manual processing of raw HTS data from various sequencing platforms, quality control, and normalization, which are both tedious and time-consuming. Machine learning techniques for cancer class prediction and biomarker discovery can hasten cancer detection and significantly improve prognosis. To date, great research efforts have been taken for cancer biomarker identification and cancer class prediction. However, currently available tools and pipelines lack flexibility in data preprocessing, running multiple feature selection methods and learning algorithms, therefore, developing a freely available and easy-to-use program is strongly demanded by researchers. Here, we propose CancerDiscover, an integrative open-source software pipeline that allows users to automatically and efficiently process large high-throughput raw datasets, normalize, and selects best performing features from multiple feature selection algorithms. Additionally, the integrative pipeline lets users apply different feature thresholds to identify cancer biomarkers and build various training models to distinguish different types and subtypes of cancer. The open-source software is available at https://github.com/HelikarLab/CancerDiscover and is free for use under the GPL3 license.
Affiliation(s)
- Akram Mohammed
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
| | - Greyson Biegert
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
| | - Jiri Adamec
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
| | - Tomáš Helikar
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
| |
|
80
|
Collaborative representation-based classification of microarray gene expression data. PLoS One 2017; 12:e0189533. [PMID: 29236759 PMCID: PMC5728509 DOI: 10.1371/journal.pone.0189533] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/23/2017] [Accepted: 11/27/2017] [Indexed: 11/19/2022] Open
Abstract
Microarray technology is important to simultaneously express multiple genes over a number of time points. Multiple classifier models, such as sparse representation (SR)-based method, have been developed to classify microarray gene expression data. These methods allocate the gene data points to different clusters. In this paper, we propose a novel collaborative representation (CR)-based classification with regularized least square to classify gene data. First, the CR codes a testing sample as a sparse linear combination of all training samples and then classifies the testing sample by evaluating which class leads to the minimum representation error. This CR-based classification approach is remarkably less complex than traditional classification methods but leads to very competitive classification results. In addition, compressive sensing approach is adopted to project the high-dimensional gene expression dataset to a lower-dimensional space which nearly contains the whole information. This compression without loss is beneficial to reduce the computational load. Experiments to detect subtypes of diseases, such as leukemia and autism spectrum disorders, are performed by analyzing the gene expression. The results show that the proposed CR-based algorithm exhibits significantly higher stability and accuracy than the traditional classifiers, such as support vector machine algorithm.
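The classification rule summarized in this abstract (code a test sample over all training samples with regularized least squares, then pick the class with the smallest representation error) can be sketched as follows. This is a rough illustration, not the authors' implementation: it assumes an L2 (ridge) penalty for the "regularized least square" step, and the function name `crc_predict`, the value of `lam`, and the toy data are hypothetical.

```python
import numpy as np

def crc_predict(X_train, labels, x_test, lam=0.01):
    """Assign x_test to the class whose training samples reconstruct it best.

    X_train: (n_samples, n_features) training matrix
    labels:  (n_samples,) class label of each training sample
    x_test:  (n_features,) sample to classify
    lam:     ridge penalty (assumed value, not from the paper)
    """
    A = np.asarray(X_train, dtype=float).T      # columns are training samples
    labels = np.asarray(labels)
    x = np.asarray(x_test, dtype=float)
    n = A.shape[1]
    # Closed-form ridge solution: alpha = (A^T A + lam I)^-1 A^T x
    alpha = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ x)
    best_class, best_err = None, np.inf
    for c in np.unique(labels):
        mask = labels == c
        # Representation error using only the coefficients of class c
        err = np.linalg.norm(x - A[:, mask] @ alpha[mask])
        if err < best_err:
            best_class, best_err = c, err
    return best_class

# Toy data (assumption): two classes lying near different coordinate axes.
X_train = np.array([[1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.1, 0.9, 0.0]])
labels = np.array([0, 0, 1, 1])
pred_a = crc_predict(X_train, labels, [1.0, 0.05, 0.0])
pred_b = crc_predict(X_train, labels, [0.05, 1.0, 0.0])
print(pred_a, pred_b)
```

The coefficients are solved once over all classes jointly and then split by class only for the error comparison, which is what keeps the method cheap compared with iterative sparse coding.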
|
81
|
Sutton EJ, Huang EP, Drukker K, Burnside ES, Li H, Net JM, Rao A, Whitman GJ, Zuley M, Ganott M, Bonaccio E, Giger ML, Morris EA. Breast MRI radiomics: comparison of computer- and human-extracted imaging phenotypes. Eur Radiol Exp 2017; 1:22. [PMID: 29708200 PMCID: PMC5909355 DOI: 10.1186/s41747-017-0025-2] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 07/06/2017] [Accepted: 09/19/2017] [Indexed: 01/18/2023] Open
Abstract
Background In this study, we sought to investigate if computer-extracted magnetic resonance imaging (MRI) phenotypes of breast cancer could replicate human-extracted size and Breast Imaging-Reporting and Data System (BI-RADS) imaging phenotypes using MRI data from The Cancer Genome Atlas (TCGA) project of the National Cancer Institute. Methods Our retrospective interpretation study involved analysis of Health Insurance Portability and Accountability Act-compliant breast MRI data from The Cancer Imaging Archive, an open-source database from the TCGA project. This study was exempt from institutional review board approval at Memorial Sloan Kettering Cancer Center and the need for informed consent was waived. Ninety-one pre-operative breast MRIs with verified invasive breast cancers were analysed. Three fellowship-trained breast radiologists evaluated the index cancer in each case according to size and the BI-RADS lexicon for shape, margin, and enhancement (human-extracted image phenotypes [HEIP]). Human inter-observer agreement was analysed by the intra-class correlation coefficient (ICC) for size and Krippendorff’s α for other measurements. Quantitative MRI radiomics of computerised three-dimensional segmentations of each cancer generated computer-extracted image phenotypes (CEIP). Spearman’s rank correlation coefficients were used to compare HEIP and CEIP. Results Inter-observer agreement for HEIP varied, with the highest agreement seen for size (ICC 0.679) and shape (ICC 0.527). The computer-extracted maximum linear size replicated the human measurement with p < 10⁻¹². CEIP of shape, specifically sphericity and irregularity, replicated HEIP with both p values < 0.001. CEIP did not demonstrate agreement with HEIP of tumour margin or internal enhancement. Conclusions Quantitative radiomics of breast cancer may replicate human-extracted tumour size and BI-RADS imaging phenotypes, thus enabling precision medicine.
Affiliation(s)
- Elizabeth J Sutton
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY 10065 USA
| | - Erich P Huang
- Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20892 USA
| | - Karen Drukker
- Department of Radiology, University of Chicago, 5841 South Maryland Avenue, MC 2026, Chicago, IL 60637 USA
| | - Elizabeth S Burnside
- Department of Radiology, University of Wisconsin School of Medicine and Public Health, 600 Highland Avenue, Madison, WI 53792 USA
| | - Hui Li
- Department of Radiology, University of Chicago, 5841 South Maryland Avenue, MC 2026, Chicago, IL 60637 USA
| | - Jose M Net
- Miller School of Medicine, University of Miami, Miami, FL 33136 USA
| | - Arvind Rao
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77498 USA
| | - Gary J Whitman
- Department of Diagnostic Imaging, The University of Texas MD Anderson Cancer Center, Houston, TX 77030 USA
| | - Margarita Zuley
- Department of Radiology, University of Pittsburgh, Pittsburgh, PA 15213 USA
| | - Marie Ganott
- Department of Radiology, University of Pittsburgh, Pittsburgh, PA 15213 USA
| | - Ermelinda Bonaccio
- Department of Radiology, Roswell Park Cancer Institute, Buffalo, NY 14263 USA
| | - Maryellen L Giger
- Department of Radiology, University of Chicago, 5841 South Maryland Avenue, MC 2026, Chicago, IL 60637 USA
| | - Elizabeth A Morris
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY 10065 USA.,300 East 66th Street, New York, NY 10065 USA
| | | |
|
82
|
Choi JY, Yoo TK, Seo JG, Kwak J, Um TT, Rim TH. Multi-categorical deep learning neural network to classify retinal images: A pilot study employing small database. PLoS One 2017; 12:e0187336. [PMID: 29095872 PMCID: PMC5667846 DOI: 10.1371/journal.pone.0187336] [Citation(s) in RCA: 117] [Impact Index Per Article: 14.6] [Received: 06/10/2017] [Accepted: 10/18/2017] [Indexed: 01/03/2023] Open
Abstract
Deep learning emerges as a powerful tool for analyzing medical images. Retinal disease detection by using computer-aided diagnosis from fundus image has emerged as a new method. We applied deep learning convolutional neural network by using MatConvNet for an automated detection of multiple retinal diseases with fundus photographs involved in STructured Analysis of the REtina (STARE) database. Dataset was built by expanding data on 10 categories, including normal retina and nine retinal diseases. The optimal outcomes were acquired by using a random forest transfer learning based on VGG-19 architecture. The classification results depended greatly on the number of categories. As the number of categories increased, the performance of deep learning models was diminished. When all 10 categories were included, we obtained results with an accuracy of 30.5%, relative classifier information (RCI) of 0.052, and Cohen's kappa of 0.224. Considering three integrated normal, background diabetic retinopathy, and dry age-related macular degeneration, the multi-categorical classifier showed accuracy of 72.8%, 0.283 RCI, and 0.577 kappa. In addition, several ensemble classifiers enhanced the multi-categorical classification performance. The transfer learning incorporated with ensemble classifier of clustering and voting approach presented the best performance with accuracy of 36.7%, 0.053 RCI, and 0.225 kappa in the 10 retinal diseases classification problem. First, due to the small size of datasets, the deep learning techniques in this study were ineffective to be applied in clinics where numerous patients suffering from various types of retinal disorders visit for diagnosis and treatment. Second, we found that the transfer learning incorporated with ensemble classifiers can improve the classification performance in order to detect multi-categorical retinal diseases. Further studies should confirm the effectiveness of algorithms with large datasets obtained from hospitals.
Affiliation(s)
- Joon Yul Choi
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea
| | - Tae Keun Yoo
- Institute of Vision Research, Department of Ophthalmology, Yonsei University College of Medicine, Seoul, South Korea
| | - Jeong Gi Seo
- Institute of Vision Research, Department of Ophthalmology, Yonsei University College of Medicine, Seoul, South Korea
| | - Jiyong Kwak
- Institute of Vision Research, Department of Ophthalmology, Yonsei University College of Medicine, Seoul, South Korea
| | - Terry Taewoong Um
- Department of Electrical & Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada
| | - Tyler Hyungtaek Rim
- Institute of Vision Research, Department of Ophthalmology, Yonsei University College of Medicine, Seoul, South Korea
| |
|
83
|
Yu K, Wu X, Ding W, Mu Y, Wang H. Markov Blanket Feature Selection Using Representative Sets. IEEE Trans Neural Netw Learn Syst 2017; 28:2775-2788. [PMID: 28113384 DOI: 10.1109/tnnls.2016.2602365] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 06/06/2023]
Abstract
It has received much attention in recent years to use Markov blankets in a Bayesian network for feature selection. The Markov blanket of a class attribute in a Bayesian network is a unique yet minimal feature subset for optimal feature selection if the probability distribution of a data set can be faithfully represented by this Bayesian network. However, if a data set violates the faithful condition, Markov blankets of a class attribute may not be unique. To tackle this issue, in this paper, we propose a new concept of representative sets and then design the selection via group alpha-investing (SGAI) algorithm to perform Markov blanket feature selection with representative sets for classification. Using a comprehensive set of real data, our empirical studies have demonstrated that SGAI outperforms the state-of-the-art Markov blanket feature selectors and other well-established feature selection methods.
Affiliation(s)
- Kui Yu
- School of Information Technology and Mathematical Sciences, University of South Australia, Adelaide, SA, Australia
| | - Xindong Wu
- School of Computing and Informatics, University of Louisiana, Lafayette, LA, USA
| | - Wei Ding
- Department of Computer Science, University of Massachusetts Boston, Boston, MA, USA
| | - Yang Mu
- Department of Computer Science, University of Massachusetts Boston, Boston, MA, USA
| | - Hao Wang
- Department of Computer Science, Hefei University of Technology, Hefei, China
| |
|
84
|
Aliferis CF, Statnikov A, Tsamardinos I. Challenges in the Analysis of Mass-Throughput Data: A Technical Commentary from the Statistical Machine Learning Perspective. Cancer Inform 2017. [DOI: 10.1177/117693510600200004] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Indexed: 11/15/2022] Open
Abstract
Sound data analysis is critical to the success of modern molecular medicine research that involves collection and interpretation of mass-throughput data. The novel nature and high-dimensionality in such datasets pose a series of non-trivial data analysis problems. This technical commentary discusses the problems of over-fitting, error estimation, curse of dimensionality, causal versus predictive modeling, integration of heterogeneous types of data, and lack of standard protocols for data analysis. We attempt to shed light on the nature and causes of these problems and to outline viable methodological approaches to overcome them.
Affiliation(s)
- Constantin F. Aliferis
- Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
- Department of Cancer Biology, Vanderbilt University, Nashville, TN, USA
| | - Alexander Statnikov
- Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
| | - Ioannis Tsamardinos
- Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
| |
|
85
|
Deng X, Geng H, Ali HH. Cross-platform Analysis of Cancer Biomarkers: A Bayesian Network Approach to Incorporating Mass Spectrometry and Microarray Data. Cancer Inform 2017. [DOI: 10.1177/117693510700300001] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/16/2022] Open
Abstract
Many studies showed inconsistent cancer biomarkers due to bioinformatics artifacts. In this paper we use multiple data sets from microarrays, mass spectrometry, protein sequences, and other biological knowledge in order to improve the reliability of cancer biomarkers. We present a novel Bayesian network (BN) model which integrates and cross-annotates multiple data sets related to prostate cancer. The main contribution of this study is that we provide a method that is designed to find cancer biomarkers whose presence is supported by multiple data sources and biological knowledge. Relevant biological knowledge is explicitly encoded into the model parameters, and the biomarker finding problem is formulated as a Bayesian inference problem. Besides diagnostic accuracy, we introduce reliability as another quality measurement of the biological relevance of biomarkers. Based on the proposed BN model, we develop an empirical scoring scheme and a simulation algorithm for inferring biomarkers. Fourteen genes/proteins including prostate specific antigen (PSA) are identified as reliable serum biomarkers which are insensitive to the model assumptions. The computational results show that our method is able to find biologically relevant biomarkers with highest reliability while maintaining competitive predictive power. In addition, by combining biological knowledge and data from multiple platforms, the number of putative biomarkers is greatly reduced to allow more-focused clinical studies.
Affiliation(s)
- Xutao Deng
- College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182, U.S.A
| | - Huimin Geng
- Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha, NE 68198, U.S.A
| | - Hesham H. Ali
- College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182, U.S.A
| |
|
86
|
Mohammed A, Biegert G, Adamec J, Helikar T. Identification of potential tissue-specific cancer biomarkers and development of cancer versus normal genomic classifiers. Oncotarget 2017; 8:85692-85715. [PMID: 29156751 PMCID: PMC5689641 DOI: 10.18632/oncotarget.21127] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 06/08/2017] [Accepted: 09/05/2017] [Indexed: 01/15/2023] Open
Abstract
Machine learning techniques for cancer prediction and biomarker discovery can hasten cancer detection and significantly improve prognosis. Recent “OMICS” studies which include a variety of cancer and normal tissue samples along with machine learning approaches have the potential to further accelerate such discovery. To demonstrate this potential, 2,175 gene expression samples from nine tissue types were obtained to identify gene sets whose expression is characteristic of each cancer class. Using random forests classification and ten-fold cross-validation, we developed nine single-tissue classifiers, two multi-tissue cancer-versus-normal classifiers, and one multi-tissue normal classifier. Given a sample of a specified tissue type, the single-tissue models classified samples as cancer or normal with a testing accuracy between 85.29% and 100%. Given a sample of non-specific tissue type, the multi-tissue bi-class model classified the sample as cancer versus normal with a testing accuracy of 97.89%. Given a sample of non-specific tissue type, the multi-tissue multi-class model classified the sample as cancer versus normal and as a specific tissue type with a testing accuracy of 97.43%. Given a normal sample of any of the nine tissue types, the multi-tissue normal model classified the sample as a particular tissue type with a testing accuracy of 97.35%. The machine learning classifiers developed in this study identify potential cancer biomarkers with sensitivity and specificity that exceed those of existing biomarkers and pointed to pathways that are critical to tissue-specific tumor development. This study demonstrates the feasibility of predicting the tissue origin of carcinoma in the context of multiple cancer classes.
Affiliation(s)
- Akram Mohammed
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, USA
| | - Greyson Biegert
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, USA
| | - Jiri Adamec
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, USA
| | - Tomáš Helikar
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, USA
| |
|
87
|
Boulesteix AL, Wilson R, Hapfelmeier A. Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies. BMC Med Res Methodol 2017; 17:138. [PMID: 28888225 PMCID: PMC5591542 DOI: 10.1186/s12874-017-0417-2] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 05/18/2017] [Accepted: 08/31/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly "evidence-based". Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research. MAIN MESSAGE In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of "evidence-based" statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments. CONCLUSION We suggest that benchmark studies, a method of assessment of statistical methods using real-world datasets, might benefit from adopting (some) concepts from evidence-based medicine towards the goal of more evidence-based statistical research.
Affiliation(s)
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany.
| | - Rory Wilson
- Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany
| | - Alexander Hapfelmeier
- Institute of Medical Statistics and Epidemiology, Technical University Munich, Ismaninger Str. 22, Munich, 81675, Germany
| |
|
88
|
Duan F, Xu Y. Applying Multivariate Adaptive Splines to Identify Genes With Expressions Varying After Diagnosis in Microarray Experiments. Cancer Inform 2017; 16:1176935117705381. [PMID: 28579740 PMCID: PMC5422340 DOI: 10.1177/1176935117705381] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/12/2014] [Accepted: 02/20/2017] [Indexed: 12/17/2022] Open
Abstract
PURPOSE To analyze a microarray experiment to identify the genes with expressions varying after the diagnosis of breast cancer. METHODS A total of 44 928 probe sets in an Affymetrix microarray dataset, publicly available on Gene Expression Omnibus, from 249 patients with breast cancer were analyzed by nonparametric multivariate adaptive splines. Then, the identified genes with turning points were grouped by K-means clustering, and their network relationship was subsequently analyzed by the Ingenuity Pathway Analysis. RESULTS In total, 1640 probe sets (genes) were reliably identified to have turning points along with the age at diagnosis in their expression profiling, of which 927 were expressed at lower levels after the turning points and 713 at higher levels after the turning points. K-means clustered them into 3 groups with turning points centering at 54, 62.5, and 72, respectively. The pathway analysis showed that the identified genes were actively involved in various cancer-related functions or networks. CONCLUSIONS In this article, we applied the nonparametric multivariate adaptive splines method to a publicly available gene expression dataset and successfully identified genes with expressions varying before and after breast cancer diagnosis.
Affiliation(s)
- Fenghai Duan
- Department of Biostatistics and Center for Statistical Sciences, School of Public Health, Brown University, Providence, RI, USA
| | - Ye Xu
- StubHub, San Francisco, CA, USA
| |
|
89
|
Urda D, Luque-Baena RM, Franco L, Jerez JM, Sanchez-Marono N. Machine learning models to search relevant genetic signatures in clinical context. 2017 International Joint Conference on Neural Networks (IJCNN) 2017:1649-1656. [DOI: 10.1109/ijcnn.2017.7966049] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/04/2025]
|
90
|
Robust Microbiota-Based Diagnostics for Inflammatory Bowel Disease. J Clin Microbiol 2017; 55:1720-1732. [PMID: 28330889 DOI: 10.1128/jcm.00162-17] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 01/27/2017] [Accepted: 03/15/2017] [Indexed: 01/11/2023] Open
Abstract
Strong evidence suggests that the gut microbiota is altered in inflammatory bowel disease (IBD), indicating its potential role in noninvasive diagnostics. However, no clinical applications are currently used for routine patient care. The main obstacle to implementing a gut microbiota test for IBD is the lack of standardization, which leads to high interlaboratory variation. We studied the between-hospital and between-platform batch effects and their effects on predictive accuracy for IBD. Fecal samples from 91 pediatric IBD patients and 58 healthy children were collected. IS-pro, a standardized technique designed for routine microbiota profiling in clinical settings, was used for microbiota composition characterization. Additionally, a large synthetic data set was used to simulate various perturbations and study their effects on the accuracy of different classifiers. Perturbations were validated in two replicate data sets, one processed in another laboratory and the other with a different analysis platform. The type of perturbation determined its effect on predictive accuracy. Real-life perturbations induced by between-platform variation were significantly greater than those caused by between-laboratory variation. Random forest was found to be robust to both simulated and observed perturbations, even when these perturbations had a dramatic effect on other classifiers. It achieved high accuracy both when cross-validated within the same data set and when using data sets analyzed in different laboratories. Robust clinical predictions based on the gut microbiota can be performed even when samples are processed in different hospitals. This study contributes to the effort to develop a universal IBD test that would enable simple diagnostics and disease activity monitoring.
|
91
|
Fei Y, Hu J, Li WQ, Wang W, Zong GQ. Artificial neural networks predict the incidence of portosplenomesenteric venous thrombosis in patients with acute pancreatitis. J Thromb Haemost 2017; 15:439-445. [PMID: 27960048 DOI: 10.1111/jth.13588] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 06/07/2016] [Indexed: 12/18/2022]
Abstract
Essentials Predicting the occurrence of portosplenomesenteric vein thrombosis (PSMVT) is difficult. We studied 72 patients with acute pancreatitis. Artificial neural networks modeling was more accurate than logistic regression in predicting PSMVT. Additional predictive factors may be incorporated into artificial neural networks. SUMMARY Objective To construct and validate artificial neural networks (ANNs) for predicting the occurrence of portosplenomesenteric venous thrombosis (PSMVT) and compare the predictive ability of the ANNs with that of logistic regression. Methods The ANNs and logistic regression modeling were constructed using simple clinical and laboratory data of 72 acute pancreatitis (AP) patients. The ANNs and logistic modeling were first trained on 48 randomly chosen patients and validated on the remaining 24 patients. The accuracy and the performance characteristics were compared between these two approaches by SPSS 17.0 software. Results The training set and validation set did not differ on any of the 11 variables. After training, the back propagation network training error converged to 1 × 10⁻²⁰, and it retained excellent pattern recognition ability. When the ANNs model was applied to the validation set, it revealed a sensitivity of 80%, specificity of 85.7%, a positive predictive value of 77.6% and negative predictive value of 90.7%. The accuracy was 83.3%. Differences could be found between ANNs modeling and logistic regression modeling in these parameters (10.0% [95% CI, -14.3 to 34.3%], 14.3% [95% CI, -8.6 to 37.2%], 15.7% [95% CI, -9.9 to 41.3%], 11.8% [95% CI, -8.2 to 31.8%], 22.6% [95% CI, -1.9 to 47.1%], respectively). When ANNs modeling was used to identify PSMVT, the area under receiver operating characteristic curve was 0.849 (95% CI, 0.807-0.901), which demonstrated better overall properties than logistic regression modeling (AUC = 0.716) (95% CI, 0.679-0.761).
Conclusions ANNs modeling was a more accurate tool than logistic regression in predicting the occurrence of PSMVT following AP. More clinical factors or biomarkers may be incorporated into ANNs modeling to improve its predictive ability.
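The performance measures named in this abstract (sensitivity, specificity, positive and negative predictive value, accuracy) all follow from the four cells of a confusion matrix. The sketch below shows the arithmetic only; the function name and the counts are hypothetical, chosen to sum to a 24-patient validation set, and are not the study's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for illustration only (NOT taken from the paper).
metrics = binary_metrics(tp=7, fp=3, tn=11, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With small validation sets such as this one, each metric moves by several percentage points when a single patient is reclassified, which is consistent with the wide confidence intervals the abstract reports.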
Affiliation(s)
- Y Fei
- Surgical Intensive Care Unit (SICU), Department of General Surgery, Jinling Hospital, Medical School of Nanjing University, Nanjing, China
| | - J Hu
- School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing, China
| | - W-Q Li
- Surgical Intensive Care Unit (SICU), Department of General Surgery, Jinling Hospital, Medical School of Nanjing University, Nanjing, China
| | - W Wang
- Department of General Surgery, Bayi Hospital affiliated Nanjing University of Chinese Medicine/the 81st Hospital of P.L.A., Nanjing, China
| | - G-Q Zong
- Department of General Surgery, Bayi Hospital affiliated Nanjing University of Chinese Medicine/the 81st Hospital of P.L.A., Nanjing, China
| |
|
92
|
FRBPSO: A Fuzzy Rule Based Binary PSO for Feature Selection. Proceedings of the National Academy of Sciences, India Section A: Physical Sciences 2017. [DOI: 10.1007/s40010-017-0347-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/27/2022]
|
93
|
Bari MG, Salekin S, Zhang JM. A Robust and Efficient Feature Selection Algorithm for Microarray Data. Mol Inform 2016; 36. [PMID: 28000384 DOI: 10.1002/minf.201600099] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/14/2016] [Accepted: 11/21/2016] [Indexed: 12/20/2022]
Abstract
In the past decades, a few synergistic feature selection algorithms have been published, including Cooperative Index (CI) and K-Top Scoring Pair (k-TSP). These algorithms consider the synergistic behavior of features when they are included in a feature panel. Although promising results have been shown for these algorithms, there is a lack of a comprehensive and fair comparison with other feature selection algorithms across a large number of microarray datasets in terms of classification accuracy and computational complexity. There is a need to evaluate their performance and to reduce the complexity of such algorithms. We compared the performance of synergistic feature selection algorithms with 11 other commonly used algorithms based on 22 microarray gene expression binary class datasets. The evaluation confirms that synergistic algorithms such as CI and k-TSP will gradually increase the classification performance as more features are used in the classifiers. Also, in order to cut down computational cost, we proposed a new feature selection ranking score called Positive Synergy Index (PSI). Testing results show that features selected using PSI as well as synergistic feature selection algorithms provide better performance compared with all other methods, while PSI has a computational complexity significantly lower than that of other synergistic algorithms.
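To make the idea of a pairwise, "synergistic" score concrete, the following is a minimal sketch of the classic top-scoring-pair criterion on which k-TSP builds: a feature pair (i, j) is scored by how differently the ordering x_i < x_j occurs in the two classes. This illustrates the general technique only; it is not the authors' implementation, it does not compute CI or PSI, and the toy data and function names are hypothetical.

```python
from itertools import combinations


def tsp_score(samples, labels, i, j):
    """|P(x_i < x_j | class 0) - P(x_i < x_j | class 1)| for one feature pair."""
    freq = []
    for cls in (0, 1):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        freq.append(sum(r[i] < r[j] for r in rows) / len(rows))
    return abs(freq[0] - freq[1])


def top_scoring_pair(samples, labels):
    """Exhaustively score all feature pairs and return the best one."""
    n_features = len(samples[0])
    return max(combinations(range(n_features), 2),
               key=lambda pair: tsp_score(samples, labels, *pair))


# Toy expression matrix (rows = samples, columns = features); hypothetical values.
samples = [
    [1.0, 2.0, 5.0],
    [1.5, 3.0, 4.0],
    [3.0, 1.0, 5.5],
    [4.0, 2.0, 4.5],
]
labels = [0, 0, 1, 1]

best = top_scoring_pair(samples, labels)
print(best, tsp_score(samples, labels, *best))
```

The exhaustive search visits every pair, so its cost grows quadratically with the number of features, which is the kind of computational burden the abstract says a cheaper ranking score is meant to reduce.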
Affiliation(s)
- Mehrab Ghanat Bari
- Dept. of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN, 55905
| | - Sirajul Salekin
- Dept. of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX, 78249
| | - Jianqiu Michelle Zhang
- Dept. of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX, 78249
| |
|
94
|
Lai CM, Yeh WC, Chang CY. Gene selection using information gain and improved simplified swarm optimization. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.08.089] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Indexed: 10/21/2022]
|
95
|
Devi Arockia Vanitha C, Devaraj D, Venkatesulu M. Multiclass cancer diagnosis in microarray gene expression profile using mutual information and Support Vector Machine. Intell Data Anal 2016. [DOI: 10.3233/ida-150203] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/08/2023]
Affiliation(s)
| | - D. Devaraj
- Department of Electrical and Electronics Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, Tamil Nadu, India
| | - M. Venkatesulu
- Department of Computer Applications, Kalasalingam Academy of Research and Education, Krishnankoil, Tamil Nadu, India
| |
|
96
|
Prieto A, Prieto B, Ortigosa EM, Ros E, Pelayo F, Ortega J, Rojas I. Neural networks: An overview of early research, current frameworks and new challenges. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.06.014] [Citation(s) in RCA: 161] [Impact Index Per Article: 17.9] [Indexed: 01/24/2023]
|
97
|
Yang H, Seoighe C. Impact of the Choice of Normalization Method on Molecular Cancer Class Discovery Using Nonnegative Matrix Factorization. PLoS One 2016; 11:e0164880. [PMID: 27741311 PMCID: PMC5065197 DOI: 10.1371/journal.pone.0164880] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 06/09/2016] [Accepted: 10/03/2016] [Indexed: 11/18/2022] Open
Abstract
Nonnegative Matrix Factorization (NMF) has proved to be an effective method for unsupervised clustering analysis of gene expression data. By the nonnegativity constraint, NMF provides a decomposition of the data matrix into two matrices that have been used for clustering analysis. However, the decomposition is not unique. This allows different clustering results to be obtained, resulting in different interpretations of the decomposition. To alleviate this problem, some existing methods directly enforce uniqueness to some extent by adding regularization terms in the NMF objective function. Alternatively, various normalization methods have been applied to the factor matrices; however, the effects of the choice of normalization have not been carefully investigated. Here we investigate the performance of NMF for the task of cancer class discovery, under a wide range of normalization choices. After extensive evaluations, we observe that the maximum norm showed the best performance, although the maximum norm has not previously been used for NMF. Matlab codes are freely available from: http://maths.nuigalway.ie/~haixuanyang/pNMF/pNMF.htm.
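The non-uniqueness mentioned in this abstract can be seen directly: for any positive diagonal scaling D, the factors (W D⁻¹, D H) reproduce the same product as (W, H). The sketch below applies one such scaling, a maximum-norm normalization of the columns of W, in plain Python. Which factor is normalized, and the matrices used, are assumptions made for illustration; the authors' Matlab code may proceed differently.

```python
def max_norm_normalize(W, H):
    """Rescale W (n x k) so each column has maximum 1, compensating in H (k x m).

    Assumes nonnegative factors with no all-zero column in W.
    The product W @ H is unchanged by this rescaling.
    """
    k = len(H)
    scales = [max(row[c] for row in W) for c in range(k)]
    W_n = [[row[c] / scales[c] for c in range(k)] for row in W]
    H_n = [[scales[c] * v for v in H[c]] for c in range(k)]
    return W_n, H_n


def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical factor matrices (3 genes x 2 components, 2 components x 3 samples).
W = [[2.0, 0.5], [4.0, 1.0], [1.0, 2.0]]
H = [[1.0, 0.0, 3.0], [0.5, 2.0, 1.0]]

W_n, H_n = max_norm_normalize(W, H)
print(W_n)
print(H_n)
```

Because the reconstruction is identical before and after rescaling, the data fit cannot distinguish between normalizations; only the cluster assignments read off the rescaled factors can differ, which is why the choice of norm matters for class discovery.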
Affiliation(s)
- Haixuan Yang
- School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland
| | - Cathal Seoighe
- School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland
| |
|
98
|
Khondoker M, Dobson R, Skirrow C, Simmons A, Stahl D. A comparison of machine learning methods for classification using simulation with multiple real data examples from mental health studies. Stat Methods Med Res 2016; 25:1804-1823. [PMID: 24047600 PMCID: PMC5081132 DOI: 10.1177/0962280213502437] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Indexed: 11/16/2022]
Abstract
BACKGROUND Recent literature on the comparison of machine learning methods has raised questions about the neutrality, unbiasedness and utility of many comparative studies. Reporting of results on favourable datasets and sampling error in the estimated performance measures based on single samples are thought to be the major sources of bias in such comparisons. Better performance in one or a few instances does not necessarily imply better performance on average or at a population level, and simulation studies may be a better alternative for objectively comparing the performances of machine learning algorithms. METHODS We compare the classification performance of a number of important and widely used machine learning algorithms, namely the Random Forests (RF), Support Vector Machines (SVM), Linear Discriminant Analysis (LDA) and k-Nearest Neighbour (kNN). Using massively parallel processing on high-performance supercomputers, we compare the generalisation errors at various combinations of levels of several factors: number of features, training sample size, biological variation, experimental variation, effect size, replication and correlation between features. RESULTS For a smaller number of correlated features, with the number of features not exceeding approximately half the sample size, LDA was found to be the method of choice in terms of average generalisation errors as well as stability (precision) of error estimates. SVM (with RBF kernel) outperforms LDA as well as RF and kNN by a clear margin as the feature set gets larger provided the sample size is not too small (at least 20). The performance of kNN also improves as the number of features grows and outplays that of LDA and RF unless the data variability is too high and/or effect sizes are too small. RF was found to outperform only kNN in some instances where the data are more variable and have smaller effect sizes, in which cases it also provides more stable error estimates than kNN and LDA.
Applications to a number of real datasets supported the findings from the simulation study.
Affiliation(s)
- Mizanur Khondoker
- King's College London, Institute of Psychiatry, Department of Biostatistics, London, UK King's College London, Institute of Psychiatry, NIHR Biomedical Research Centre for Mental Health at the South London and Maudsley NHS Foundation Trust, London, UK
| | - Richard Dobson
- King's College London, Institute of Psychiatry, NIHR Biomedical Research Centre for Mental Health at the South London and Maudsley NHS Foundation Trust, London, UK King's College London, Institute of Psychiatry, NIHR Biomedical Research Unit for Dementia at the South London and Maudsley NHS Foundation Trust, London, UK
| | - Caroline Skirrow
- King's College London, Institute of Psychiatry, MRC Social, Genetic and Developmental Psychiatry Centre, UK
| | - Andrew Simmons
- King's College London, Institute of Psychiatry, NIHR Biomedical Research Centre for Mental Health at the South London and Maudsley NHS Foundation Trust, London, UK King's College London, Institute of Psychiatry, NIHR Biomedical Research Unit for Dementia at the South London and Maudsley NHS Foundation Trust, London, UK
| | - Daniel Stahl
- King's College London, Institute of Psychiatry, Department of Biostatistics, London, UK
| |
|
99
|
Abstract
BACKGROUND Development of biologically relevant models from gene expression data, notably microarray data, has become a topic of great interest in the fields of bioinformatics, clinical genetics and oncology. Only a small number of gene expression data, compared to the total number of genes explored, possess a significant correlation with a certain phenotype. Gene selection enables researchers to obtain substantial insight into the genetic nature of the disease and the mechanisms responsible for it. Besides improving the performance of cancer classification, it can also cut down the time and cost of medical diagnoses. METHODS This study presents a modified Artificial Bee Colony Algorithm (ABC) to select the minimum number of genes that are deemed to be significant for cancer while improving predictive accuracy. The search equation of ABC is believed to be good at exploration but poor at exploitation. To overcome this limitation we have modified the ABC algorithm by incorporating the concept of pheromones, which is one of the major components of the Ant Colony Optimization (ACO) algorithm, and a new operation in which successive bees communicate to share their findings. RESULTS The proposed algorithm is evaluated using a suite of ten publicly available datasets after the parameters are tuned scientifically with one of the datasets. The results obtained are compared to other works that used the same datasets. The performance of the proposed method is shown to be superior. CONCLUSION The method presented in this paper can provide a subset of genes leading to more accurate classification results while the number of selected genes is smaller. Additionally, the proposed modified Artificial Bee Colony Algorithm could conceivably be applied to problems in other areas as well.
Affiliation(s)
| | - Rameen Shakur
- Wellcome Trust - Medical Research Council Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
| | - Mohammad Kaykobad
- A ℓEDA Group, Department of CSE, BUET, Dhaka-1205, Dhaka, Bangladesh
| | | |
|
100
|
Lovato P, Bicego M, Kesa M, Jojic N, Murino V, Perina A. Traveling on discrete embeddings of gene expression. Artif Intell Med 2016; 70:1-11. [PMID: 27431033 DOI: 10.1016/j.artmed.2016.05.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/24/2015] [Revised: 05/20/2016] [Accepted: 05/21/2016] [Indexed: 12/24/2022]
Abstract
OBJECTIVE High-throughput technologies have generated an unprecedented amount of high-dimensional gene expression data. Algorithmic approaches could be extremely useful to distill information and derive compact interpretable representations of the statistical patterns present in the data. This paper proposes a mining approach to extract an informative representation of gene expression profiles based on a generative model called the Counting Grid (CG). METHOD Using the CG model, gene expression values are arranged on a discrete grid, learned in a way that "similar" co-expression patterns are arranged in close proximity, thus resulting in an intuitive visualization of the dataset. Moreover, the model makes it possible to identify the genes that distinguish between classes (e.g. different types of cancer). Finally, each sample can be characterized with a discriminative signature, extracted from the model, that can be effectively employed for classification. RESULTS A thorough evaluation on several gene expression datasets demonstrates the suitability of the proposed approach from a twofold perspective: numerically, we reached state-of-the-art classification accuracies on 5 datasets out of 7, and similar results when the approach is tested in a gene selection setting (with a stability always above 0.87); clinically, by confirming that many of the genes highlighted by the model as significant also play a key role in cancer biology. CONCLUSION The proposed framework can be successfully exploited to meaningfully visualize the samples, detect medically relevant genes, and properly classify samples.
Affiliation(s)
- Pietro Lovato
- Department of Computer Science, University of Verona, Strada le Grazie 15, 37134 Verona, Italy.
| | - Manuele Bicego
- Department of Computer Science, University of Verona, Strada le Grazie 15, 37134 Verona, Italy
| | - Maria Kesa
- Tallinn University of Technology, Ehitajate tee 5, 19086 Tallinn, Estonia
| | - Nebojsa Jojic
- Microsoft Research, One Microsoft Way, 98052 Redmond, WA, USA
| | - Vittorio Murino
- Pattern Analysis and Computer Vision (PAVIS), Istituto Italiano di Tecnologia (IIT), Via Morego 30, 16163 Genova, Italy
| | | |
|