301
|
He H, Lin D, Zhang J, Wang Y, Deng HW. Biostatistics, Data Mining and Computational Modeling. TRANSLATIONAL BIOINFORMATICS 2016. [DOI: 10.1007/978-94-017-7543-4_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/21/2022]
|
302
|
Jung K. Statistical Aspects in Proteomic Biomarker Discovery. Methods Mol Biol 2016; 1362:293-310. [PMID: 26519185 DOI: 10.1007/978-1-4939-3106-4_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
In the pursuit of personalized medicine, i.e., the individual treatment of a patient, many medical decision problems should be supported by biomarkers that can help to make a diagnosis, prediction, or prognosis. Proteomic biomarkers are of special interest since they can not only be detected in tissue samples but can also often be easily detected in diverse body fluids. Statistical methods play an important role in the discovery and validation of proteomic biomarkers. They are necessary in the planning of experiments, in the processing of raw signals, and in the final data analysis. This review provides an overview of the most frequent experimental settings, including sample size considerations, and focuses on exploratory data analysis and classifier development.
Affiliation(s)
- Klaus Jung
- Department of Medical Statistics, Georg-August-University Göttingen, Humboldtallee 32, 37073, Göttingen, Germany.
|
303
|
Lu TP, Chen JJ. Subgroup identification for treatment selection in biomarker adaptive design. BMC Med Res Methodol 2015; 15:105. [PMID: 26646831 PMCID: PMC4673750 DOI: 10.1186/s12874-015-0098-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 08/28/2015] [Accepted: 12/01/2015] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Advances in molecular technology have shifted new drug development toward targeted therapy for treatments expected to benefit subpopulations of patients. Adaptive signature design (ASD) has been proposed to identify the most suitable target patient subgroup to enhance efficacy of treatment effect. There are two essential aspects in the development of biomarker adaptive designs: 1) an accurate classifier to identify the most appropriate treatment for patients, and 2) statistical tests to detect treatment effect in the relevant population and subpopulations. We propose utilization of classification methods to identify patient subgroups and present a statistical testing strategy to detect treatment effects. METHODS The diagonal linear discriminant analysis (DLDA) is used to identify targeted and non-targeted subgroups. For binary endpoints, DLDA is directly applied to classify patients into two subgroups; for continuous endpoints, a two-step procedure involving model fitting and determination of a cutoff-point is used for subgroup classification. The proposed strategy includes tests for treatment effect in all patients and in a marker-positive subgroup, with a possible follow-up estimation of treatment effect in the marker-negative subgroup. The proposed method is compared to the ASD classification method using simulated datasets and two publicly available cancer datasets. RESULTS The DLDA-based classifier performs well in terms of sensitivity, specificity, positive and negative predictive values, and accuracy in the simulation data and the two cancer datasets, with superior accuracy compared to the ASD method. The subgroup testing strategy is shown to be useful in detecting treatment effect in terms of power and control of study-wise error. CONCLUSION Accuracy of a classifier is essential for adaptive designs. A poor classifier not only assigns patients to inappropriate treatments, but also reduces the power of the test, resulting in incorrect conclusions. The proposed procedure provides an effective approach for subgroup identification and subgroup analysis.
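As an illustration of the classifier named in this abstract, the following is a minimal sketch of a generic diagonal linear discriminant analysis (DLDA) rule in plain Python. It is not the authors' implementation: the function names `fit_dlda` and `predict_dlda`, the toy data, and the small variance floor are assumptions made for the example. DLDA is taken here in its usual textbook form, in which each class has its own mean vector, all classes share one pooled variance per feature, and covariances between features are ignored.

```python
from math import log

def fit_dlda(X, y):
    """Estimate class means, pooled per-feature variances, and class priors.

    X is a list of feature rows, y the matching list of class labels.
    (Hypothetical helper for illustration only.)
    """
    classes = sorted(set(y))
    n, p = len(X), len(X[0])
    means, priors = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        priors[c] = len(rows) / n
    # Pooled within-class variance per feature: only the diagonal of the
    # covariance matrix is estimated, which is what makes the rule "diagonal".
    variances = []
    for j in range(p):
        ss = sum((x[j] - means[label][j]) ** 2 for x, label in zip(X, y))
        variances.append(max(ss / (n - len(classes)), 1e-12))  # floor is an assumption
    return means, variances, priors

def predict_dlda(model, x):
    """Assign x to the class with the largest diagonal discriminant score."""
    means, variances, priors = model

    def score(c):
        dist = sum((x[j] - means[c][j]) ** 2 / variances[j] for j in range(len(x)))
        return -0.5 * dist + log(priors[c])

    return max(means, key=score)

# Toy example with two marker values per patient (made-up numbers).
X = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.0], [0.8, 0.2]]
y = ["marker-negative", "marker-negative", "marker-positive", "marker-positive"]
model = fit_dlda(X, y)
print(predict_dlda(model, [0.9, 0.1]))
```

Because only per-feature variances are estimated, the rule stays usable when the number of markers exceeds the number of patients, a setting in which a full covariance matrix could not be inverted.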
Affiliation(s)
- Tzu-Pin Lu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, Food and Drug Administration, 3900 NCTR Road, HFT-20, Jefferson, AR, 72079, USA; Department of Public Health, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan.
- James J Chen
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, Food and Drug Administration, 3900 NCTR Road, HFT-20, Jefferson, AR, 72079, USA; Graduate Institute of Biostatistics, China Medical University, Taichung, Taiwan.
|
304
|
Nam JH, Kim D. Modified linear discriminant analysis using block covariance matrix in high-dimensional data. COMMUN STAT-SIMUL C 2015. [DOI: 10.1080/03610918.2015.1014103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
|
305
|
Chen JJ, Lu TP, Chen YC, Lin WJ. Predictive biomarkers for treatment selection: statistical considerations. Biomark Med 2015; 9:1121-35. [DOI: 10.2217/bmm.15.84] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 12/16/2022] Open
Abstract
Predictive biomarkers are developed for treatment selection to identify patients who are likely to benefit from a particular therapy. This review describes statistical methods and discusses issues in the development of predictive biomarkers to enhance study efficiency for detection of treatment effect on the selected responder patients in clinical studies. The statistical procedure for treatment selection consists of three components: biomarker identification, subgroup selection and clinical utility assessment. Major statistical issues discussed include biomarker designs, procedures to identify predictive biomarkers, classification models for subgroup selection, subgroup analysis and multiple testing for clinical utility assessment and evaluation.
Affiliation(s)
- James J Chen
- Division of Bioinformatics & Biostatistics, National Center for Toxicological Research, US Food & Drug Administration, Jefferson, AR 72079, USA
- Graduate Institute of Biostatistics, China Medical University, Taichung, Taiwan
- Tzu-Pin Lu
- Department of Public Health, Institute of Epidemiology & Preventive Medicine, National Taiwan University, Taipei, Taiwan
- Yu-Chuan Chen
- Division of Bioinformatics & Biostatistics, National Center for Toxicological Research, US Food & Drug Administration, Jefferson, AR 72079, USA
- Wei-Jiun Lin
- Department of Applied Mathematics, Feng Chia University, Taichung, Taiwan
|
306
|
Choi SB, Park JS, Chung JW, Yoo TK, Kim DW. Multicategory classification of 11 neuromuscular diseases based on microarray data using support vector machine. Annu Int Conf IEEE Eng Med Biol Soc 2015; 2014:3460-3. [PMID: 25570735 DOI: 10.1109/embc.2014.6944367] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/05/2022]
Abstract
We applied multicategory machine learning methods to classify 11 neuromuscular disease groups and one control group based on microarray data. To develop multicategory classification models with optimal parameters and features, we performed a systematic evaluation of three machine learning algorithms and four feature selection methods using three-fold cross validation and a grid search. This study included 114 subjects of 11 neuromuscular diseases and 31 subjects of a control group using microarray data with 22,283 probe sets from the National Center for Biotechnology Information (NCBI). We obtained an accuracy of 100%, relative classifier information (RCI) of 1.0, and a kappa index of 1.0 by applying the models of support vector machines one-versus-one (SVM-OVO), SVM one-versus-rest (OVR), and directed acyclic graph SVM (DAGSVM), using the ratio of genes between categories to within-category sums of squares (BW) feature selection method. Each of these three models selected only four features to categorize the 12 groups, resulting in a time-saving and cost-effective strategy for diagnosing neuromuscular diseases. In addition, a gene symbol, SPP1 was selected as the top-ranked gene by the BW method. We confirmed relationships between the gene (SPP1) and Duchenne muscular dystrophy (DMD) from a previous study. With our models as clinically helpful tools, neuromuscular diseases could be classified quickly using a computer, thereby giving a time-saving, cost-effective, and accurate diagnosis.
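The BW feature selection criterion mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes the usual definition of the statistic (for each gene, the between-category sum of squares divided by the within-category sum of squares); the function names `bw_ratio` and `top_features` and the toy data are hypothetical and are not taken from the study.

```python
def bw_ratio(X, y):
    """Return the between-to-within sum-of-squares ratio for every feature.

    X is a list of samples (rows of feature values), y the class labels.
    """
    n, p = len(X), len(X[0])
    classes = set(y)
    ratios = []
    for j in range(p):
        column = [row[j] for row in X]
        overall_mean = sum(column) / n
        between = 0.0
        within = 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            class_mean = sum(values) / len(values)
            # Spread of the class mean around the overall mean, weighted by class size.
            between += len(values) * (class_mean - overall_mean) ** 2
            # Spread of the samples around their own class mean.
            within += sum((v - class_mean) ** 2 for v in values)
        ratios.append(between / within if within > 0 else float("inf"))
    return ratios

def top_features(X, y, k):
    """Indices of the k features with the largest BW ratio."""
    ratios = bw_ratio(X, y)
    return sorted(range(len(ratios)), key=lambda j: ratios[j], reverse=True)[:k]

# Toy data: feature 0 separates the two groups, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [3.0, 4.0], [3.2, 4.2]]
y = ["disease", "disease", "control", "control"]
print(top_features(X, y, 1))
```

A large ratio means the category means are far apart relative to the scatter inside each category, so ranking by this ratio and keeping the top few features is one way to arrive at a small panel such as the four features reported above.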
|
307
|
Rosa MJ, Mehta MA, Pich EM, Risterucci C, Zelaya F, Reinders AATS, Williams SCR, Dazzan P, Doyle OM, Marquand AF. Estimating multivariate similarity between neuroimaging datasets with sparse canonical correlation analysis: an application to perfusion imaging. Front Neurosci 2015; 9:366. [PMID: 26528117 PMCID: PMC4603249 DOI: 10.3389/fnins.2015.00366] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 07/21/2015] [Accepted: 09/23/2015] [Indexed: 01/16/2023] Open
Abstract
An increasing number of neuroimaging studies are based on either combining more than one data modality (inter-modal) or combining more than one measurement from the same modality (intra-modal). To date, most intra-modal studies using multivariate statistics have focused on differences between datasets, for instance relying on classifiers to differentiate between effects in the data. However, to fully characterize these effects, multivariate methods able to measure similarities between datasets are needed. One classical technique for estimating the relationship between two datasets is canonical correlation analysis (CCA). However, in the context of high-dimensional data the application of CCA is extremely challenging. A recent extension of CCA, sparse CCA (SCCA), overcomes this limitation, by regularizing the model parameters while yielding a sparse solution. In this work, we modify SCCA with the aim of facilitating its application to high-dimensional neuroimaging data and finding meaningful multivariate image-to-image correspondences in intra-modal studies. In particular, we show how the optimal subset of variables can be estimated independently and we look at the information encoded in more than one set of SCCA transformations. We illustrate our framework using Arterial Spin Labeling data to investigate multivariate similarities between the effects of two antipsychotic drugs on cerebral blood flow.
Affiliation(s)
- Maria J. Rosa
- Centre for Neuroimaging Sciences, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Mitul A. Mehta
- Centre for Neuroimaging Sciences, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Fernando Zelaya
- Centre for Neuroimaging Sciences, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Antje A. T. S. Reinders
- Department of Psychosis Studies, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Steve C. R. Williams
- Centre for Neuroimaging Sciences, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Paola Dazzan
- Department of Psychosis Studies, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- National Institute for Health Research Mental Health Biomedical Research Centre, South London and Maudsley National Health Service Foundation Trust, King's College London, London, UK
- Orla M. Doyle
- Centre for Neuroimaging Sciences, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Andre F. Marquand
- Centre for Neuroimaging Sciences, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Department of Cognitive Neuroscience, Radboud University Medical Centre, Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, Netherlands
|
308
|
Madahian B, Roy S, Bowman D, Deng LY, Homayouni R. A Bayesian approach for inducing sparsity in generalized linear models with multi-category response. BMC Bioinformatics 2015; 16 Suppl 13:S13. [PMID: 26423345 PMCID: PMC4597416 DOI: 10.1186/1471-2105-16-s13-s13] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND The dimension and complexity of high-throughput gene expression data create many challenges for downstream analysis. Several approaches exist to reduce the number of variables with respect to small sample sizes. In this study, we utilized the Generalized Double Pareto (GDP) prior to induce sparsity in a Bayesian Generalized Linear Model (GLM) setting. The approach was evaluated using a publicly available microarray dataset containing 99 samples corresponding to four different prostate cancer subtypes. RESULTS A hierarchical Sparse Bayesian GLM using GDP prior (SBGG) was developed to take into account the progressive nature of the response variable. We obtained an average overall classification accuracy between 82.5% and 94%, which was higher than Support Vector Machine, Random Forest or a Sparse Bayesian GLM using double exponential priors. Additionally, SBGG outperforms the other 3 methods in correctly identifying pre-metastatic stages of cancer progression, which can prove extremely valuable for therapeutic and diagnostic purposes. Importantly, using Geneset Cohesion Analysis Tool, we found that the top 100 genes produced by SBGG had an average functional cohesion p-value of 2.0E-4 compared to 0.007 to 0.131 produced by the other methods. CONCLUSIONS Using GDP in a Bayesian GLM model applied to cancer progression data results in better subclass prediction. In particular, the method identifies pre-metastatic stages of prostate cancer with substantially better accuracy and produces more functionally relevant gene sets.
|
309
|
|
310
|
|
311
|
Blagus R, Lusa L. Boosting for high-dimensional two-class prediction. BMC Bioinformatics 2015; 16:300. [PMID: 26390865 PMCID: PMC4578758 DOI: 10.1186/s12859-015-0723-9] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 03/20/2015] [Accepted: 08/26/2015] [Indexed: 11/30/2022] Open
Abstract
Background In clinical research prediction models are used to accurately predict the outcome of the patients based on some of their characteristics. For high-dimensional prediction models (the number of variables greatly exceeds the number of samples) the choice of an appropriate classifier is crucial as it was observed that no single classification algorithm performs optimally for all types of data. Boosting was proposed as a method that combines the classification results obtained using base classifiers, where the sample weights are sequentially adjusted based on the performance in previous iterations. Generally boosting outperforms any individual classifier, but studies with high-dimensional data showed that the most standard boosting algorithm, AdaBoost.M1, cannot significantly improve the performance of its base classifier. Recently other boosting algorithms were proposed (Gradient boosting, Stochastic Gradient boosting, LogitBoost); they were shown to perform better than AdaBoost.M1 but their performance was not evaluated for high-dimensional data. Results In this paper we use simulation studies and real gene-expression data sets to evaluate the performance of boosting algorithms when data are high-dimensional. Our results confirm that AdaBoost.M1 can perform poorly in this setting, often failing to improve the performance of its base classifier. We provide the explanation for this and propose a modification, AdaBoost.M1.ICV, which uses cross-validated estimates of the prediction errors and outperforms the original algorithm when data are high-dimensional. The use of AdaBoost.M1.ICV is advisable when the base classifier overfits the training data: the number of variables is large, the number of samples is small, and/or the difference between the classes is large. To a lesser extent, Gradient boosting also suffers from similar problems. Contrary to the findings for the low-dimensional data, shrinkage does not improve the performance of Gradient boosting when data are high-dimensional; however, it is beneficial for Stochastic Gradient boosting, which outperformed the other boosting algorithms in our analyses. LogitBoost suffers from overfitting and generally performs poorly. Conclusions The results show that boosting can substantially improve the performance of its base classifier also when data are high-dimensional. However, not all boosting algorithms perform equally well. LogitBoost, AdaBoost.M1 and Gradient boosting seem less useful for this type of data. Overall, Stochastic Gradient boosting with shrinkage and AdaBoost.M1.ICV seem to be the preferable choices for high-dimensional class-prediction. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0723-9) contains supplementary material, which is available to authorized users.
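The sequential weight adjustment described in this abstract can be illustrated with one reweighting step of standard AdaBoost.M1. This is a sketch under stated assumptions, not the paper's code: the function name `adaboost_m1_step` and the clipping constant `eps` are hypothetical, and the update is written as up-weighting misclassified samples by (1 - err) / err, which after normalization matches the usual formulation that down-weights correctly classified samples by err / (1 - err).

```python
from math import log

def adaboost_m1_step(weights, misclassified, eps=1e-10):
    """One AdaBoost.M1 reweighting step.

    weights        current sample weights, summing to 1
    misclassified  booleans, True where the base classifier was wrong
    Returns (alpha, new_weights), where alpha is the classifier's vote weight.
    """
    # Weighted error of the base classifier on the flagged samples.
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    err = min(max(err, eps), 1.0 - eps)  # clipping is an assumption to avoid log(0)
    factor = (1.0 - err) / err
    alpha = log(factor)
    # Misclassified samples gain weight; the rest keep theirs before normalizing.
    raw = [w * (factor if wrong else 1.0) for w, wrong in zip(weights, misclassified)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

# One of four equally weighted samples is misclassified.
alpha, new_weights = adaboost_m1_step([0.25] * 4, [True, False, False, False])
print(alpha, new_weights)

# A base classifier with zero training error leaves the weights unchanged.
_, unchanged = adaboost_m1_step([0.25] * 4, [False, False, False, False])
print(unchanged)
```

The second call hints at why an overfitting base classifier is a problem: with no training errors the weights never move, so later iterations see the same data. The sketch takes the misclassification flags as an input, so they could equally come from held-out, cross-validated predictions, which is the general idea the abstract attributes to AdaBoost.M1.ICV; the exact procedure used in the paper is not reproduced here.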
Affiliation(s)
- Rok Blagus
- Institute for Biostatistics and Medical Informatics, University of Ljubljana, Vrazov trg 2, Ljubljana, Slovenia.
- Lara Lusa
- Institute for Biostatistics and Medical Informatics, University of Ljubljana, Vrazov trg 2, Ljubljana, Slovenia.
|
312
|
Statistical Methods for Establishing Personalized Treatment Rules in Oncology. BIOMED RESEARCH INTERNATIONAL 2015; 2015:670691. [PMID: 26446492 PMCID: PMC4584067 DOI: 10.1155/2015/670691] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 11/25/2014] [Accepted: 02/09/2015] [Indexed: 12/23/2022]
Abstract
The process for using statistical inference to establish personalized treatment strategies requires specific techniques for data-analysis that optimize the combination of competing therapies with candidate genetic features and characteristics of the patient and disease. A wide variety of methods have been developed. However, heretofore the usefulness of these recent advances has not been fully recognized by the oncology community, and the scope of their applications has not been summarized. In this paper, we provide an overview of statistical methods for establishing optimal treatment rules for personalized medicine and discuss specific examples in various medical contexts with oncology as an emphasis. We also point the reader to statistical software for implementation of the methods when available.
|
313
|
Wang A, Sarwal MM. Computational Models for Transplant Biomarker Discovery. Front Immunol 2015; 6:458. [PMID: 26441963 PMCID: PMC4561798 DOI: 10.3389/fimmu.2015.00458] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 06/30/2015] [Accepted: 08/24/2015] [Indexed: 01/11/2023] Open
Abstract
Translational medicine offers a rich promise for improved diagnostics and drug discovery for biomedical research in the field of transplantation, where unmet diagnostic and therapeutic needs persist. The recent advent of genomics and proteomics profiling, collectively called "omics," provides new resources to develop novel biomarkers for clinical routine. Establishing such a marker system depends heavily on appropriate applications of computational algorithms and software, which are based on mathematical theories and models. Understanding these theories would help to apply appropriate algorithms and make biomarker systems successful. Here, we review the key advances in theories and mathematical models relevant to transplant biomarker development. Advantages and limitations inherent in these models are discussed. The principles of key computational approaches for efficiently selecting the best subset of biomarkers from high-dimensional omics data are highlighted. Prediction models are also introduced, and the integration of multi-microarray data is discussed. Appreciating these key advances would help to accelerate the development of clinically reliable biomarker systems.
Affiliation(s)
- Anyou Wang
- Department of Surgery, Division of MultiOrgan Transplantation, University of California San Francisco, San Francisco, CA, USA
- Minnie M. Sarwal
- Department of Surgery, Division of MultiOrgan Transplantation, University of California San Francisco, San Francisco, CA, USA
|
314
|
Martins M, Santos C, Costa L, Frizera A. Feature reduction and multi-classification of different assistive devices according to the gait pattern. Disabil Rehabil Assist Technol 2015; 11:202-18. [PMID: 26337072 DOI: 10.3109/17483107.2015.1079652] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022]
Abstract
Total knee arthroplasty (TKA) is a surgical procedure used in patients with osteoarthritis to improve their condition. An understanding of how gait patterns differ from patient to patient and are influenced by the prescribed assistive device (AD) is still missing. This article addresses that gap. Standard walker, crutches and rollator were tested. Symmetry indexes of spatiotemporal and postural control features were calculated. In order to select the important features that can discriminate among the ADs, different techniques for feature selection are investigated. Classification is handled by a multi-class Support Vector Machine. Results showed that the rollator provides a more symmetrical gait and crutches proved to be the worst. With respect to postural control parameters, the standard walker is the most stable and crutches are the worst AD. This means that, depending on the patient's problem and the recovery goal, different ADs should be used. After selecting a set of 16 important features through correlation, it was demonstrated that they provide important quantitative information about functional capacity, which is not represented by velocity, cadence and clinical scales. They were also capable of distinguishing the gait patterns influenced by each AD, showing that each patient has different needs during recovery. Implications for Rehabilitation An understanding of how gait patterns of post-surgical patients differ from person to person and how they are influenced by the type of device that is prescribed during their recovery might help in physical therapy. Research specifically addressing these issues is still missing. Inter-limb asymmetry and postural control features can be evaluated in an outpatient setting, supplying important additional information about individual gait pattern, which is not represented by gait velocity, cadence and the scales usually used. The features calculated in this study are able to provide complementary information to gait velocity, cadence and clinical scales to assess the functional capacity of patients who underwent TKA. The selected parameters make a new clinical tool useful for tracking the evolution of patients' recovery after TKA.
Affiliation(s)
- Anselmo Frizera
- Electrical Engineering Department, Federal University of Espirito Santo, Vitória, Brazil
|
315
|
Potential of DNA methylation in rectal cancer as diagnostic and prognostic biomarkers. Br J Cancer 2015; 113:1035-45. [PMID: 26335606 PMCID: PMC4651135 DOI: 10.1038/bjc.2015.303] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 03/28/2015] [Revised: 06/17/2015] [Accepted: 07/30/2015] [Indexed: 12/15/2022] Open
Abstract
Background: Aberrant DNA methylation is more prominent in proximal compared with distal colorectal cancers. Although a number of methylation markers were identified for colon cancer, few are available for rectal cancer. Methods: DNA methylation differences were assessed by a targeted DNA microarray for 360 marker candidates between 22 fresh frozen rectal tumour samples and 8 controls and validated by microfluidic high-throughput and methylation-sensitive qPCR in fresh frozen and formalin-fixed paraffin-embedded (FFPE) samples, respectively. The CpG island methylator phenotype (CIMP) was assessed by MethyLight in FFPE material from 78 patients with pT2 and pT3 rectal adenocarcinoma. Results: We identified and confirmed two novel three-gene signatures in fresh frozen samples that can distinguish tumours from adjacent tissue as well as from blood with a high sensitivity and specificity of up to 1 and an AUC of 1. In addition, methylation of individual CIMP markers was associated with specific clinical parameters such as tumour stage, therapy or patients' age. Methylation of CDKN2A was a negative prognostic factor for overall survival of patients. Conclusions: The newly defined methylation markers will be suitable for early disease detection and monitoring of rectal cancer.
|
316
|
McConkey DJ, Choi W, Shen Y, Lee IL, Porten S, Matin SF, Kamat AM, Corn P, Millikan RE, Dinney C, Czerniak B, Siefker-Radtke AO. A Prognostic Gene Expression Signature in the Molecular Classification of Chemotherapy-naïve Urothelial Cancer is Predictive of Clinical Outcomes from Neoadjuvant Chemotherapy: A Phase 2 Trial of Dose-dense Methotrexate, Vinblastine, Doxorubicin, and Cisplatin with Bevacizumab in Urothelial Cancer. Eur Urol 2015; 69:855-62. [PMID: 26343003 DOI: 10.1016/j.eururo.2015.08.034] [Citation(s) in RCA: 205] [Impact Index Per Article: 20.5] [Received: 03/27/2015] [Accepted: 08/19/2015] [Indexed: 10/23/2022]
Abstract
BACKGROUND Gene expression profiling (GEP) suggests there are three subtypes of muscle-invasive urothelial cancer (UC): basal, which has the worst prognosis; p53-like; and luminal. We hypothesized that GEP of transurethral resection (TUR) and cystectomy specimens would predict subtypes that could benefit from chemotherapy. OBJECTIVE To explore clinical outcomes for patients treated with dose-dense (DD) methotrexate, vinblastine, doxorubicin, and cisplatin (MVAC) and bevacizumab (B) and the impact of UC subtype. DESIGN, SETTING, AND PARTICIPANTS Sixty patients enrolled in a neoadjuvant trial of four cycles of DDMVAC + B between 2007 and 2010. TUR and cystectomy specimens for GEP were available from 38 and 23 patients, respectively, and from an additional confirmation cohort of 49 patients treated with perioperative MVAC. OUTCOME MEASUREMENTS AND STATISTICAL ANALYSIS Relationships with outcomes were analyzed using multivariable Cox regression and log-rank tests. RESULTS AND LIMITATIONS Chemotherapy was active, with pT0N0 and ≤pT1N0 downstaging rates of 38% and 53%, respectively, and 5-yr overall survival (OS) of 63%. Bevacizumab had no appreciable impact on outcomes. Basal tumors had improved survival compared to luminal and p53-like tumors (5-yr OS 91%, 73%, and 36%, log-rank p=0.015), with similar findings on multivariate analysis. Bone metastases within 2 yr were exclusively associated with the p53-like subtype (p53-like 100%, luminal 0%, basal 0%; p ≤ 0.001). Tumors enriched with the p53-like subtype at cystectomy suggested chemoresistance for this subtype. A separate cohort treated with perioperative MVAC confirmed the UC subtype survival benefit (5-yr OS 77% for basal, 56% for luminal, and 56% for p53-like; p=0.021). Limitations include the small number of pretreatment specimens with sufficient tissue for GEP. CONCLUSION GEP was predictive of clinical UC outcomes. 
The basal subtype was associated with better survival, and the p53-like subtype was associated with bone metastases and chemoresistant disease. PATIENT SUMMARY We can no longer think of urothelial cancer as a single disease. Gene expression profiling identifies subtypes of urothelial cancer that differ in their natural history and sensitivity to chemotherapy.
Affiliation(s)
- David J McConkey
- Department of Urology, U.T. M.D. Anderson Cancer Center, Houston, TX, USA; Department of Cancer Biology, U.T. M.D. Anderson Cancer Center, Houston, TX, USA
- Woonyoung Choi
- Department of Urology, U.T. M.D. Anderson Cancer Center, Houston, TX, USA
- Yu Shen
- Department of Statistics, U.T. M.D. Anderson Cancer Center, Houston, TX, USA
- I-Ling Lee
- Department of Urology, U.T. M.D. Anderson Cancer Center, Houston, TX, USA
- Sima Porten
- Department of Urology, U.T. M.D. Anderson Cancer Center, Houston, TX, USA
- Surena F Matin
- Department of Urology, U.T. M.D. Anderson Cancer Center, Houston, TX, USA
- Ashish M Kamat
- Department of Urology, U.T. M.D. Anderson Cancer Center, Houston, TX, USA
- Paul Corn
- Department of Genitourinary Medical Oncology, U.T. M.D. Anderson Cancer Center, Houston, TX, USA
- Colin Dinney
- Department of Urology, U.T. M.D. Anderson Cancer Center, Houston, TX, USA
- Bogdan Czerniak
- Department of Pathology, U.T. M.D. Anderson Cancer Center, Houston, TX, USA
- Arlene O Siefker-Radtke
- Department of Genitourinary Medical Oncology, U.T. M.D. Anderson Cancer Center, Houston, TX, USA.
|
317
|
|
318
|
Similarity-balanced discriminant neighbor embedding and its application to cancer classification based on gene expression data. Comput Biol Med 2015; 64:236-45. [DOI: 10.1016/j.compbiomed.2015.07.008] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 09/19/2014] [Revised: 07/08/2015] [Accepted: 07/10/2015] [Indexed: 11/21/2022]
|
319
|
Boulesteix AL, Hable R, Lauer S, Eugster MJA. A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies. AM STAT 2015. [DOI: 10.1080/00031305.2015.1005128] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 01/25/2023]
|
320
|
|
321
|
Evaluation of Penalized and Nonpenalized Methods for Disease Prediction with Large-Scale Genetic Data. BIOMED RESEARCH INTERNATIONAL 2015; 2015:605891. [PMID: 26346893 PMCID: PMC4539442 DOI: 10.1155/2015/605891] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/03/2014] [Revised: 01/16/2015] [Accepted: 01/16/2015] [Indexed: 01/31/2023]
Abstract
Owing to recent improvements in genotyping technology, large-scale genetic data can be utilized to identify disease susceptibility loci, and these findings have substantially improved our understanding of complex diseases. However, in spite of these successes, most of the genetic effects for many complex diseases were found to be very small, which has been a big hurdle to building disease prediction models. Recently, many statistical methods based on penalized regressions have been proposed to tackle the so-called “large P and small N” problem. Penalized regressions, including the least absolute shrinkage and selection operator (LASSO) and ridge regression, limit the space of parameters, and this constraint enables the estimation of effects for a very large number of SNPs. Various extensions have been suggested, and, in this report, we compare their accuracy by applying them to several complex diseases. Our results show that penalized regressions are usually robust and provide better accuracy than the existing methods, at least for the diseases under consideration.
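How a penalty "limits the space of parameters" can be seen in the textbook special case of an orthonormal design, where both estimators have closed forms in terms of the ordinary least squares coefficients. The sketch below assumes that special case and one common scaling of the penalty parameter `lam`; the function names `ridge_shrink` and `lasso_shrink` and the example coefficients are hypothetical, and this is not the procedure evaluated in the study.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge under an orthonormal design: every coefficient is scaled toward zero."""
    return [b / (1.0 + lam) for b in beta_ols]

def lasso_shrink(beta_ols, lam):
    """LASSO under an orthonormal design: soft-thresholding of each coefficient.

    Coefficients whose magnitude is at most lam become exactly zero.
    """
    shrunk = []
    for b in beta_ols:
        magnitude = max(abs(b) - lam, 0.0)
        shrunk.append(magnitude if b >= 0 else -magnitude)
    return shrunk

# Made-up least squares effects for three SNPs.
beta_ols = [2.0, -0.5, 0.1]
print(ridge_shrink(beta_ols, 1.0))
print(lasso_shrink(beta_ols, 0.3))
```

Ridge keeps every SNP in the model with a smaller effect, while the LASSO removes the weakest one altogether, which is why the two penalties tend to behave differently when most true effects are very small.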
|
322
|
Zhao J, Shi L, Zhu J. Two-Stage Regularized Linear Discriminant Analysis for 2-D Data. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2015; 26:1669-1681. [PMID: 25204000 DOI: 10.1109/tnnls.2014.2350993] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Fisher linear discriminant analysis (LDA) involves within-class and between-class covariance matrices. For 2-D data such as images, regularized LDA (RLDA) can improve LDA due to the regularized eigenvalues of the estimated within-class matrix. However, it fails to consider the eigenvectors and the estimated between-class matrix. To improve these two matrices simultaneously, we propose in this paper a new two-stage method for 2-D data, namely a bidirectional LDA (BLDA) in the first stage and the RLDA in the second stage, where both BLDA and RLDA are based on the Fisher criterion that tackles correlation. BLDA performs the LDA under special separable covariance constraints that incorporate the row and column correlations inherent in 2-D data. The main novelty is that we propose a simple but effective statistical test to determine the subspace dimensionality in the first stage. As a result, the first stage reduces the dimensionality substantially while keeping the significant discriminant information in the data. This enables the second stage to perform RLDA in a much lower dimensional subspace, and thus improves the two estimated matrices simultaneously. Experiments on a number of 2-D synthetic and real-world data sets show that BLDA+RLDA outperforms several closely related competitors.
|
323
|
Baker SG, Kramer BS. Evaluating surrogate endpoints, prognostic markers, and predictive markers: Some simple themes. Clin Trials 2015; 12:299-308. [PMID: 25385934 PMCID: PMC4451440 DOI: 10.1177/1740774514557725] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
BACKGROUND A surrogate endpoint is an endpoint observed earlier than the true endpoint (a health outcome) that is used to draw conclusions about the effect of treatment on the unobserved true endpoint. A prognostic marker is a marker for predicting the risk of an event given a control treatment; it informs treatment decisions when there is information on anticipated benefits and harms of a new treatment applied to persons at high risk. A predictive marker is a marker for predicting the effect of treatment on outcome in a subgroup of patients or study participants; it provides more rigorous information for treatment selection than a prognostic marker when it is based on estimated treatment effects in a randomized trial. METHODS We organized our discussion around a different theme for each topic. RESULTS "Fundamentally an extrapolation" refers to the non-statistical considerations and assumptions needed when using surrogate endpoints to evaluate a new treatment. "Decision analysis to the rescue" refers to the use of decision analysis to evaluate an additional prognostic marker because it is not possible to choose between purely statistical measures of marker performance. "The appeal of simplicity" refers to a straightforward and efficient use of a single randomized trial to evaluate overall treatment effect and treatment effect within subgroups using predictive markers. CONCLUSION The simple themes provide a general guideline for evaluation of surrogate endpoints, prognostic markers, and predictive markers.
Affiliation(s)
- Stuart G Baker
- Division of Cancer Prevention, National Cancer Institute, Bethesda MD, USA
| | - Barnett S Kramer
- Division of Cancer Prevention, National Cancer Institute, Bethesda MD, USA
| |
|
324
|
Yang TY. A Simple Rank Product Approach for Analyzing Two Classes. Bioinform Biol Insights 2015; 9:119-23. [PMID: 26244016 PMCID: PMC4507469 DOI: 10.4137/bbi.s26414] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2015] [Revised: 06/14/2015] [Accepted: 06/21/2015] [Indexed: 11/24/2022] Open
Abstract
The rank product statistic has been widely used to detect differentially expressed genes in replicated microarrays and a one-class setting. The objective of this article is to apply a rank product statistic to approximate the P-value of differential expression in a two-class setting, such as in normal and cancer cells. For this purpose, we introduce a simple statistic that compares the P-values of each class’s rank product statistic. Its null distribution is straightforwardly derived using the change-of-variable technique.
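As a point of reference for the abstract above, the classical one-class rank product of a gene is the geometric mean of its ranks across replicates. The following is a minimal Python sketch of that definition only; it does not reproduce the two-class statistic or the null distribution derived in the paper, and the function name `rank_products` and the toy fold-change values are illustrative assumptions.

```python
import math

def rank_products(replicates):
    """Return the rank product of each gene across replicates.

    `replicates` is a list of lists; replicates[i][g] is the (hypothetical)
    fold change of gene g in replicate i. Rank 1 is the largest value.
    Ties are not handled specially in this sketch.
    """
    n_genes = len(replicates[0])
    k = len(replicates)
    log_sum = [0.0] * n_genes
    for values in replicates:
        # Sort gene indices by decreasing value, then assign ranks 1..n.
        order = sorted(range(n_genes), key=lambda g: values[g], reverse=True)
        for rank, g in enumerate(order, start=1):
            log_sum[g] += math.log(rank)
    # Geometric mean of the ranks, computed in log space.
    return [math.exp(s / k) for s in log_sum]

# Toy data: 4 genes, 3 replicates (made-up numbers).
toy = [
    [2.5, 0.1, 1.2, 0.9],
    [3.1, 0.3, 0.8, 1.0],
    [2.0, 0.2, 1.5, 0.7],
]
rp = rank_products(toy)
```

A small rank product means the gene is consistently near the top of every replicate; in the toy data the first gene ranks first in all three replicates.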
Affiliation(s)
- Tae Young Yang
- Department of Mathematics, Myongji University, Yongin, Kyonggi, Korea
| |
|
325
|
Li Y, Si J, Zhou G, Huang S, Chen S. FREL: A Stable Feature Selection Algorithm. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2015; 26:1388-1402. [PMID: 25134091 DOI: 10.1109/tnnls.2014.2341627] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Two factors characterize a good feature selection algorithm: its accuracy and stability. This paper aims at introducing a new approach to stable feature selection algorithms. The innovation of this paper centers on a class of stable feature selection algorithms called feature weighting as regularized energy-based learning (FREL). Stability properties of FREL using L1 or L2 regularization are investigated. In addition, as a commonly adopted implementation strategy for enhanced stability, an ensemble FREL is proposed. A stability bound for the ensemble FREL is also presented. Our experiments using open-source real microarray data, which pose challenging high-dimensionality, small-sample-size problems, demonstrate that our proposed ensemble FREL is not only stable but also achieves accuracy better than or comparable to that of some other popular stable feature weighting methods.
|
326
|
Factors affecting the accuracy of a class prediction model in gene expression data. BMC Bioinformatics 2015; 16:199. [PMID: 26093633 PMCID: PMC4475623 DOI: 10.1186/s12859-015-0610-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2015] [Accepted: 04/30/2015] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND Class prediction models have been shown to have varying performances in clinical gene expression datasets. Previous evaluation studies, mostly done in the field of cancer, showed that the accuracy of class prediction models differs from dataset to dataset and depends on the type of classification function. While a substantial amount of information is known about the characteristics of classification functions, little has been done to determine which characteristics of gene expression data have an impact on the performance of a classifier. This study aims to empirically identify data characteristics that affect the predictive accuracy of classification models, outside of the field of cancer. RESULTS Datasets from twenty-five studies meeting predefined inclusion and exclusion criteria were downloaded. Nine classification functions were chosen, falling within the categories: discriminant analyses or Bayes classifiers, tree-based, regularization and shrinkage, and nearest neighbors methods. Consequently, nine class prediction models were built for each dataset using the same procedure and their performances were evaluated by calculating their accuracies. The characteristics of each experiment were recorded (i.e., observed disease, medical question, tissue/cell types and sample size), together with characteristics of the gene expression data, namely the number of differentially expressed genes, the fold changes and the within-class correlations. Their effects on the accuracy of a class prediction model were statistically assessed by random effects logistic regression. The number of differentially expressed genes and the average fold change had a significant impact on the accuracy of a classification model and gave individual explained variation in prediction accuracy of up to 72% and 57%, respectively. Multivariable random effects logistic regression with forward selection yielded the two aforementioned study factors and the within-class correlation as factors affecting the accuracy of classification functions, explaining 91.5% of the between-study variation. CONCLUSIONS We evaluated study- and data-related factors that might explain the varying performances of classification functions in non-cancerous datasets. Our results showed that the number of differentially expressed genes, the fold change, and the correlation in gene expression data significantly affect the accuracy of class prediction models.
|
327
|
DeSart K, O'Malley K, Schmit B, Lopez MC, Moldawer L, Baker H, Berceli S, Nelson P. Systemic inflammation as a predictor of clinical outcomes after lower extremity angioplasty/stenting. J Vasc Surg 2015; 64:766-778.e5. [PMID: 26054584 DOI: 10.1016/j.jvs.2015.04.399] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2014] [Accepted: 04/18/2015] [Indexed: 11/16/2022]
Abstract
OBJECTIVE The activation state of the systemic inflammatory milieu has been proposed as a critical regulator of vascular repair after injury. We evaluated the early inflammatory response after endovascular intervention for symptomatic peripheral arterial disease to determine its association with clinical success or failure. METHODS Blood samples were obtained from 14 patients undergoing lower extremity angioplasty/stenting and analyzed using high-throughput gene arrays, multiplex serum protein analyses, and flow cytometry. RESULTS Time-dependent plasma protein and monocyte phenotype analyses demonstrated endovascular revascularization had a modest influence on the overall activation state of the systemic inflammatory system, with baseline variability exceeding the perturbations induced by the intervention. In contrast, specific time-dependent changes in the monocyte genome are evident in the initial 28 days, predominately in those genes associated with leukocyte extravasation. Investigating the relationship between inflammation and the 1-year success or failure of the intervention showed no single plasma protein was correlated with outcome, but a more comprehensive cluster analysis revealed a clear pattern of protein expression that was closely related to the clinical phenotype. Corresponding examination of the monocyte genome identified a gene subset at 1 day postprocedure that was predictive of clinical outcome, with most of these genes active in cell-cycle signaling. CONCLUSIONS Although the global influence of angioplasty/stenting on systemic inflammation was modest, circulating cytokine and monocyte genome analyses support a pattern of early inflammation that is associated with ultimate intervention success vs failure. Molecular profiles incorporating genes involved in monocyte cell-cycle progression and homing, or proinflammatory cytokines, or both, offer the most promise for the development of class prediction tools for clinical application.
Affiliation(s)
- Kenneth DeSart
- Department of Surgery, University of Florida College of Medicine, Gainesville, Fla
| | - Kerri O'Malley
- Department of Surgery, University of Florida College of Medicine, Gainesville, Fla
| | - Bradley Schmit
- Department of Surgery, University of Florida College of Medicine, Gainesville, Fla
| | - Maria-Cecilia Lopez
- Department of Molecular Genetics and Microbiology, University of Florida College of Medicine, Gainesville, Fla
| | - Lyle Moldawer
- Department of Surgery, University of Florida College of Medicine, Gainesville, Fla
| | - Henry Baker
- Department of Molecular Genetics and Microbiology, University of Florida College of Medicine, Gainesville, Fla
| | - Scott Berceli
- Department of Surgery, University of Florida College of Medicine, Gainesville, Fla; Malcom Randall VA Medical Center, Gainesville, Fla
| | - Peter Nelson
- Division of Vascular and Endovascular Surgery, University of South Florida Morsani College of Medicine, Tampa, Fla; James A. Haley VA Medical Center, Tampa, Fla.
| |
|
328
|
Lu GF, Zou J, Wang Y. A New and Fast Implementation of Orthogonal LDA Algorithm and Its Incremental Extension. Neural Process Lett 2015. [DOI: 10.1007/s11063-015-9441-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
329
|
Xu Y, Akrotirianakis I, Chakraborty A. Proximal gradient method for huberized support vector machine. Pattern Anal Appl 2015. [DOI: 10.1007/s10044-015-0485-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
330
|
Basavanhally A, Viswanath S, Madabhushi A. Predicting classifier performance with limited training data: applications to computer-aided diagnosis in breast and prostate cancer. PLoS One 2015; 10:e0117900. [PMID: 25993029 PMCID: PMC4436385 DOI: 10.1371/journal.pone.0117900] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2014] [Accepted: 10/28/2014] [Indexed: 11/18/2022] Open
Abstract
Clinical trials increasingly employ medical imaging data in conjunction with supervised classifiers, where the latter require large amounts of training data to accurately model the system. Yet, a classifier selected at the start of the trial based on smaller and more accessible datasets may yield inaccurate and unstable classification performance. In this paper, we aim to address two common concerns in classifier selection for clinical trials: (1) predicting expected classifier performance for large datasets based on error rates calculated from smaller datasets and (2) the selection of appropriate classifiers based on expected performance for larger datasets. We present a framework for comparative evaluation of classifiers using only limited amounts of training data by using random repeated sampling (RRS) in conjunction with a cross-validation sampling strategy. Extrapolated error rates are subsequently validated via comparison with leave-one-out cross-validation performed on a larger dataset. The ability to predict error rates as dataset size increases is demonstrated on both synthetic data as well as three different computational imaging tasks: detecting cancerous image regions in prostate histopathology, differentiating high and low grade cancer in breast histopathology, and detecting cancerous metavoxels in prostate magnetic resonance spectroscopy. For each task, the relationships between 3 distinct classifiers (k-nearest neighbor, naive Bayes, Support Vector Machine) are explored. Further quantitative evaluation in terms of interquartile range (IQR) suggests that our approach consistently yields error rates with lower variability (mean IQRs of 0.0070, 0.0127, and 0.0140) than a traditional RRS approach (mean IQRs of 0.0297, 0.0779, and 0.305) that does not employ cross-validation sampling for all three datasets.
Affiliation(s)
- Ajay Basavanhally
- Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, USA
| | - Satish Viswanath
- Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, USA
| | - Anant Madabhushi
- Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, USA
| |
|
331
|
Ghandhi SA, Smilenov LB, Elliston CD, Chowdhury M, Amundson SA. Radiation dose-rate effects on gene expression for human biodosimetry. BMC Med Genomics 2015; 8:22. [PMID: 25963628 PMCID: PMC4472181 DOI: 10.1186/s12920-015-0097-x] [Citation(s) in RCA: 84] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2015] [Accepted: 05/01/2015] [Indexed: 12/24/2022] Open
Abstract
Background The effects of dose-rate and its implications on radiation biodosimetry methods are not well studied in the context of large-scale radiological scenarios. There are significant health risks to individuals exposed to an acute dose, but a realistic scenario would include exposure to both high and low dose-rates, from both external and internal radioactivity. It is therefore important to understand the biological response to prolonged exposure and, further, to discover biomarkers that can be used to estimate damage from low dose-rate exposures and propose appropriate clinical treatment. Methods We irradiated human whole blood ex vivo to three doses, 0.56 Gy, 2.23 Gy and 4.45 Gy, using two dose rates: acute, 1.03 Gy/min and a low dose-rate, 3.1 mGy/min. After 24 h, we isolated RNA from blood cells and these were hybridized to Agilent Whole Human genome microarrays. We validated the microarray results using qRT-PCR. Results Microarray results showed that there were 454 significantly differentially expressed genes after prolonged exposure to all doses. After acute exposure, 598 genes were differentially expressed in response to all doses. Gene ontology terms enriched in both sets of genes were related to immune processes and B-cell mediated immunity. Genes responding to acute exposure were also enriched in functions related to natural killer cell activation and cell-to-cell signaling. As expected, the p53 pathway was found to be significantly enriched at all doses and by both dose-rates of radiation. A support vector machine classifier was able to distinguish between dose-rates with 100% accuracy using leave-one-out cross-validation. Conclusions In this study we found that low dose-rate exposure can result in distinctive gene expression patterns compared with acute exposures. We were able to successfully distinguish low dose-rate exposed samples from acute dose exposed samples at 24 h, using a gene expression-based classifier. These genes are candidates for further testing as markers to classify exposure based on dose-rate.
Affiliation(s)
- Shanaz A Ghandhi
- Center for Radiological Research, Columbia University, VC11-237, 630 West 168th Street, New York, NY, 10032, USA.
| | - Lubomir B Smilenov
- Center for Radiological Research, Columbia University, VC11-237, 630 West 168th Street, New York, NY, 10032, USA.
| | - Carl D Elliston
- Center for Radiological Research, Columbia University, VC11-237, 630 West 168th Street, New York, NY, 10032, USA.
| | - Mashkura Chowdhury
- Center for Radiological Research, Columbia University, VC11-237, 630 West 168th Street, New York, NY, 10032, USA.
| | - Sally A Amundson
- Center for Radiological Research, Columbia University, VC11-237, 630 West 168th Street, New York, NY, 10032, USA.
| |
|
332
|
Geman D, Ochs M, Price ND, Tomasetti C, Younes L. An argument for mechanism-based statistical inference in cancer. Hum Genet 2015; 134:479-95. [PMID: 25381197 PMCID: PMC4612627 DOI: 10.1007/s00439-014-1501-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2014] [Accepted: 10/14/2014] [Indexed: 01/07/2023]
Abstract
Cancer is perhaps the prototypical systems disease, and as such has been the focus of extensive study in quantitative systems biology. However, translating these programs into personalized clinical care remains elusive and incomplete. In this perspective, we argue that realizing this agenda—in particular, predicting disease phenotypes, progression and treatment response for individuals—requires going well beyond standard computational and bioinformatics tools and algorithms. It entails designing global mathematical models over network-scale configurations of genomic states and molecular concentrations, and learning the model parameters from limited available samples of high-dimensional and integrative omics data. As such, any plausible design should accommodate: biological mechanism, necessary for both feasible learning and interpretable decision making; stochasticity, to deal with uncertainty and observed variation at many scales; and a capacity for statistical inference at the patient level. This program, which requires a close, sustained collaboration between mathematicians and biologists, is illustrated in several contexts, including learning biomarkers, metabolism, cell signaling, network inference and tumorigenesis.
Affiliation(s)
- Donald Geman
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21210, USA,
| |
|
333
|
Leong HS, Galletta L, Etemadmoghadam D, George J, Köbel M, Ramus SJ, Bowtell D. Efficient molecular subtype classification of high-grade serous ovarian cancer. J Pathol 2015; 236:272-7. [PMID: 25810134 DOI: 10.1002/path.4536] [Citation(s) in RCA: 73] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2015] [Revised: 03/17/2015] [Accepted: 03/18/2015] [Indexed: 12/21/2022]
Abstract
High-grade serous carcinomas (HGSCs) account for approximately 70% of all epithelial ovarian cancers diagnosed. Using microarray gene expression profiling, we previously identified four molecular subtypes of HGSC: C1 (mesenchymal), C2 (immunoreactive), C4 (differentiated), and C5 (proliferative), which correlate with patient survival and have distinct biological features. Here, we describe molecular classification of HGSC based on a limited number of genes to allow cost-effective and high-throughput subtype analysis. We determined a minimal signature for accurate classification, including 39 differentially expressed and nine control genes from microarray experiments. Taqman-based (low-density arrays and Fluidigm), fluorescent oligonucleotides (Nanostring), and targeted RNA sequencing (Illumina) assays were then compared for their ability to correctly classify fresh and formalin-fixed, paraffin-embedded samples. All platforms achieved > 90% classification accuracy with RNA from fresh frozen samples. The Illumina and Nanostring assays were superior with fixed material. We found that the C1, C2, and C4 molecular subtypes were largely consistent across multiple surgical deposits from individual chemo-naive patients. In contrast, we observed substantial subtype heterogeneity in patients whose primary ovarian sample was classified as C5. The development of an efficient molecular classifier of HGSC should enable further biological characterization of molecular subtypes and the development of targeted clinical trials.
Affiliation(s)
- Huei San Leong
- The Peter MacCallum Cancer Centre, East Melbourne, Victoria, Australia
| | - Laura Galletta
- The Peter MacCallum Cancer Centre, East Melbourne, Victoria, Australia
| | - Dariush Etemadmoghadam
- The Peter MacCallum Cancer Centre, East Melbourne, Victoria, Australia.,Department of Pathology, University of Melbourne, Melbourne, Victoria, Australia.,Sir Peter MacCallum Cancer Centre Department of Oncology, University of Melbourne, Melbourne, Victoria, Australia
| | - Joshy George
- The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut, USA
| | | | - Martin Köbel
- Department of Laboratory Medicine and Pathology, University of Calgary, Calgary, Alberta, Canada
| | - Susan J Ramus
- Department of Preventive Medicine, Keck School of Medicine, USC/Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, California, USA
| | - David Bowtell
- The Peter MacCallum Cancer Centre, East Melbourne, Victoria, Australia.,Department of Pathology, University of Melbourne, Melbourne, Victoria, Australia.,Sir Peter MacCallum Cancer Centre Department of Oncology, University of Melbourne, Melbourne, Victoria, Australia.,The Department of Biochemistry, University of Melbourne, Parkville, Victoria, Australia.,Hammersmith Hospital, Imperial College, London, UK
| |
|
334
|
Lu TP, Chen JJ. Identification of drug-induced toxicity biomarkers for treatment determination. Pharm Stat 2015; 14:284-93. [PMID: 25914330 DOI: 10.1002/pst.1684] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2014] [Revised: 11/18/2014] [Accepted: 03/30/2015] [Indexed: 12/28/2022]
Abstract
Drug-induced organ toxicity (DIOT) that leads to the removal of marketed drugs or termination of candidate drugs has been a leading concern for regulatory agencies and pharmaceutical companies. In safety studies, the genomic assays are conducted after the treatment so that drug-induced adverse effects can occur. Two types of biomarkers are observed: biomarkers of susceptibility and biomarkers of response. This paper presents a statistical model to distinguish the two types of biomarkers and procedures to identify susceptible subpopulations. The biomarkers identified are used to develop a classification model to identify the susceptible subpopulation. Two methods to identify susceptibility biomarkers were evaluated in terms of predictive performance in subpopulation identification, including sensitivity, specificity, and accuracy. Method 1 considered the traditional linear model with a variable-by-treatment interaction term, and Method 2 considered fitting a single predictor variable model using only treatment data. Monte Carlo simulation studies were conducted to evaluate the performance of the two methods and the impact of the subpopulation prevalence, probability of DIOT, and sample size on the predictive performance. Method 2 appeared to outperform Method 1, which was due to the lack of power for testing the interaction effect. Important statistical issues and challenges regarding identification of preclinical DIOT biomarkers were discussed. In summary, identification of predictive biomarkers for treatment determination depends heavily on the subpopulation prevalence. When the proportion of the susceptible subpopulation is 1% or less, a very large sample size is needed to ensure observing a sufficient number of DIOT responses for biomarker and/or subpopulation identification.
Affiliation(s)
- Tzu-Pin Lu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, Food and Drug Administration, Jefferson, AR, USA.,Department of Public Health Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
| | - James J Chen
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, Food and Drug Administration, Jefferson, AR, USA
| |
|
335
|
Gardeux V, Chelouah R, Wanderley MFB, Siarry P, Braga AP, Reyal F, Rouzier R, Pusztai L, Natowicz R. Computing molecular signatures as optima of a bi-objective function: method and application to prediction in oncogenomics. Cancer Inform 2015; 14:33-45. [PMID: 25983540 PMCID: PMC4426938 DOI: 10.4137/cin.s21111] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2014] [Revised: 12/14/2014] [Accepted: 12/17/2014] [Indexed: 11/05/2022] Open
Abstract
BACKGROUND Filter feature selection methods compute molecular signatures by selecting subsets of genes in the ranking of a valuation function. The motivations for the choice of valuation function are almost always clearly stated, but those for selecting the genes according to their ranking are hardly ever explicit. METHOD We addressed the computation of molecular signatures by searching the optima of a bi-objective function whose solution space was the set of all possible molecular signatures, i.e., the set of subsets of genes. The two objectives were the size of the signature (to be minimized) and the interclass distance induced by the signature (to be maximized). RESULTS We showed that: 1) the convex combination of the two objectives had exactly n optimal nonempty signatures, where n was the number of genes, 2) the n optimal signatures were nested, and 3) the optimal signature of size k was the subset of k top-ranked genes that contributed the most to the interclass distance. We applied our feature selection method on five public datasets in oncology, and assessed the prediction performances of the optimal signatures as input to the diagonal linear discriminant analysis (DLDA) classifier. They were at the same level or better than the best-reported ones. The predictions were robust, and the signatures were almost always significantly smaller. We studied in more detail the performances of our predictive modeling on two breast cancer datasets to predict the response to a preoperative chemotherapy: the performances were higher than the previously reported ones, the signatures were three times smaller (11 versus 30 gene signatures), and the member genes of the signature were known to be involved in the response to chemotherapy. CONCLUSIONS Defining molecular signatures as the optima of a bi-objective function that combined the signature size and the interclass distance was well founded and efficient for prediction in oncogenomics. The complexity of the computation was very low because the optimal signatures were the sets of genes in the ranking of their valuation. Software can be freely downloaded from http://gardeux-vincent.eu/DeltaRanking.php
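The nesting result stated in the abstract can be illustrated on a toy case. The sketch below assumes, hypothetically, that the interclass distance of a signature is the sum of nonnegative per-gene contributions, and it scalarizes the two objectives as a convex combination with weight `lam`. Under that assumption a brute-force search returns prefixes of the gene ranking. The scores and the names `contrib`, `objective`, and `best_signature` are illustrative and are not taken from the paper or its software.

```python
from itertools import combinations

# Hypothetical per-gene contributions to the interclass distance.
contrib = {"g1": 0.9, "g2": 0.4, "g3": 1.7, "g4": 0.2, "g5": 1.1}

def objective(signature, lam):
    """Convex combination: reward distance, penalize signature size."""
    distance = sum(contrib[g] for g in signature)
    return lam * distance - (1.0 - lam) * len(signature)

def best_signature(lam):
    """Brute-force search over all nonempty gene subsets."""
    genes = sorted(contrib)
    candidates = (
        frozenset(c)
        for k in range(1, len(genes) + 1)
        for c in combinations(genes, k)
    )
    return max(candidates, key=lambda s: objective(s, lam))

# Genes ordered by decreasing contribution.
ranking = sorted(contrib, key=contrib.get, reverse=True)

# Sweep the weight and collect the distinct optima in order of appearance.
optima = []
for lam in (0.30, 0.45, 0.50, 0.60, 0.75, 0.90):
    s = best_signature(lam)
    if s not in optima:
        optima.append(s)
```

With five genes the sweep yields five distinct optima of sizes 1 through 5, each contained in the next, which is why a single ranking pass can replace the exhaustive search in this additive setting.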
Affiliation(s)
- Vincent Gardeux
- EISTI engineering school, Department of Computer Science, Cergy, France. ; LISSI laboratory, University of Paris-Est, Créteil, France
| | - Rachid Chelouah
- EISTI engineering school, Department of Computer Science, Cergy, France
| | - Maria F Barbosa Wanderley
- Federal University of Minas Gerais, Laboratório de Inteligência Computacional, Belo Horizonte, Brazil
| | - Patrick Siarry
- LISSI laboratory, University of Paris-Est, Créteil, France
| | - Antônio P Braga
- Federal University of Minas Gerais, Laboratório de Inteligência Computacional, Belo Horizonte, Brazil
| | - Fabien Reyal
- Curie Institute, Department of Translational Research, Paris, France
| | - Roman Rouzier
- Curie Institute, Department of Surgery, Paris, France
| | - Lajos Pusztai
- Breast Medical Oncology, Yale School of Medicine, New Haven, CT, USA
| | - René Natowicz
- ESIEE-Paris, University of Paris-Est, Noisy-le-Grand, France
| |
|
336
|
Cacciatore S, Saccenti E, Piccioli M. Hypothesis: the sound of the individual metabolic phenotype? Acoustic detection of NMR experiments. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2015; 19:147-56. [PMID: 25748436 DOI: 10.1089/omi.2014.0131] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
Abstract
We present here an innovative hypothesis and report preliminary evidence that the sound of NMR signals could provide an alternative to the current representation of the individual metabolic fingerprint and supply equally significant information. The NMR spectra of the urine samples provided by four healthy donors were converted into audio signals that were analyzed in two audio experiments by listeners with both musical and non-musical training. The listeners were first asked to cluster the audio signals of two donors on the basis of perceived similarity and then to classify unknown samples after having listened to a set of reference signals. In the clustering experiment, the probability of obtaining the same results by pure chance was 7.04% and 0.05% for non-musicians and musicians, respectively. In the classification experiment, musicians scored 84% accuracy, which compared favorably with the 100% accuracy attained by sophisticated pattern recognition methods. The results were further validated and confirmed by analyzing the NMR metabolic profiles belonging to two other donors. These findings support our hypothesis that the uniqueness of the metabolic phenotype is preserved even when reproduced as an audio signal, and they warrant further consideration and testing in larger study samples.
Affiliation(s)
- Stefano Cacciatore
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts
|
337
|
Vasiliu D, Clamons S, McDonough M, Rabe B, Saha M. A regression-based differential expression detection algorithm for microarray studies with ultra-low sample size. PLoS One 2015; 10:e0118198. [PMID: 25738861 PMCID: PMC4349782 DOI: 10.1371/journal.pone.0118198] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 08/27/2014] [Accepted: 01/08/2015] [Indexed: 02/03/2023] Open
Abstract
Global gene expression analysis using microarrays and, more recently, RNA-seq, has allowed investigators to understand biological processes at a system level. However, the identification of differentially expressed genes in experiments with small sample size, high dimensionality, and high variance remains challenging, limiting the usability of these tens of thousands of publicly available, and possibly many more unpublished, gene expression datasets. We propose a novel variable selection algorithm for ultra-low-n microarray studies using generalized linear model-based variable selection with a penalized binomial regression algorithm called penalized Euclidean distance (PED). Our method uses PED to build a classifier on the experimental data to rank genes by importance. In place of cross-validation, which is required by most similar methods but not reliable for experiments with small sample size, we use a simulation-based approach to additively build a list of differentially expressed genes from the rank-ordered list. Our simulation-based approach maintains a low false discovery rate while maximizing the number of differentially expressed genes identified, a feature critical for downstream pathway analysis. We apply our method to microarray data from an experiment perturbing the Notch signaling pathway in Xenopus laevis embryos. This dataset was chosen because it showed very little differential expression according to limma, a powerful and widely-used method for microarray analysis. Our method was able to detect a significant number of differentially expressed genes in this dataset and suggest future directions for investigation. Our method is easily adaptable for analysis of data from RNA-seq and other global expression experiments with low sample size and high dimensionality.
Affiliation(s)
- Daniel Vasiliu
- Department of Mathematics, College of William and Mary, Williamsburg, Virginia, United States of America
- Samuel Clamons
- Department of Biology, College of William and Mary, Williamsburg, Virginia, United States of America
- Molly McDonough
- Department of Biology, College of William and Mary, Williamsburg, Virginia, United States of America
- Brian Rabe
- Department of Biology, College of William and Mary, Williamsburg, Virginia, United States of America
- Margaret Saha
- Department of Biology, College of William and Mary, Williamsburg, Virginia, United States of America
|
338
|
|
339
|
Touloumis A. Nonparametric Stein-type shrinkage covariance matrix estimators in high-dimensional settings. Comput Stat Data Anal 2015. [DOI: 10.1016/j.csda.2014.10.018] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Indexed: 10/24/2022]
|
340
|
de Souto MCP, Jaskowiak PA, Costa IG. Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinformatics 2015; 16:64. [PMID: 25888091 PMCID: PMC4350881 DOI: 10.1186/s12859-015-0494-3] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Received: 09/24/2014] [Accepted: 02/09/2015] [Indexed: 12/20/2022] Open
Abstract
Background Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. Results and conclusions We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/.
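The simple strategies this abstract mentions, replacing missing values by the mean or the median, can be sketched as follows. This is a minimal illustration under stated assumptions: rows are samples, columns are genes, missing values are encoded as NaN, and the statistic is taken per gene over the observed samples. The function name `impute_columns` and the toy matrix are hypothetical and are not taken from the study.

```python
import numpy as np

def impute_columns(X, strategy="mean"):
    """Fill NaN entries with the per-column mean or median of observed values.

    Assumption: rows are samples and columns are genes, so each gene is
    imputed from its own observed expression values.
    """
    X = np.array(X, dtype=float)  # np.array copies, so the input is not modified
    if strategy == "mean":
        fill = np.nanmean(X, axis=0)
    elif strategy == "median":
        fill = np.nanmedian(X, axis=0)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill[cols]
    return X

# Toy expression matrix with two missing entries.
X = np.array([
    [1.0, 2.0],
    [3.0, np.nan],
    [np.nan, 4.0],
    [11.0, 12.0],
])

mean_filled = impute_columns(X, "mean")
median_filled = impute_columns(X, "median")
```

On skewed columns such as these, the two strategies give different fill values, which is one reason evaluations like the one described compare them on downstream clustering and classification rather than on imputation error alone.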
Affiliation(s)
- Pablo A Jaskowiak
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos - SP, Brazil.
- Ivan G Costa
- Center of Informatics, Federal University of Pernambuco, Recife - PE, Brazil; IZKF Computational Biology Research Group, Institute for Biomedical Engineering, RWTH Aachen University Medical School, Aachen, Germany.
|
341
|
Huang HL, Wu YC, Su LJ, Huang YJ, Charoenkwan P, Chen WL, Lee HC, Chu WCC, Ho SY. Discovery of prognostic biomarkers for predicting lung cancer metastasis using microarray and survival data. BMC Bioinformatics 2015; 16:54. [PMID: 25881029 PMCID: PMC4349617 DOI: 10.1186/s12859-015-0463-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 06/10/2014] [Accepted: 01/13/2015] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND Few studies have investigated prognostic biomarkers of distant metastases of lung cancer. One of the central difficulties in identifying biomarkers from microarray data is the availability of only a small number of samples, which results in overtraining. Recently obtained evidence reveals that epithelial-mesenchymal transition (EMT) of tumor cells causes metastasis, which is detrimental to patients' survival. RESULTS This work proposes a novel optimization approach to discovering EMT-related prognostic biomarkers to predict the distant metastasis of lung cancer using both microarray and survival data. The weighted objective function maximizes both the accuracy of prediction of distant metastasis and the area between the disease-free survival curves of the non-distant and distant metastases. Seventy-eight patients with lung cancer and a follow-up time of 120 months are used to identify a set of gene markers, and an independent cohort of 26 patients is used to evaluate the identified biomarkers. The medical records of the 78 patients show a significant difference between the disease-free survival times of the 37 non-distant- and the 41 distant-metastasis patients. The experimental results are as follows. 1) The use of disease-free survival curves can compensate for the shortcoming of insufficient samples and greatly increase the test accuracy by 11.10%; and 2) the support vector machine with a set of 17 transcripts, such as CCL16 and CDKN2AIP, can yield a leave-one-out cross-validation accuracy of 93.59%, a test accuracy of 76.92%, a large disease-free survival area of 74.81%, and a mean survival prediction error of 3.99 months. The identified putative biomarkers are examined using related studies and signaling pathways to reveal the potential effectiveness of the biomarkers in prospective confirmatory studies. CONCLUSIONS The proposed new optimization approach to identifying prognostic biomarkers by combining multiple sources of data (microarray and survival) can facilitate the accurate selection of biomarkers that are most relevant to the disease while solving the problem of insufficient samples.
Affiliation(s)
- Hui-Ling Huang
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan; Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan.
- Yu-Chung Wu
- Division of Thoracic Surgery, Department of Surgery, Taipei Veterans General Hospital, Taipei, Taiwan.
- Li-Jen Su
- Institute of Systems Biology and Bioinformatics, National Central University, Taoyuan, Taiwan.
- Yun-Ju Huang
- Institute of Molecular Medicine and Bioengineering, National Chiao Tung University, Hsinchu, Taiwan.
- Phasit Charoenkwan
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan.
- Wen-Liang Chen
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan.
- Hua-Chin Lee
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan; Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan.
- Shinn-Ying Ho
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan; Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan.
|
342
|
Partovi Nia V, Davison AC. A simple model-based approach to variable selection in classification and clustering. CAN J STAT 2015. [DOI: 10.1002/cjs.11241] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Affiliation(s)
- Vahid Partovi Nia
- GERAD Research Center and Department of Mathematical and Industrial Engineering; Polytechnique Montréal; 2900 Edouard-Montpetit Montréal Canada J3T 1J4
- Anthony C. Davison
- École Polytechnique Fédérale de Lausanne; EPFL-FSB-MATHAA-STAT; Station 8 1015 Lausanne Switzerland
|
343
|
Sun J, Zhao H. The application of sparse estimation of covariance matrix to quadratic discriminant analysis. BMC Bioinformatics 2015; 16:48. [PMID: 25886892 PMCID: PMC4355996 DOI: 10.1186/s12859-014-0443-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 09/30/2014] [Accepted: 12/03/2014] [Indexed: 11/26/2022] Open
Abstract
Background Although Linear Discriminant Analysis (LDA) is commonly used for classification, it may not be directly applied in genomics studies due to the large p, small n problem in these studies. Different versions of sparse LDA have been proposed to address this significant challenge. One implicit assumption of various LDA-based methods is that the covariance matrices are the same across different classes. However, rewiring of genetic networks (therefore different covariance matrices) across different diseases has been observed in many genomics studies, which suggests that LDA and its variations may be suboptimal for disease classifications. However, it is not clear whether considering differing genetic networks across diseases can improve classification in genomics studies. Results We propose a sparse version of Quadratic Discriminant Analysis (SQDA) to explicitly consider the differences of the genetic networks across diseases. Both simulation and real data analysis are performed to compare the performance of SQDA with six commonly used classification methods. Conclusions SQDA provides more accurate classification results than other methods for both simulated and real data. Our method should prove useful for classification in genomics studies and other research settings, where covariances differ among classes.
Affiliation(s)
- Jiehuan Sun
- Department of Biostatistics, Yale School of Public Health, 60 College Street, New Haven, 06511, CT, USA.
- Hongyu Zhao
- Department of Biostatistics, Yale School of Public Health, 60 College Street, New Haven, 06511, CT, USA.
|
344
|
A Projection Pursuit framework for supervised dimension reduction of high dimensional small sample datasets. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2014.07.057] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 11/21/2022]
|
345
|
Stein’s method in high dimensional classification and applications. Comput Stat Data Anal 2015. [DOI: 10.1016/j.csda.2014.08.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
|
346
|
Tian X, Wang X, Chen J. Network-constrained group lasso for high-dimensional multinomial classification with application to cancer subtype prediction. Cancer Inform 2015; 13:25-33. [PMID: 25635165 PMCID: PMC4295837 DOI: 10.4137/cin.s17686] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/01/2014] [Revised: 11/26/2014] [Accepted: 11/27/2014] [Indexed: 01/08/2023] Open
Abstract
The classic multinomial logit model, commonly used in multiclass regression problems, is restricted to few predictors and does not take into account the relationships among variables. It has limited use for genomic data, where the number of genomic features far exceeds the sample size. Genomic features such as gene expressions are usually related by an underlying biological network. Efficient use of the network information is important to improve classification performance as well as biological interpretability. We proposed a multinomial logit model that is capable of addressing both the high dimensionality of predictors and the underlying network information. Group lasso was used to induce model sparsity, and a network constraint was imposed to induce the smoothness of the coefficients with respect to the underlying network structure. To deal with the non-smoothness of the objective function in optimization, we developed a proximal gradient algorithm for efficient computation. The proposed model was compared to models with no prior structure information in both simulations and a problem of cancer subtype prediction with real TCGA (The Cancer Genome Atlas) gene expression data. The network-constrained model outperformed the traditional ones in both cases.
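The non-smooth part of a group lasso objective is typically handled inside a proximal gradient method by a group soft-thresholding step. The sketch below shows only that standard proximal operator for a penalty of the form lam * ||v||_2 on one coefficient group; it is not the authors' algorithm, it omits the multinomial likelihood and the network constraint, and the names `group_soft_threshold` and `lam` are hypothetical.

```python
import numpy as np

def group_soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_2 for a single coefficient group.

    The group is shrunk toward zero as a block: it is set exactly to zero
    when its Euclidean norm is at most lam, and otherwise rescaled by
    (1 - lam / ||v||_2).
    """
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm) * v

# A group with norm 5 is shrunk by a factor 0.8 when lam = 1,
# and removed entirely when lam exceeds its norm.
shrunk = group_soft_threshold([3.0, 4.0], lam=1.0)
zeroed = group_soft_threshold([3.0, 4.0], lam=10.0)
```

Zeroing a whole group at once is what lets a group penalty drop a feature across all classes together, which is the sparsity behaviour the abstract attributes to the group lasso.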
Affiliation(s)
- Xinyu Tian
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA
- Xuefeng Wang
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA; Department of Preventive Medicine, Stony Brook University, Stony Brook, NY, USA
- Jun Chen
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA
|
347
|
Park JS, Choi SB, Chung JW, Kim SW, Kim DW. Classification of serous ovarian tumors based on microarray data using multicategory support vector machines. Annu Int Conf IEEE Eng Med Biol Soc 2015; 2014:3430-3. [PMID: 25570728 DOI: 10.1109/embc.2014.6944360] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
Ovarian cancer, the most fatal of reproductive cancers, is the fifth leading cause of death in women in the United States. Serous borderline ovarian tumors (SBOTs) are considered to be earlier or less malignant forms of serous ovarian carcinomas (SOCs). SBOTs are asymptomatic and progression to advanced stages is common. Using DNA microarray technology, we designed multicategory classification models to discriminate ovarian cancer subclasses. To develop multicategory classification models with optimal parameters and features, we systematically evaluated three machine learning algorithms and three feature selection methods using five-fold cross validation and a grid search. The study included 22 subjects with normal ovarian surface epithelial cells, 12 with SBOTs, and 79 with SOCs according to microarray data with 54,675 probe sets obtained from the National Center for Biotechnology Information gene expression omnibus repository. Application of the optimal model of support vector machines one-versus-rest with signal-to-noise as a feature selection method gave an accuracy of 97.3%, relative classifier information of 0.916, and a kappa index of 0.941. In addition, 5 features, including the expression of putative biomarkers SNTN and AOX1, were selected to differentiate between normal, SBOT, and SOC groups. An accurate diagnosis of ovarian tumor subclasses by application of multicategory machine learning would be cost-effective and simple to perform, and would ensure more effective subclass-targeted therapy.
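The kappa index reported in this abstract is a chance-corrected agreement measure. The sketch below computes Cohen's kappa from a confusion matrix, assuming that this is the kappa variant meant; the abstract does not say so explicitly, and the function name `cohen_kappa` and the toy matrix are hypothetical rather than the study's data.

```python
import numpy as np

def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the marginals.
    """
    c = np.asarray(confusion, dtype=float)
    n = c.sum()
    p_o = np.trace(c) / n
    p_e = (c.sum(axis=1) * c.sum(axis=0)).sum() / n ** 2
    return (p_o - p_e) / (1.0 - p_e)

# Toy two-class example: 70% raw accuracy, 50% expected by chance.
toy = [[20, 5],
       [10, 15]]
kappa = cohen_kappa(toy)
```

With unbalanced classes such as the 22 normal, 12 SBOT, and 79 SOC samples described, a chance-corrected measure is a useful complement to raw accuracy, since always predicting the majority class already yields a high accuracy.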
|
348
|
Gene Expression Data Classification Using Support Vector Machine and Mutual Information-based Gene Selection. Procedia Computer Science 2015. [DOI: 10.1016/j.procs.2015.03.178] [Citation(s) in RCA: 114] [Impact Index Per Article: 11.4] [Indexed: 11/19/2022]
|
349
|
Chakraborty R, Pal NR. Feature selection using a neural framework with controlled redundancy. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2015; 26:35-50. [PMID: 25532154 DOI: 10.1109/tnnls.2014.2308902] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 06/04/2023]
Abstract
We first present a feature selection method based on a multilayer perceptron (MLP) neural network, called feature selection MLP (FSMLP). We explain how FSMLP can select essential features and discard derogatory and indifferent features. Such a method may pick up some useful but dependent (say, correlated) features, not all of which may be needed. We then propose a general scheme for dealing with feature selection with "controlled redundancy" (CoR). The proposed scheme, named FSMLP-CoR, can select features with a controlled redundancy for both classification and function approximation/prediction type problems. We have also proposed a new, more effective training scheme named mFSMLP-CoR. The idea is general in nature and can also be used with other learning schemes. We demonstrate the effectiveness of the algorithms using several data sets, including a synthetic data set. We also show that the selected features are adequate to solve the problem at hand. Here, we have considered a measure of linear dependency to control the redundancy. The use of nonlinear measures of dependency, such as mutual information, is straightforward. The proposed schemes have several advantages. They do not require explicit evaluation of the feature subsets. Feature selection is integrated into the design of the decision-making system; hence, it can look at all features together and pick up whatever is necessary. Our methods can account for possible subtle nonlinear interactions between features, as well as those between features, tools, and the problem being solved. They can also control the level of redundancy in the selected features. Of the two learning schemes, mFSMLP-CoR not only improves the performance of the system, but also significantly reduces the dependency of the network's behavior on the initialization of connection weights.
|
350
|
Bayesian variable selection in multinomial probit model for classifying high-dimensional data. Comput Stat 2014. [DOI: 10.1007/s00180-014-0540-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/26/2022]
|