601
|
Chen KY, Chen LS, Chen MC, Lee CL. Using SVM based method for equipment fault detection in a thermal power plant. COMPUT IND 2011. [DOI: 10.1016/j.compind.2010.05.013] [Citation(s) in RCA: 100] [Impact Index Per Article: 7.1] [Indexed: 12/20/2022]
|
602
|
Yu G, Li H, Ha S, Shih IM, Clarke R, Hoffman EP, Madhavan S, Xuan J, Wang Y. PUGSVM: a caBIG™ analytical tool for multiclass gene selection and predictive classification. Bioinformatics 2010; 27:736-8. [PMID: 21186245 DOI: 10.1093/bioinformatics/btq721] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022]
Abstract
Phenotypic Up-regulated Gene Support Vector Machine (PUGSVM) is a cancer Biomedical Informatics Grid (caBIG™) analytical tool for multiclass gene selection and classification. PUGSVM addresses the problems of imbalanced class separability, small sample size and high gene space dimensionality: multiclass gene markers are defined by the union of one-versus-everyone phenotypic upregulated genes and are used by a well-matched one-versus-rest support vector machine. PUGSVM provides a simple yet more accurate strategy to identify statistically reproducible mechanistic marker genes for characterization of heterogeneous diseases. Availability: http://www.cbil.ece.vt.edu/caBIG-PUGSVM.htm
Affiliation(s)
- Guoqiang Yu
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
|
603
|
Abstract
Recent studies suggest that the deregulation of pathways, rather than individual genes, may be critical in triggering carcinogenesis. The pathway deregulation is often caused by the simultaneous deregulation of more than one gene in the pathway. This suggests that robust gene pair combinations may exploit the underlying bio-molecular reactions that are relevant to the pathway deregulation and thus they could provide better biomarkers for cancer, as compared to individual genes. In order to validate this hypothesis, in this paper, we used gene pair combinations, called doublets, as input to the cancer classification algorithms, instead of the original expression values, and we showed that the classification accuracy was consistently improved across different datasets and classification algorithms. We validated the proposed approach using nine cancer datasets and five classification algorithms including Prediction Analysis for Microarrays (PAM), C4.5 Decision Trees (DT), Naive Bayesian (NB), Support Vector Machine (SVM), and k-Nearest Neighbor (k-NN).
|
604
|
|
605
|
Stingo FC, Vannucci M. Variable selection for discriminant analysis with Markov random field priors for the analysis of microarray data. Bioinformatics 2010; 27:495-501. [PMID: 21159623 DOI: 10.1093/bioinformatics/btq690] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 01/13/2023] Open
Abstract
MOTIVATION Discriminant analysis is an effective tool for the classification of experimental units into groups. Here, we consider the typical problem of classifying subjects according to phenotypes via gene expression data and propose a method that incorporates variable selection into the inferential procedure, for the identification of the important biomarkers. To achieve this goal, we build upon a conjugate normal discriminant model, both linear and quadratic, and include a stochastic search variable selection procedure via an MCMC algorithm. Furthermore, we incorporate into the model prior information on the relationships among the genes as described by a gene-gene network. We use a Markov random field (MRF) prior to map the network connections among genes. Our prior model assumes that neighboring genes in the network are more likely to have a joint effect on the relevant biological processes. RESULTS We use simulated data to assess performances of our method. In particular, we compare the MRF prior to a situation where independent Bernoulli priors are chosen for the individual predictors. We also illustrate the method on benchmark datasets for gene expression. Our simulation studies show that employing the MRF prior improves on selection accuracy. In real data applications, in addition to identifying markers and improving prediction accuracy, we show how the integration of existing biological knowledge into the prior model results in an increased ability to identify genes with strong discriminatory power and also aids the interpretation of the results.
|
606
|
Mischak H, Allmaier G, Apweiler R, Attwood T, Baumann M, Benigni A, Bennett SE, Bischoff R, Bongcam-Rudloff E, Capasso G, Coon JJ, D'Haese P, Dominiczak AF, Dakna M, Dihazi H, Ehrich JH, Fernandez-Llama P, Fliser D, Frokiaer J, Garin J, Girolami M, Hancock WS, Haubitz M, Hochstrasser D, Holman RR, Ioannidis JPA, Jankowski J, Julian BA, Klein JB, Kolch W, Luider T, Massy Z, Mattes WB, Molina F, Monsarrat B, Novak J, Peter K, Rossing P, Sánchez-Carbayo M, Schanstra JP, Semmes OJ, Spasovski G, Theodorescu D, Thongboonkerd V, Vanholder R, Veenstra TD, Weissinger E, Yamamoto T, Vlahou A. Recommendations for biomarker identification and qualification in clinical proteomics. Sci Transl Med 2010; 2:46ps42. [PMID: 20739680 DOI: 10.1126/scitranslmed.3001249] [Citation(s) in RCA: 249] [Impact Index Per Article: 16.6] [Indexed: 12/14/2022]
Abstract
Clinical proteomics has yielded some early positive results (the identification of potential disease biomarkers), indicating the promise of this analytical approach to improve the current state of the art in clinical practice. However, the inability to verify some candidate molecules in subsequent studies has led to skepticism among many clinicians and regulatory bodies, and it has become evident that commonly encountered shortcomings in fundamental aspects of experimental design, mainly during biomarker discovery, must be addressed in order to provide robust data. In this Perspective, we assert that successful studies generally use suitable statistical approaches for biomarker definition and confirm results in independent test sets; in addition, we describe a brief set of practical and feasible recommendations that we have developed for investigators to properly identify and qualify proteomic biomarkers, which could also be used as reporting requirements. Such recommendations should help put proteomic biomarker discovery on the solid ground needed for turning the old promise into a new reality.
|
607
|
Wu TT, Lange K. Multicategory vertex discriminant analysis for high-dimensional data. Ann Appl Stat 2010. [DOI: 10.1214/10-aoas345] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 11/19/2022]
|
608
|
Wang L, Chu F. Extracting very simple diagnostic rules from microarray data. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2010; 2010:807-10. [PMID: 21096115 DOI: 10.1109/iembs.2010.5626565] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
We present an approach to deriving very simple classification rules from microarray data by first selecting very small gene subsets that can ensure highly accurate classification of cancers. Finding such minimum gene subsets can greatly reduce the computational load and "noise" arising from irrelevant genes. The derived simple classification rules allow for accurate diagnosis without the need for any classifiers. This work can simplify gene expression tests by including only a very small number of genes rather than thousands or tens of thousands of genes, which can significantly bring down the cost of cancer testing. These studies also call for further investigation into possible biological relationships between this small number of genes and cancer development and treatment. For example, we report the following simple, and yet 100% accurate, diagnostic rules involving only 2 genes to separate the 3 types of lymphoma patients: the patient has diffuse large B-cell lymphoma (DLBCL), if and only if the expression level of gene GENE1622X is greater than -0.75; the patient has chronic lymphocytic leukaemia (CLL), if and only if the expression level of gene GENE540X is less than -1; and the patient has follicular lymphoma (FL) otherwise, i.e., if and only if the expression level of gene GENE1622X is less than -0.75 and the expression level of gene GENE540X is greater than -1.
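The two-gene rule quoted in this abstract can be written as a few comparisons. The sketch below is only an illustration: the function name `classify_lymphoma` is hypothetical, the order in which the two conditions are checked is an assumption (the abstract does not say what happens if both hold), and values exactly on a threshold fall through to FL here because the abstract leaves that case unspecified.

```python
def classify_lymphoma(gene1622x: float, gene540x: float) -> str:
    """Apply the two-gene rule quoted in the abstract.

    Assumption: DLBCL is checked before CLL. The abstract states both
    rules as "if and only if", so on its data they never conflict.
    """
    if gene1622x > -0.75:
        return "DLBCL"  # diffuse large B-cell lymphoma
    if gene540x < -1.0:
        return "CLL"    # chronic lymphocytic leukaemia
    return "FL"         # follicular lymphoma (the "otherwise" case)


if __name__ == "__main__":
    # Made-up expression values, for illustration only.
    for sample in [(0.3, 0.5), (-1.2, -1.8), (-1.2, 0.4)]:
        print(sample, classify_lymphoma(*sample))
```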
Affiliation(s)
- Lipo Wang
- School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798.
|
609
|
Pang H, Ebisu K, Watanabe E, Sue LY, Tong T. Analysing breast cancer microarrays from African Americans using shrinkage-based discriminant analysis. Hum Genomics 2010; 5:5-16. [PMID: 21106486 PMCID: PMC3042882 DOI: 10.1186/1479-7364-5-1-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022] Open
Abstract
Breast cancer tumours among African Americans are usually more aggressive than those found in Caucasian populations. African-American patients with breast cancer also have higher mortality rates than Caucasian women. A better understanding of the disease aetiology of these breast cancers can help to improve and develop new methods for cancer prevention, diagnosis and treatment. The main goal of this project was to identify genes that help differentiate between oestrogen receptor-positive and -negative samples among a small group of African-American patients with breast cancer. Breast cancer microarrays from one of the largest genomic consortiums were analysed using 13 African-American and 201 Caucasian samples with oestrogen receptor status. We used a shrinkage-based classification method to identify genes that were informative in discriminating between oestrogen receptor-positive and -negative samples. Subset analysis and permutation were performed to obtain a set of genes unique to the African-American population. We identified a set of 156 probe sets, which gave a misclassification rate of 0.16 in distinguishing between oestrogen receptor-positive and -negative patients. The biological relevance of our findings was explored through literature-mining techniques and pathway mapping. An independent dataset was used to validate our findings and we found that the top ten genes mapped onto this dataset gave a misclassification rate of 0.15. The described method allows us best to utilise the information available from small sample size microarray data in the context of ethnic minorities.
Affiliation(s)
- Herbert Pang
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27710, USA
|
610
|
Abstract
For medical classification problems, it is often desirable to have a probability associated with each class. Probabilistic classifiers have received relatively little attention for small-n, large-p classification problems despite their importance in medical decision making. In this paper, we introduce 2 criteria for assessment of probabilistic classifiers, well-calibratedness and refinement, and develop corresponding evaluation measures. We evaluated several published high-dimensional probabilistic classifiers and developed 2 extensions of the Bayesian compound covariate classifier. Based on simulation studies and analysis of gene expression microarray data, we found that proper probabilistic classification is more difficult than deterministic classification. It is important to ensure that a probabilistic classifier is well calibrated or at least not "anticonservative" using the methods developed here. We provide this evaluation for several probabilistic classifiers and also evaluate their refinement as a function of sample size under weak and strong signal conditions. We also present a cross-validation method for evaluating the calibration and refinement of any probabilistic classifier on any data set.
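To make the idea of calibration concrete, the sketch below groups predicted probabilities into bins and compares the mean prediction in each bin with the observed event frequency. This is a generic binned check, not the evaluation measure developed in the paper; the function name `calibration_table` and the equal-width binning are assumptions made for illustration.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return (mean predicted prob, observed frequency, count) per non-empty bin.

    probs    -- predicted probabilities of the positive class, each in [0, 1]
    outcomes -- observed 0/1 labels, same length as probs
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        # Equal-width bins; p == 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))

    table = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(o for _, o in members) / len(members)
            table.append((mean_p, freq, len(members)))
    return table


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    probs = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
    outcomes = [0, 0, 1, 0, 1, 1]
    for mean_p, freq, n in calibration_table(probs, outcomes, n_bins=3):
        print(f"mean predicted={mean_p:.2f} observed={freq:.2f} n={n}")
```

For a well-calibrated classifier the first two columns are close in every bin. Predictions that are more extreme than the observed frequencies would correspond to what the abstract calls "anticonservative".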
Affiliation(s)
- Kyung In Kim
- Biometric Research Branch, National Cancer Institute, 9000 Rockville Pike, MSC 7434, Bethesda, MD 20892-7434, USA
|
611
|
Huh S, Lee D. Linear discriminant analysis for signatures. IEEE TRANSACTIONS ON NEURAL NETWORKS 2010; 21:1990-6. [PMID: 21075720 DOI: 10.1109/tnn.2010.2090047] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/07/2022]
Abstract
We propose signature linear discriminant analysis (signature-LDA) as an extension of LDA that can be applied to signatures, which are known to be more informative representations of local image features than vector representations, such as visual word histograms. Based on earth mover's distances between signatures, signature-LDA does not require vectorization of local image features in contrast to LDA, which is one of the main limitations of classical LDA. Therefore, signature-LDA minimizes the loss of intrinsic information of local image features while selecting more discriminating features using label information. Empirical evidence on texture databases shows that signature-LDA improves upon state-of-the-art approaches for texture image classification and outperforms other feature selection methods for local image features.
Affiliation(s)
- Seungil Huh
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA.
|
612
|
Yang X, Lee Y, Fan H, Sun X, Lussier YA. Identification of common microRNA-mRNA regulatory biomodules in human epithelial cancers. Chinese Science Bulletin 2010; 55:3576-3589. [PMID: 21340045 DOI: 10.1007/s11434-010-4051-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 01/07/2023]
Abstract
The complex regulatory network between microRNAs and gene expression remains an unclear domain of active research. We proposed to address this complex regulation in part with a novel approach for the genome-wide identification of biomodules derived from paired microRNA and mRNA profiles, which could reveal correlations associated with a complex network of de-regulation in human cancer. Two published expression datasets for 68 samples with 11 distinct types of epithelial cancers and 21 samples of normal tissues were used, containing microRNA expression (Lu et al. Nature Letters 2005) and gene expression (Ramaswamy et al. PNAS 2001) profiles, respectively. As a result, the microRNA expression used jointly with mRNA expression can provide better classifiers of epithelial cancers against normal epithelial tissue than either dataset alone (p = 1×10^-10, F-test). We identified a combination of six microRNA-mRNA biomodules that optimally classified epithelial cancers from normal epithelial tissue (total accuracy = 93.3%; 95% confidence interval: 86% - 97%), using a penalized logistic regression (PLR) algorithm and three-fold cross-validation. Three of these biomodules are individually sufficient to cluster epithelial cancers from normal tissue using mutual information distance. The biomodules contain 10 distinct microRNAs and 98 distinct genes, including well known tumor markers such as miR-15a, miR-30e, IRAK1, TGFBR2, DUSP16, CDC25B and PDCD2. In addition, there is a significant enrichment (Fisher's exact test, p = 3×10^-10) between putative microRNA-target gene pairs reported in five microRNA target databases and the inversely correlated microRNA-mRNA pairs in the biomodules. Further, microRNAs and genes in the biomodules were found in abstracts mentioning epithelial cancers (Fisher's exact test, unadjusted p<0.05).
Taken together, these results strongly suggest that the discovered microRNA-mRNA biomodules correspond to regulatory mechanisms common to human epithelial cancer samples. In conclusion, we developed and evaluated a novel comprehensive method to systematically identify, on a genome scale, microRNA-mRNA expression biomodules common to distinct cancers of the same tissue. These biomodules also comprise novel microRNA and genes as well as an imputed regulatory network, which may accelerate the work of cancer biologists as large regulatory maps of cancers can be drawn efficiently for hypothesis generation.
Affiliation(s)
- Xinan Yang
- State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096,China
|
613
|
Yuan X, Jonker MJ, de Wilde J, Verhoef A, Wittink FR, van Benthem J, Bessems JG, Hakkert BC, Kuiper RV, van Steeg H, Breit TM, Luijten M. Finding maximal transcriptome differences between reprotoxic and non-reprotoxic phthalate responses in rat testis. J Appl Toxicol 2010; 31:421-30. [DOI: 10.1002/jat.1601] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/22/2010] [Revised: 08/30/2010] [Accepted: 08/31/2010] [Indexed: 11/07/2022]
|
614
|
k-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction. THE PHARMACOGENOMICS JOURNAL 2010; 10:292-309. [PMID: 20676068 PMCID: PMC2920072 DOI: 10.1038/tpj.2010.56] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.3] [Indexed: 02/07/2023]
Abstract
In the clinical application of genomic data analysis and modeling, a number of factors contribute to the performance of disease classification and clinical outcome prediction. This study focuses on the k-nearest neighbor (KNN) modeling strategy and its clinical use. Although KNN is simple and clinically appealing, large performance variations were found among experienced data analysis teams in the MicroArray Quality Control Phase II (MAQC-II) project. For clinical end points and controls from breast cancer, neuroblastoma and multiple myeloma, we systematically generated 463 320 KNN models by varying feature ranking method, number of features, distance metric, number of neighbors, vote weighting and decision threshold. We identified factors that contribute to the MAQC-II project performance variation, and validated a KNN data analysis protocol using a newly generated clinical data set with 478 neuroblastoma patients. We interpreted the biological and practical significance of the derived KNN models, and compared their performance with existing clinical factors.
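Several of the factors varied in this study (number of neighbors, vote weighting, decision threshold) map directly onto parameters of a KNN classifier. The sketch below is a minimal two-class illustration with Euclidean distance, not the MAQC-II protocol; the names `knn_score` and `knn_predict` and the inverse-distance weighting scheme are assumptions.

```python
import math


def knn_score(x, train_X, train_y, k=3, weighted=False):
    """Fraction of the (optionally distance-weighted) vote cast for class 1."""
    # Sort training samples by Euclidean distance to x and keep the k nearest.
    nearest = sorted(
        (math.dist(x, t), label) for t, label in zip(train_X, train_y)
    )[:k]
    # Inverse-distance weights (assumed scheme); the small constant avoids
    # division by zero when x coincides with a training sample.
    weights = [1.0 / (d + 1e-9) if weighted else 1.0 for d, _ in nearest]
    positive = sum(w for w, (_, label) in zip(weights, nearest) if label == 1)
    return positive / sum(weights)


def knn_predict(x, train_X, train_y, k=3, weighted=False, threshold=0.5):
    """Predict class 1 when the vote share reaches the decision threshold."""
    return int(knn_score(x, train_X, train_y, k, weighted) >= threshold)


if __name__ == "__main__":
    # Made-up two-feature training data, for illustration only.
    train_X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
    train_y = [0, 0, 1, 1]
    print(knn_predict([0.1, 0.2], train_X, train_y, k=3))
    print(knn_predict([5.0, 5.5], train_X, train_y, k=3, threshold=0.8))
```

Raising `threshold` trades sensitivity for specificity, which is one reason the same KNN family can yield very different performance across analysis teams.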
|
615
|
|
616
|
|
617
|
Schell MJ. Identifying Key Statistical Papers From 1985 to 2002 Using Citation Data for Applied Biostatisticians. AM STAT 2010. [DOI: 10.1198/tast.2010.08250] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/21/2022]
|
618
|
Karyagyna AS, Vassiliev MO, Ershova AS, Nurtdinov RN, Lossev IS. Probe-Level Universal Search (PLUS) algorithm for gender differentiation in affymetrix datasets. J Bioinform Comput Biol 2010; 8:553-77. [PMID: 20556862 DOI: 10.1142/s0219720010004823] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/15/2009] [Revised: 01/22/2010] [Accepted: 02/12/2010] [Indexed: 11/18/2022]
Abstract
Affymetrix microarrays measure gene expression based on the intensity of hybridization of a panel of oligonucleotide probes (probe set) with mRNA. The signals from all probes within a probe set are converted into a single measure that represents the expression value of a gene. This step diminishes the number of independently measured parameters and eliminates from consideration individual "good-working" probes. We propose a new feature selection algorithm (Probe Level Universal Search or PLUS algorithm) for probe-level analysis of gene expression datasets. The algorithm evaluates the intensities of perfect-match Affymetrix probes individually and selects probes that allow one to distinguish two given classes of samples. The algorithm was used to differentiate the samples according to their gender ("gender differentiation"). The universal gender differentiating set of 3' Gene Affymetrix microarray probes was selected; the set consists of 38 probes from XIST gene of X-chromosome and 17 probes from five Y-chromosome genes: RPS4Y1, EIF1A, DDX3Y, JARID1D and USP9Y. The selection procedure based on the probes selected by PLUS algorithm differentiates the sex chromosome karyotype of the sample, reveals samples with incorrect gender labels and samples from patients with hereditary syndromes or cancer-associated chromosome abnormalities.
Affiliation(s)
- Anna S Karyagyna
- NF Gamaleya Research Institute of Epidemiology and Microbiology, Russian Academy of Medical Sciences, Institute of Agricultural Biotechnology, Moscow, Russia.
|
619
|
Blagus R, Lusa L. Class prediction for high-dimensional class-imbalanced data. BMC Bioinformatics 2010; 11:523. [PMID: 20961420 PMCID: PMC3098087 DOI: 10.1186/1471-2105-11-523] [Citation(s) in RCA: 99] [Impact Index Per Article: 6.6] [Received: 05/06/2010] [Accepted: 10/20/2010] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance. RESULTS Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers. 
CONCLUSIONS Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when dealing with high-dimensional data. Researchers using class-imbalanced data should be careful in assessing the predictive accuracy of the classifiers and, unless the class imbalance is mild, they should always use an appropriate method for dealing with the class imbalance problem.
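Down-sizing, one of the strategies the abstract reports as working well under moderate imbalance, is commonly implemented as random under-sampling of the larger classes. The sketch below assumes that reading; the function name `downsize` and the fixed seed are illustrative choices, not details taken from the paper.

```python
import random
from collections import defaultdict


def downsize(samples, labels, seed=0):
    """Randomly discard samples so every class matches the smallest class size."""
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    n_min = min(len(members) for members in by_class.values())

    balanced_x, balanced_y = [], []
    for y, members in by_class.items():
        # Sample without replacement so no sample is duplicated.
        for x in rng.sample(members, n_min):
            balanced_x.append(x)
            balanced_y.append(y)
    return balanced_x, balanced_y


if __name__ == "__main__":
    # Made-up data: 8 majority-class and 2 minority-class samples.
    X = [[float(i)] for i in range(10)]
    y = [0] * 8 + [1] * 2
    bx, by = downsize(X, y)
    print(by.count(0), by.count(1))
```

In line with the abstract's warning about variable selection, any gene selection would be run on the down-sized training set rather than on the full imbalanced data.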
Affiliation(s)
- Rok Blagus
- Institute for Biostatistics and Medical Informatics, University of Ljubljana, Vrazov trg 2, Ljubljana, Slovenia
|
620
|
Lauss M, Frigyesi A, Ryden T, Höglund M. Robust assignment of cancer subtypes from expression data using a uni-variate gene expression average as classifier. BMC Cancer 2010; 10:532. [PMID: 20925936 PMCID: PMC2966465 DOI: 10.1186/1471-2407-10-532] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 05/10/2010] [Accepted: 10/06/2010] [Indexed: 11/10/2022] Open
Abstract
Background Genome-wide gene expression data is a rich source for the identification of gene signatures suitable for clinical purposes and a number of statistical algorithms have been described for both identification and evaluation of such signatures. Some employed algorithms are fairly complex and hence sensitive to over-fitting whereas others are simpler and more straightforward. Here we present a new type of simple algorithm based on ROC analysis and the use of metagenes that we believe will be a good complement to existing algorithms. Results The basis for the proposed approach is the use of metagenes, instead of collections of individual genes, and a feature selection using AUC values obtained by ROC analysis. Each gene in a data set is assigned an AUC value relative to the tumor class under investigation and the genes are ranked according to these values. Metagenes are then formed by calculating the mean expression level for an increasing number of ranked genes, and the metagene expression value that optimally discriminates tumor classes in the training set is used for classification of new samples. The performance of the metagene is then evaluated using LOOCV and balanced accuracies. Conclusions We show that the simple uni-variate gene expression average algorithm performs as well as several alternative algorithms such as discriminant analysis and the more complex approaches such as SVM and neural networks. The R package rocc is freely available at http://cran.r-project.org/web/packages/rocc/index.html.
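The steps described in the Results (rank genes by AUC, average the top-ranked genes into a metagene, pick the cut-off that best separates the training classes) can be sketched as follows. This is a simplified illustration and not the `rocc` package: all function names are hypothetical, only genes that are higher in the positive class are handled, and the LOOCV evaluation step is omitted.

```python
def auc(scores, labels):
    """AUC as P(positive score > negative score); ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(pred, labels):
    """Mean of sensitivity and specificity for 0/1 predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(pred, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(pred, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def fit_metagene(X, y):
    """Return (balanced accuracy, gene indices, threshold) of the best metagene.

    X -- list of samples, each a list of gene expression values
    y -- 0/1 class labels (1 = tumor class under investigation)
    """
    n_genes = len(X[0])
    # Rank genes by AUC, best first.
    ranked = sorted(
        range(n_genes),
        key=lambda g: auc([row[g] for row in X], y),
        reverse=True,
    )
    best = None
    for k in range(1, n_genes + 1):
        genes = ranked[:k]
        # Metagene = mean expression of the top-k genes in each sample.
        meta = [sum(row[g] for g in genes) / k for row in X]
        # Candidate thresholds are the observed metagene values.
        for t in sorted(set(meta)):
            pred = [int(m >= t) for m in meta]
            acc = balanced_accuracy(pred, y)
            if best is None or acc > best[0]:
                best = (acc, genes, t)
    return best


def predict_metagene(row, genes, threshold):
    """Classify a new sample by thresholding its metagene value."""
    return int(sum(row[g] for g in genes) / len(genes) >= threshold)


if __name__ == "__main__":
    # Made-up data: gene 0 separates the classes, gene 1 does not.
    X = [[1.0, 5.0], [1.2, 3.0], [3.0, 4.0], [3.5, 2.0]]
    y = [0, 0, 1, 1]
    acc, genes, t = fit_metagene(X, y)
    print(acc, genes, t, predict_metagene([3.2, 0.0], genes, t))
```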
Affiliation(s)
- Martin Lauss
- Department of Oncology, Clinical Sciences, Lund University and Lund University Hospital, LUND, Sweden
|
621
|
Cheng Q. A sparse learning machine for high-dimensional data with application to microarray gene analysis. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2010; 7:636-646. [PMID: 21030731 DOI: 10.1109/tcbb.2009.8] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 05/30/2023]
Abstract
Extracting features from high-dimensional data is a critically important task for pattern recognition and machine learning applications. High-dimensional data typically have many more variables than observations, and contain significant noise, missing components, or outliers. Features extracted from high-dimensional data need to be discriminative, sparse, and able to capture essential characteristics of the data. In this paper, we present a way to construct multivariate features and then classify the data into proper classes. The resulting small subset of features is nearly the best in the sense of Greenshtein's persistence; however, the estimated feature weights may be biased. We take a systematic approach to correcting the biases. We use conjugate gradient-based primal-dual interior-point techniques for large-scale problems. We apply our procedure to microarray gene analysis. The effectiveness of our method is confirmed by experimental results.
Affiliation(s)
- Qiang Cheng
- Computer Science Department, Southern Illinois University Carbondale, Carbondale, IL 62901, USA.
|
622
|
Takahashi H, Morioka R, Ito R, Oshima T, Altaf-Ul-Amin M, Ogasawara N, Kanaya S. Dynamics of time-lagged gene-to-metabolite networks of Escherichia coli elucidated by integrative omics approach. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2010; 15:15-23. [PMID: 20863252 DOI: 10.1089/omi.2010.0074] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 01/24/2023]
Abstract
In the postgenomics era, integrative analysis of several "omics" data is absolutely required for understanding the cell as a system. Integrative analysis of transcriptomics and metabolomics can lead to elucidation of gene-to-metabolite networks. When integrating different time series "omics" data, it is necessary to take into consideration a time lag between those data. In the present study, we conducted an integrative analysis of time series transcriptomics and metabolomics data of Escherichia coli generated by cDNA microarray and Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR/MS), respectively. We identified a 60-min time lag between transition points of transcriptomics and metabolomics data by using a Linear Dynamical System. Furthermore, we investigated gene-to-metabolite correlations in the context of time lag, obtained the maximum number of correlated pairs when transcripts led metabolites by 60 min, and finally revealed gene-to-metabolite relations in the phospholipid biosynthesis pathway. Taking into consideration the time lag between transcriptomics and metabolomics data in time series analysis could unravel novel gene-to-metabolite relations. According to gene-to-metabolite correlations, phosphatidylglycerol plays a more critical role for membrane balance than phosphatidylethanolamine in E. coli.
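The notion of correlating a transcript with a metabolite "in the context of time lag" can be illustrated by shifting one series against the other before computing a correlation. The sketch below uses a plain Pearson correlation and a search over integer lags; it does not reproduce the Linear Dynamical System used in the study, and the function names and the lag unit (sampling steps) are assumptions.

```python
import math
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length series (assumes non-zero variance)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def lagged_correlation(transcript, metabolite, lag):
    """Correlate transcript[t] with metabolite[t + lag], lag >= 0 in sampling steps."""
    if lag == 0:
        return pearson(transcript, metabolite)
    return pearson(transcript[:-lag], metabolite[lag:])


def best_lag(transcript, metabolite, max_lag):
    """Lag (0..max_lag) at which the transcript-metabolite correlation is highest."""
    return max(
        range(max_lag + 1),
        key=lambda lag: lagged_correlation(transcript, metabolite, lag),
    )


if __name__ == "__main__":
    # Made-up series: the metabolite follows the transcript two steps later.
    transcript = [1, 3, 2, 5, 4, 7, 6, 9]
    metabolite = [0, 0, 1, 3, 2, 5, 4, 7]
    print(best_lag(transcript, metabolite, max_lag=3))
```

With a fixed sampling interval, the lag in steps converts to minutes by multiplying by that interval.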
Affiliation(s)
- Hiroki Takahashi
- Department of Bioinformatics and Genomics, Graduate School of Information Science, Nara Institute of Science and Technology, Nara, Japan
|
623
|
Li B, Zheng CH, Huang DS, Zhang L, Han K. Gene expression data classification using locally linear discriminant embedding. Comput Biol Med 2010; 40:802-10. [PMID: 20864095 DOI: 10.1016/j.compbiomed.2010.08.003] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.9] [Received: 11/17/2009] [Revised: 06/12/2010] [Accepted: 08/17/2010] [Indexed: 11/16/2022]
Abstract
Gene expression data collected from DNA microarrays are characterized by a large number of variables (genes) but only a small number of observations (experiments). In this paper, a manifold learning method is proposed to map the gene expression data to a low dimensional space, and then to explore the intrinsic structure of the features so as to classify the microarray data more accurately. The proposed algorithm can project the gene expression data into a subspace with high intra-class compactness and inter-class separability. Experimental results on six DNA microarray datasets demonstrated that our method is efficient for discriminant feature extraction and gene expression data classification. This work is a meaningful attempt to analyze microarray data using a manifold learning method; there should be much room for the application of manifold learning to bioinformatics due to its performance.
Affiliation(s)
- Bo Li
- Intelligent Computing Laboratory, Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China
|
624
|
Baker SG. Simple and flexible classification of gene expression microarrays via Swirls and Ripples. BMC Bioinformatics 2010; 11:452. [PMID: 20825641 PMCID: PMC2949887 DOI: 10.1186/1471-2105-11-452] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 06/10/2010] [Accepted: 09/08/2010] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND A simple classification rule with few genes and parameters is desirable when applying a classification rule to new data. One popular simple classification rule, diagonal discriminant analysis, yields linear or curved classification boundaries, called Ripples, that are optimal when gene expression levels are normally distributed with the appropriate variance, but may yield poor classification in other situations. RESULTS A simple modification of diagonal discriminant analysis yields smooth, highly nonlinear classification boundaries, called Swirls, that sometimes outperform Ripples. In particular, if the data are normally distributed with different variances in each class, Swirls substantially outperforms Ripples when using a pooled variance to reduce the number of parameters. The proposed classification rule for two classes selects either Swirls or Ripples after parsimoniously selecting the number of genes and distance measures. Applications to five cancer microarray data sets identified predictive genes related to the tissue organization theory of carcinogenesis. CONCLUSION The parsimonious selection of classifiers coupled with the selection of either Swirls or Ripples provides a good basis for formulating a simple, yet flexible, classification rule. Open source software is available for download.
Affiliation(s)
- Stuart G Baker
- Biometry Research Group, Division of Cancer Prevention, National Cancer Institute, EPN 3131, 6130 Executive Blvd MSC 7354, Bethesda, MD 20892-7354, USA.
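Diagonal discriminant analysis, the baseline rule mentioned in the abstract above, can be sketched in a few lines. The version below is a minimal illustration that assumes a pooled per-gene variance and equal class priors, which gives linear boundaries; the function names `fit_dlda` and `predict_dlda` are hypothetical. The Swirls modification is not specified in the abstract and is not attempted here.

```python
def fit_dlda(samples, labels):
    """Estimate per-class gene means and a pooled per-gene variance.

    samples[i][j] is the expression of gene j in sample i.
    """
    classes = sorted(set(labels))
    n_genes = len(samples[0])
    dof = len(samples) - len(classes)
    if dof <= 0:
        raise ValueError("need more samples than classes")
    means = {}
    for c in classes:
        rows = [s for s, lab in zip(samples, labels) if lab == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(n_genes)]
    pooled = []
    for j in range(n_genes):
        ss = sum((s[j] - means[lab][j]) ** 2 for s, lab in zip(samples, labels))
        # Guard against zero variance so the distance below stays finite.
        pooled.append(ss / dof if ss > 0 else 1e-12)
    return means, pooled


def predict_dlda(model, x):
    """Assign x to the class with the smallest variance-scaled squared distance."""
    means, pooled = model

    def distance(c):
        return sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, means[c], pooled))

    return min(means, key=distance)
```

Because the covariance matrix is treated as diagonal, each gene contributes independently to the distance, which keeps the number of parameters small when genes far outnumber samples.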
|
625
|
Ranganathan Y, Borges RM. Reducing the babel in plant volatile communication: using the forest to see the trees. Plant Biol (Stuttg) 2010; 12:735-42. [PMID: 20701696 DOI: 10.1111/j.1438-8677.2009.00278.x] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.3] [Indexed: 05/24/2023]
Abstract
While plants of a single species emit a diversity of volatile organic compounds (VOCs) to attract or repel interacting organisms, these specific messages may be lost in the midst of the hundreds of VOCs produced by sympatric plants of different species, many of which may have no signal content. Receivers must be able to reduce the babel or noise in these VOCs in order to correctly identify the message. For chemical ecologists faced with vast amounts of data on volatile signatures of plants in different ecological contexts, it is imperative to employ accurate methods of classifying messages, so that suitable bioassays may then be designed to understand message content. We demonstrate the utility of 'Random Forests' (RF), a machine-learning algorithm, for the task of classifying volatile signatures and choosing the minimum set of volatiles for accurate discrimination, using data from sympatric Ficus species as a case study. We demonstrate the advantages of RF over conventional classification methods such as principal component analysis (PCA), as well as data-mining algorithms such as support vector machines (SVM), diagonal linear discriminant analysis (DLDA) and k-nearest neighbour (KNN) analysis. We show why a tree-building method such as RF, which is increasingly being used by the bioinformatics, food technology and medical community, is particularly advantageous for the study of plant communication using volatiles, dealing, as it must, with abundant noise.
Affiliation(s)
- Y Ranganathan
- Centre for Ecological Sciences, Indian Institute of Science, Bangalore, India
|
626
|
Hernández-Lobato D, Hernández-Lobato JM, Suárez A. Expectation Propagation for microarray data classification. Pattern Recognit Lett 2010. [DOI: 10.1016/j.patrec.2010.05.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/28/2022]
|
627
|
Asymptotic properties of the EPMC for modified linear discriminant analysis when sample size and dimension are both large. J Stat Plan Inference 2010. [DOI: 10.1016/j.jspi.2010.03.038] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/24/2022]
|
628
|
McNicholas PD, Murphy TB. Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 2010; 26:2705-12. [PMID: 20802251 DOI: 10.1093/bioinformatics/btq498] [Citation(s) in RCA: 146] [Impact Index Per Article: 9.7] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION In recent years, work has been carried out on clustering gene expression microarray data. Some approaches are developed from an algorithmic viewpoint whereas others are developed via the application of mixture models. In this article, a family of eight mixture models which utilizes the factor analysis covariance structure is extended to 12 models and applied to gene expression microarray data. This modelling approach builds on previous work by introducing a modified factor analysis covariance structure, leading to a family of 12 mixture models, including parsimonious models. This family of models allows for the modelling of the correlation between gene expression levels even when the number of samples is small. Parameter estimation is carried out using a variant of the expectation-maximization algorithm and model selection is achieved using the Bayesian information criterion. This expanded family of Gaussian mixture models, known as the expanded parsimonious Gaussian mixture model (EPGMM) family, is then applied to two well-known gene expression data sets. RESULTS The performance of the EPGMM family of models is quantified using the adjusted Rand index. This family of models gives very good performance, relative to existing popular clustering techniques, when applied to real gene expression microarray data. AVAILABILITY The reduced, preprocessed data that were analysed are available at www.paulmcnicholas.info
Affiliation(s)
- Paul D McNicholas
- Department of Mathematics & Statistics, University of Guelph, Guelph, Ontario, Canada.
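The abstract above quantifies clustering performance with the adjusted Rand index. As a small illustration, the index can be computed from two label vectors using the standard pair-counting formula; the function name `adjusted_rand_index` is hypothetical, and this sketch is not taken from the authors' software.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Pairs of items placed together in both partitions.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put every item in one cluster.
        return 1.0
    return (together - expected) / (maximum - expected)
```

A value of 1 indicates identical partitions up to relabelling, and values near 0 indicate agreement no better than chance.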
|
629
|
Campain A, Yang YH. Comparison study of microarray meta-analysis methods. BMC Bioinformatics 2010; 11:408. [PMID: 20678237 PMCID: PMC2922198 DOI: 10.1186/1471-2105-11-408] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.7] [Received: 04/13/2010] [Accepted: 08/03/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Meta-analysis methods exist for combining multiple microarray datasets. However, there is a wide range of issues associated with microarray meta-analysis and a limited ability to compare the performance of different meta-analysis methods. RESULTS We compare eight meta-analysis methods: five existing methods, two naive methods and a novel approach (mDEDS). Comparisons are performed using simulated data and two biological case studies with varying degrees of meta-analysis complexity. The performance of meta-analysis methods is assessed via ROC curves and prediction accuracy where applicable. CONCLUSIONS Existing meta-analysis methods vary in their ability to perform successful meta-analysis. This success is very dependent on the complexity of the data and type of analysis. Our proposed method, mDEDS, performs competitively as a meta-analysis tool even as complexity increases. Because of the varying abilities of compared meta-analysis methods, care should be taken when considering the meta-analysis method used for particular research.
Affiliation(s)
- Anna Campain
- School of Mathematics and Statistics, Center of Mathematical Biology, University of Sydney, F07 Sydney, NSW 2006, Australia.
|
630
|
Lian H. Sparse Bayesian hierarchical modeling of high-dimensional clustering problems. J Multivar Anal 2010. [DOI: 10.1016/j.jmva.2010.03.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
631
|
Yang P, Zhang Z, Zhou BB, Zomaya AY. A clustering based hybrid system for biomarker selection and sample classification of mass spectrometry data. Neurocomputing 2010. [DOI: 10.1016/j.neucom.2010.02.022] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022]
|
633
|
Zhang D, Liu L. Multiple Comparisons in Microarray Data Analysis. Stat Biopharm Res 2010. [DOI: 10.1198/sbr.2009.08086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
|
634
|
The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol 2010; 28:827-38. [PMID: 20676074 DOI: 10.1038/nbt.1665] [Citation(s) in RCA: 604] [Impact Index Per Article: 40.3] [Received: 03/02/2010] [Accepted: 06/30/2010] [Indexed: 11/09/2022]
Abstract
Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.
|
636
|
Young Park S, Liu Y, Liu D, Scholl P. Multicategory composite least squares classifiers. Stat Anal Data Min 2010; 3:272-286. [DOI: 10.1002/sam.10081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
637
|
Finding biomarker signatures in pooled sample designs: a simulation framework for methodological comparisons. Adv Bioinformatics 2010:318573. [PMID: 20671968 PMCID: PMC2909718 DOI: 10.1155/2010/318573] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 11/14/2009] [Accepted: 05/05/2010] [Indexed: 11/18/2022] Open
Abstract
Detection of discriminating patterns in gene expression data can be accomplished by using various methods of statistical learning. It has been proposed that sample pooling in this context would have negative effects; however, pooling cannot always be avoided. We propose a simulation framework to explicitly investigate the parameters of patterns, experimental design, noise, and choice of method in order to find out which effects on classification performance are to be expected. We use a two-group classification task and simulated gene expression data with independent differentially expressed genes as well as bivariate linear patterns and the combination of both. Our results show a clear increase of prediction error with pool size. For pooled training sets, powered partial least squares discriminant analysis outperforms discriminant analysis, random forests, and support vector machines with linear or radial kernel in two of three simulated scenarios. The proposed simulation approach can be implemented to systematically investigate a number of additional scenarios of practical interest.
|
638
|
Huang J, Fang H, Fan X. Decision forest for classification of gene expression data. Comput Biol Med 2010; 40:698-704. [PMID: 20591424 DOI: 10.1016/j.compbiomed.2010.06.004] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Received: 10/06/2009] [Revised: 05/24/2010] [Accepted: 06/12/2010] [Indexed: 11/19/2022]
Abstract
This study proposes an improved decision forest (IDF) with an integrated graphical user interface. Based on four gene expression data sets, the IDF not only outperforms the original decision forest, but is also superior or comparable to other state-of-the-art machine learning methods, especially in dealing with high-dimensional data. With an integrated built-in feature selection (FS) mechanism and fewer parameters to tune, it can be trained more efficiently than methods such as support vector machines, and can be built with far fewer trees than other popular tree-based ensemble methods. Moreover, it suffers less from the curse of dimensionality.
Affiliation(s)
- Jianping Huang
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, 388 YuHangTang Road, Hangzhou 310058, China
|
639
|
Xuan J, Wang Y, Dong Y, Feng Y, Wang B, Khan J, Bakay M, Wang Z, Pachman L, Winokur S, Chen YW, Clarke R, Hoffman E. Gene selection for multiclass prediction by weighted Fisher criterion. EURASIP J Bioinform Syst Biol 2010:64628. [PMID: 17713593 PMCID: PMC3171347 DOI: 10.1155/2007/64628] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 08/30/2006] [Revised: 12/16/2006] [Accepted: 03/20/2007] [Indexed: 12/14/2022]
Abstract
Gene expression profiling has been widely used to study molecular signatures of many diseases and to develop molecular diagnostics for disease prediction. Gene selection, as an important step for improved diagnostics, screens tens of thousands of genes and identifies a small subset that discriminates between disease types. A two-step gene selection method is proposed to identify informative gene subsets for accurate classification of multiclass phenotypes. In the first step, individually discriminatory genes (IDGs) are identified by using one-dimensional weighted Fisher criterion (wFC). In the second step, jointly discriminatory genes (JDGs) are selected by sequential search methods, based on their joint class separability measured by multidimensional weighted Fisher criterion (wFC). The performance of the selected gene subsets for multiclass prediction is evaluated by artificial neural networks (ANNs) and/or support vector machines (SVMs). By applying the proposed IDG/JDG approach to two microarray studies, that is, small round blue cell tumors (SRBCTs) and muscular dystrophies (MDs), we successfully identified a much smaller yet efficient set of JDGs for diagnosing SRBCTs and MDs with high prediction accuracies (96.9% for SRBCTs and 92.3% for MDs, resp.). These experimental results demonstrated that the two-step gene selection method is able to identify a subset of highly discriminative genes for improved multiclass prediction.
Affiliation(s)
- Jianhua Xuan
- Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Yue Wang
- Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Yibin Dong
- Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Yuanjian Feng
- Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Bin Wang
- Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Javed Khan
- Department of Pediatric Oncology, National Cancer Institute, Gaithersburg, MD 20877, USA
- Maria Bakay
- Research Center for Genetic Medicine, Children's National Medical Center, Washington, DC 20010, USA
- Zuyi Wang
- Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Research Center for Genetic Medicine, Children's National Medical Center, Washington, DC 20010, USA
- Lauren Pachman
- Disease Pathogenesis Program, Children's Memorial Research Center, Chicago, IL 60614, USA
- Sara Winokur
- Department of Biological Chemistry, University of California, Irvine, CA 92697, USA
- Yi-Wen Chen
- Research Center for Genetic Medicine, Children's National Medical Center, Washington, DC 20010, USA
- Robert Clarke
- Lombardi Cancer Center, Georgetown University, Washington, DC 20007, USA
- Eric Hoffman
- Research Center for Genetic Medicine, Children's National Medical Center, Washington, DC 20010, USA
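The first step of the gene selection method in the abstract above ranks genes one at a time by a Fisher criterion. The sketch below shows only the plain, unweighted two-class form of that idea, (difference of class means)² divided by the sum of class variances. The weighting scheme and the multiclass extension of the paper are not given in the abstract and are not reproduced; the names `fisher_score` and `rank_genes` are hypothetical.

```python
def fisher_score(values, labels):
    """Unweighted two-class Fisher criterion for a single gene."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    stats = []
    for c in classes:
        xs = [v for v, lab in zip(values, labels) if lab == c]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        stats.append((mean, var))
    (m1, v1), (m2, v2) = stats
    denom = v1 + v2
    if denom == 0:
        # No within-class spread: perfectly separated unless the means coincide.
        return 0.0 if m1 == m2 else float("inf")
    return (m1 - m2) ** 2 / denom


def rank_genes(expression, labels):
    """Return gene indices sorted from most to least discriminative.

    expression[i][j] is the expression of gene j in sample i.
    """
    n_genes = len(expression[0])
    scores = [fisher_score([row[j] for row in expression], labels)
              for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
```

Scoring genes individually is cheap but ignores joint effects, which is why the abstract follows it with a second, joint selection step.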
|
640
|
Jelizarow M, Guillemot V, Tenenhaus A, Strimmer K, Boulesteix AL. Over-optimism in bioinformatics: an illustration. Bioinformatics 2010; 26:1990-8. [PMID: 20581402 DOI: 10.1093/bioinformatics/btq323] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.1] [Indexed: 12/25/2022] Open
Abstract
MOTIVATION In statistical bioinformatics research, different optimization mechanisms potentially lead to 'over-optimism' in published papers. So far, however, a systematic critical study concerning the various sources underlying this over-optimism is lacking. RESULTS We present an empirical study on over-optimism using high-dimensional classification as example. Specifically, we consider a 'promising' new classification algorithm, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. While this approach yields poor results in terms of error rate, we quantitatively demonstrate that it can artificially seem superior to existing approaches if we 'fish for significance'. The investigated sources of over-optimism include the optimization of datasets, of settings, of competing methods and, most importantly, of the method's characteristics. We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should always be demonstrated on independent validation data. AVAILABILITY The R codes and relevant data can be downloaded from http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/overoptimism/, such that the study is completely reproducible.
Affiliation(s)
- Monika Jelizarow
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Munich, Germany
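The "fishing for significance" effect described in the abstract above can be illustrated with a toy simulation. It is not the authors' study design: it assumes purely random class labels and a set of method variants that guess at random, then reports the best error rate alongside the average. The function name `simulate_fishing` and all parameter values are hypothetical.

```python
import random


def simulate_fishing(n_samples=40, n_variants=50, seed=0):
    """Return (best, mean) error rate over random-guess variants on null data."""
    rng = random.Random(seed)
    # Null data: labels carry no signal, so the true error of any rule is 50%.
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    error_rates = []
    for _ in range(n_variants):
        predictions = [rng.randint(0, 1) for _ in range(n_samples)]
        errors = sum(p != y for p, y in zip(predictions, labels))
        error_rates.append(errors / n_samples)
    # Reporting only the minimum mimics choosing the best-looking variant.
    return min(error_rates), sum(error_rates) / n_variants
```

The best variant can never look worse than the average one, even though none of them has any real predictive ability, which is why the abstract calls for evaluation on independent validation data.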
|
641
|
Hopcroft LEM, McBride MW, Harris KJ, Sampson AK, McClure JD, Graham D, Young G, Holyoake TL, Girolami MA, Dominiczak AF. Predictive response-relevant clustering of expression data provides insights into disease processes. Nucleic Acids Res 2010; 38:6831-40. [PMID: 20571087 PMCID: PMC2978340 DOI: 10.1093/nar/gkq550] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022] Open
Abstract
This article describes and illustrates a novel method of microarray data analysis that couples model-based clustering and binary classification to form clusters of 'response-relevant' genes; that is, genes that are informative when discriminating between the different values of the response. Predictions are subsequently made using an appropriate statistical summary of each gene cluster, which we call the 'meta-covariate' representation of the cluster, in a probit regression model. We first illustrate this method by analysing a leukaemia expression dataset, before focusing closely on the meta-covariate analysis of a renal gene expression dataset in a rat model of salt-sensitive hypertension. We explore the biological insights provided by our analysis of these data. In particular, we identify a highly influential cluster of 13 genes, including three transcription factors (Arntl, Bhlhe41 and Npas2), that is implicated as being protective against hypertension in response to increased dietary sodium. Functional and canonical pathway analysis of this cluster using Ingenuity Pathway Analysis implicated transcriptional activation and circadian rhythm signalling, respectively. Although we illustrate our method using only expression data, the method is applicable to any high-dimensional dataset. Expression data are available at ArrayExpress (accession number E-MEXP-2514) and code is available at http://www.dcs.gla.ac.uk/inference/metacovariateanalysis/.
Affiliation(s)
- Lisa E M Hopcroft
- Inference Group, Department of Computing Science, University of Glasgow, and Gartnavel General Hospital, 1053 Great Western Road, Glasgow G12 0YN, UK
|
642
|
Shao L, Wu L, Fang H, Tong W, Fan X. Does applicability domain exist in microarray-based genomic research? PLoS One 2010; 5:e11055. [PMID: 20548774 PMCID: PMC2883551 DOI: 10.1371/journal.pone.0011055] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 04/14/2010] [Accepted: 05/12/2010] [Indexed: 11/18/2022] Open
Abstract
Constructing an accurate predictive model for clinical decision-making on the basis of a relatively small number of tumor samples with high-dimensional microarray data remains a very challenging problem. The validity of such models has been seriously questioned due to their failure in clinical validation using independent samples. Besides statistical issues such as selection bias, some studies further implied that the probable reason was improper sample selection that did not resemble the genomic space defined by the training population. Assuming that predictions would be more reliable for interpolation than extrapolation, we set out to investigate the impact of applicability domain (AD) on model performance in microarray-based genomic research by evaluating and comparing model performance for samples with different degrees of extrapolation. We found that the issue of applicability domain may not exist in microarray-based genomic research for clinical applications. Therefore, it is not practicable to improve model validity based on applicability domain.
Affiliation(s)
- Li Shao
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Leihong Wu
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Hong Fang
- Z-Tech Corporation, an ICF International Company at the National Center for Toxicological Research/United States Food and Drug Administration, Jefferson, Arkansas, United States of America
- Weida Tong
- National Center for Toxicological Research, United States Food and Drug Administration, Jefferson, Arkansas, United States of America
- Xiaohui Fan
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
|
643
|
Liedtke C, Wang J, Tordai A, Symmans WF, Hortobagyi GN, Kiesel L, Hess K, Baggerly KA, Coombes KR, Pusztai L. Clinical evaluation of chemotherapy response predictors developed from breast cancer cell lines. Breast Cancer Res Treat 2010; 121:301-9. [PMID: 19603265 DOI: 10.1007/s10549-009-0445-7] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Received: 06/06/2009] [Accepted: 06/10/2009] [Indexed: 02/03/2023]
Abstract
The goal of this study was to develop pharmacogenomic predictors in response to standard chemotherapy drugs in breast cancer cell lines and test their predictive value in patients who received treatment with the same drugs. Nineteen human breast cancer cell lines were tested for sensitivity to paclitaxel (T), 5-fluorouracil (F), doxorubicin (A) and cyclophosphamide (C) in vitro. Baseline gene expression data were obtained for each cell line with Affymetrix U133A gene chips, and multigene predictors of sensitivity were derived for each drug separately. These predictors were applied individually and in combination to human gene expression data generated with the same Affymetrix platform from fine needle aspiration specimens of 133 stage I-III breast cancers. Tumor samples were obtained at baseline, and each patient received 6 months of preoperative TFAC chemotherapy followed by surgery. Cell line-derived prediction results were correlated with the observed pathologic response to chemotherapy. Statistically robust differentially expressed genes between sensitive and resistant cells could only be found for paclitaxel. False discovery rates associated with the informative genes were high for all other drugs. For each drug, the top 100 differentially expressed genes were combined into a drug-specific response predictor. When these cell line-based predictors were applied to patient data, there was no significant correlation between observed response and predicted response either for individual drug predictors or combined predictions. Cell line-derived predictors of response to four commonly used chemotherapy drugs did not predict response accurately in patients.
Affiliation(s)
- Cornelia Liedtke
- Department of Breast Medical Oncology, The University of Texas MD Anderson Cancer Center, PO Box 301439, Houston, TX 77030-1439, USA
|
644
|
Zhao Y, Simon R. Development and validation of predictive indices for a continuous outcome using gene expression profiles. Cancer Inform 2010; 9:105-14. [PMID: 20523915 PMCID: PMC2879606 DOI: 10.4137/cin.s3805] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/20/2022] Open
Abstract
There have been relatively few publications using linear regression models to predict a continuous response based on microarray expression profiles. Standard linear regression methods are problematic when the number of predictor variables exceeds the number of cases. We have evaluated three linear regression algorithms that can be used for the prediction of a continuous response based on high dimensional gene expression data. The three algorithms are the least angle regression (LAR), the least absolute shrinkage and selection operator (LASSO), and the averaged linear regression method (ALM). All methods are tested using simulations based on a real gene expression dataset and analyses of two sets of real gene expression data and using an unbiased complete cross validation approach. Our results show that the LASSO algorithm often provides a model with somewhat lower prediction error than the LAR method, but both of them perform more efficiently than the ALM predictor. We have developed a plug-in for BRB-ArrayTools that implements the LAR and the LASSO algorithms with complete cross-validation.
Affiliation(s)
- Yingdong Zhao
- Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, USA.
|
645
|
Forest classification trees and forest support vector machines algorithms: Demonstration using microarray data. Comput Biol Med 2010; 40:519-24. [DOI: 10.1016/j.compbiomed.2010.03.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 06/27/2009] [Revised: 01/09/2010] [Accepted: 03/22/2010] [Indexed: 11/22/2022]
|
647
|
Zhang W, Robbins K, Wang Y, Bertrand K, Rekaya R. A jackknife-like method for classification and uncertainty assessment of multi-category tumor samples using gene expression information. BMC Genomics 2010; 11:273. [PMID: 20429942 PMCID: PMC2876124 DOI: 10.1186/1471-2164-11-273] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/10/2009] [Accepted: 04/29/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The use of gene expression profiling for the classification of human cancer tumors has been widely investigated. Previous studies were successful in distinguishing several tumor types in binary problems. As there are over a hundred types of cancers, and potentially even more subtypes, it is essential to develop multi-category methodologies for molecular classification for any meaningful practical application. RESULTS A jackknife-based supervised learning method called paired-samples test algorithm (PST), coupled with a binary classification model based on linear regression, was proposed and applied to two well-known and challenging datasets consisting of 14 (GCM dataset) and 9 (NCI60 dataset) tumor types. The results showed that the proposed method improved the prediction accuracy of the test samples for the GCM dataset, especially when the t-statistic was used in the primary feature selection. For the NCI60 dataset, the application of PST improved prediction accuracy when the number of genes used was relatively small (100 or 200). These improvements made the binary classification method more robust to the gene selection mechanism and the number of genes to be used. The overall prediction accuracies were competitive in comparison to the most accurate results obtained by several previous studies on the same datasets and with other methods. Furthermore, the relative confidence R(T) provided a unique insight into the sources of the uncertainty shown in the statistical classification and the potential variants within the same tumor type. CONCLUSION We proposed a novel bagging method for the classification and uncertainty assessment of multi-category tumor samples using gene expression information. The strengths were demonstrated in the application to two benchmark datasets.
Affiliation(s)
- Wensheng Zhang
- Department of Animal and Dairy Science, University of Georgia, Athens, GA 30602, USA
|
648
|
Lee JS, Leem SH, Lee SY, Kim SC, Park ES, Kim SB, Kim SK, Kim YJ, Kim WJ, Chu IS. Expression signature of E2F1 and its associated genes predict superficial to invasive progression of bladder tumors. J Clin Oncol 2010; 28:2660-7. [PMID: 20421545 DOI: 10.1200/jco.2009.25.0977] [Citation(s) in RCA: 271] [Impact Index Per Article: 18.1] [Indexed: 01/26/2023]
Abstract
PURPOSE In approximately 20% of patients with superficial bladder tumors, the tumors progress to invasive tumors after treatment. Current methods of predicting the clinical behavior of these tumors prospectively are unreliable. We aim to identify a molecular signature that can reliably identify patients with high-risk superficial tumors that are likely to progress to invasive tumors. PATIENTS AND METHODS Gene expression data were collected from tumor specimens from 165 patients with bladder cancer. Various statistical methods, including leave-one-out cross-validation methods, were applied to identify a gene expression signature that could predict the likelihood of progression to invasive tumors and to test the robustness of the expression signature in an independent cohort. The robustness of the gene expression signature was validated in an independent (n = 353) cohort. RESULTS Supervised analysis of gene expression data revealed a gene expression signature that is strongly associated with invasive bladder tumors. A molecular classifier based on this gene expression signature correctly predicted the likelihood of progression of superficial tumor to invasive tumor. CONCLUSION We present a molecular signature that can predict, at diagnosis, the likelihood of bladder cancer progression and, possibly, lead to improvements in patient therapy.
Affiliation(s)
- Ju-Seog Lee
- The University of Texas M. D. Anderson Cancer Center, Houston, TX, USA
|
649
|
Hall P, Pham T. Optimal properties of centroid-based classifiers for very high-dimensional data. Ann Stat 2010. [DOI: 10.1214/09-aos736] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/19/2022]
|
650
|
|