1051
|
Abstract
An important goal of microarray studies is the detection of genes that show significant changes in expression when two classes of biological samples are being compared. We present an ANOVA-style mixed model with parameters for array normalization, overall level of gene expression, and change of expression between the classes. For the latter we assume a mixing distribution with a probability mass concentrated at zero, representing genes with no changes, and a normal distribution representing the level of change for the other genes. We estimate the parameters by optimizing the marginal likelihood. To make this practical, Laplace approximations and a backfitting algorithm are used. The performance of the model is studied by simulation and by application to publicly available data sets.
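The mixing distribution described in this abstract can be written compactly. The following is only a notational sketch: the symbols \delta_g (change in expression for gene g), \pi_0 (proportion of genes with no change), \Delta_0 (unit point mass at zero), \mu and \tau^2 are assumptions introduced here and are not taken from the paper.

```latex
\delta_g \;\sim\; \pi_0 \, \Delta_0 \;+\; (1 - \pi_0)\, \mathcal{N}(\mu, \tau^2),
\qquad 0 \le \pi_0 \le 1 .
```

Under this reading, a gene has exactly zero change with probability \pi_0, and otherwise its change is drawn from the normal component.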
Affiliation(s)
- Göran Kauermann
- Department of Economics and Business Administration, University of Bielefeld, 33501 Bielefeld, Germany.
|
1052
|
Helman P, Veroff R, Atlas SR, Willman C. A Bayesian network classification methodology for gene expression data. J Comput Biol 2005; 11:581-615. [PMID: 15579233 DOI: 10.1089/cmb.2004.11.581] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022]
Abstract
We present new techniques for the application of a Bayesian network learning framework to the problem of classifying gene expression data. The focus on classification permits us to develop techniques that address in several ways the complexities of learning Bayesian nets. Our classification model reduces the Bayesian network learning problem to the problem of learning multiple subnetworks, each consisting of a class label node and its set of parent genes. We argue that this classification model is more appropriate for the gene expression domain than are other structurally similar Bayesian network classification models, such as Naive Bayes and Tree Augmented Naive Bayes (TAN), because our model is consistent with prior domain experience suggesting that a relatively small number of genes, taken in different combinations, is required to predict most clinical classes of interest. Within this framework, we consider two different approaches to identifying parent sets which are supported by the gene expression observations and any other currently available evidence. One approach employs a simple greedy algorithm to search the universe of all genes; the second approach develops and applies a gene selection algorithm whose results are incorporated as a prior to enable an exhaustive search for parent sets over a restricted universe of genes. Two other significant contributions are the construction of classifiers from multiple, competing Bayesian network hypotheses and algorithmic methods for normalizing and binning gene expression data in the absence of prior expert knowledge. Our classifiers are developed under a cross validation regimen and then validated on corresponding out-of-sample test sets. The classifiers attain a classification rate in excess of 90% on out-of-sample test sets for two publicly available datasets. We present an extensive compilation of results reported in the literature for other classification methods run against these same two datasets. 
Our results are comparable to, or better than, any we have found reported for these two sets, when a train-test protocol as stringent as ours is followed.
Affiliation(s)
- Paul Helman
- Computer Science Department, University of New Mexico, Albuquerque, NM 87131, USA.
|
1053
|
Bertoni A, Folgieri R, Valentini G. Bio-molecular cancer prediction with random subspace ensembles of support vector machines. Neurocomputing 2005. [DOI: 10.1016/j.neucom.2004.07.007] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.4] [Indexed: 10/26/2022]
|
1054
|
Wang A, Gehan EA. Gene selection for microarray data analysis using principal component analysis. Stat Med 2005; 24:2069-87. [PMID: 15806617 DOI: 10.1002/sim.2082] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.4] [Indexed: 11/06/2022]
Abstract
Principal component analysis (PCA) has been widely used in multivariate data analysis to reduce the dimensionality of the data in order to simplify subsequent analysis and allow for summarization of the data in a parsimonious manner. It has become a useful tool in microarray data analysis. For a typical microarray data set, it is often difficult to compare the overall gene expression difference between observations from different groups or conduct the classification based on a very large number of genes. In this paper, we propose a gene selection method based on the strategy proposed by Krzanowski. We demonstrate the effectiveness of this procedure using a cancer gene expression data set and compare it with several other gene selection strategies. It turns out that the proposed method selects the best gene subset for preserving the original data structure.
Affiliation(s)
- Antai Wang
- Department of Biomathematics and Biostatistics, Georgetown University, Lombardi Cancer Center, 4000 Reservoir Road NW, Washington, DC 20057-1484, U.S.A.
|
1055
|
Tsai CA, Lee TC, Ho IC, Yang UC, Chen CH, Chen JJ. Multi-class clustering and prediction in the analysis of microarray data. Math Biosci 2004; 193:79-100. [PMID: 15681277 DOI: 10.1016/j.mbs.2004.07.002] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.2] [Received: 11/24/2003] [Revised: 06/07/2004] [Accepted: 07/27/2004] [Indexed: 11/23/2022]
Abstract
DNA microarray technology provides tools for studying the expression profiles of a large number of distinct genes simultaneously. This technology has been applied to sample clustering and sample prediction. Because of the large number of genes measured, many of the genes in the original data set are irrelevant to the analysis. Selection of discriminatory genes is critical to the accuracy of clustering and prediction. This paper considers a statistical significance testing approach to selecting discriminatory gene sets for multi-class clustering and prediction of experimental samples. A toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV, with a total of 55 samples) is used to illustrate a general framework of the approach. Among four selected gene sets, a gene set omega(I), formed by the intersection of the F-test gene set and the union of the one-versus-all t-test gene sets, performs the best in terms of clustering as well as prediction. Hierarchical and two modified partition (k-means) methods all show that the set omega(I) is able to group the 55 samples into seven clusters reasonably well, in which the As and AsV samples are considered as one cluster (the same group), as are the Cd and Cu samples. With respect to prediction, the overall accuracy for the gene set omega(I), using the nearest neighbors algorithm to assign each of the 55 samples to one of the nine treatments, is 85%.
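The construction of omega(I) described above is a plain set operation once the per-test gene lists exist. The sketch below assumes the F-test and the one-versus-all t-tests have already been run; the function name, gene identifiers, and selected sets are hypothetical and serve only to illustrate the intersection-of-union step.

```python
def intersection_gene_set(f_test_genes, one_vs_all_genes):
    """Genes selected by the F-test AND by at least one one-versus-all t-test.

    f_test_genes: iterable of gene IDs selected by the overall F-test.
    one_vs_all_genes: dict mapping a treatment name to the set of gene IDs
        selected by that treatment's one-versus-all t-test.
    """
    # Union of all one-versus-all selections (empty set if there are none).
    union_t = set().union(*one_vs_all_genes.values())
    return set(f_test_genes) & union_t


# Hypothetical selections, for illustration only.
f_selected = {"g1", "g2", "g3", "g5"}
t_selected = {"As": {"g1", "g4"}, "Cd": {"g2"}, "Ni": {"g6"}}

omega_i = intersection_gene_set(f_selected, t_selected)
print(sorted(omega_i))  # ['g1', 'g2']
```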
Affiliation(s)
- Chen-An Tsai
- Division of Biometry and Risk Assessment, National Center for Toxicological Research, Food and Drug Administration NCTR/FDA/HFT-20 Jefferson, AR 72079, USA
|
1056
|
Abstract
MOTIVATION One problem with discriminant analysis of DNA microarray data is that each sample is represented by quite a large number of genes, and many of them are irrelevant, insignificant or redundant to the discriminant problem at hand. Methods for selecting important genes are, therefore, of much significance in microarray data analysis. In the present study, a new criterion, called the LS Bound measure, is proposed to address the gene selection problem. The LS Bound measure is derived from the leave-one-out procedure of LS-SVMs (least squares support vector machines), and as the upper bound for leave-one-out classification results it reflects to some extent the generalization performance of gene subsets. RESULTS We applied this LS Bound measure for gene selection on two benchmark microarray datasets: colon cancer and leukemia. We also compared the LS Bound measure with other evaluation criteria, including the well-known Fisher's ratio and the Mahalanobis class separability measure, and with other published gene selection algorithms, including Weighting factor and SVM Recursive Feature Elimination. The strength of the LS Bound measure is that it provides gene subsets leading to more accurate classification results than the filter method, while its computational complexity is at the level of the filter method. AVAILABILITY A companion website can be accessed at http://www.ntu.edu.sg/home5/pg02776030/lsbound/. The website contains: (1) the source code of the gene selection algorithm; (2) the complete set of tables and figures regarding the experimental study; (3) proof of the inequality (9). CONTACT ekzmao@ntu.edu.sg.
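For context on the filter-style baseline mentioned above, the following sketch computes one common form of Fisher's ratio for a single gene in a two-class problem: the squared difference of class means divided by the sum of class variances. The abstract does not state which exact definition the authors used, so this form, the function name, and the sample values are assumptions. It does not implement the LS Bound measure itself.

```python
import math
from statistics import mean, variance


def fisher_ratio(class_a, class_b):
    """Two-class Fisher's ratio for one gene: (mean_a - mean_b)^2 / (var_a + var_b).

    Uses sample variances, so each class needs at least two values.
    """
    diff_sq = (mean(class_a) - mean(class_b)) ** 2
    spread = variance(class_a) + variance(class_b)
    if spread == 0:
        # Degenerate case: no within-class spread at all.
        return math.inf if diff_sq > 0 else 0.0
    return diff_sq / spread


# Hypothetical expression values of one gene in two classes.
score = fisher_ratio([1.0, 2.0, 3.0], [5.0, 6.0, 7.0])
print(score)  # 8.0
```

Ranking genes by such a score and keeping the top few is the kind of filter method the abstract compares against.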
Affiliation(s)
- Xin Zhou
- School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang avenue, Singapore 639798
|
1057
|
Bickel PJ, Levina E. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. BERNOULLI 2004. [DOI: 10.3150/bj/1106314847] [Citation(s) in RCA: 332] [Impact Index Per Article: 15.8] [Indexed: 11/19/2022]
|
1058
|
Fort G, Lambert-Lacroix S. Classification using partial least squares with penalized logistic regression. Bioinformatics 2004; 21:1104-11. [PMID: 15531609 DOI: 10.1093/bioinformatics/bti114] [Citation(s) in RCA: 134] [Impact Index Per Article: 6.4] [Indexed: 11/12/2022]
Abstract
MOTIVATION One important aspect of data-mining of microarray data is to discover the molecular variation among cancers. In microarray studies, the number n of samples is relatively small compared to the number p of genes per sample (usually in thousands). It is known that standard statistical methods in classification are efficient (i.e. in the present case, yield successful classifiers) particularly when n is (far) larger than p. This naturally calls for the use of a dimension reduction procedure together with the classification one. RESULTS In this paper, the question of classification in such a high-dimensional setting is addressed. We view the classification problem as a regression one with few observations and many predictor variables. We propose a new method combining partial least squares (PLS) and Ridge penalized logistic regression. We review the existing methods based on PLS and/or penalized likelihood techniques, outline their interest in some cases and theoretically explain their sometimes poor behavior. Our procedure is compared with these other classifiers. The predictive performance of the resulting classification rule is illustrated on three data sets: Leukemia, Colon and Prostate.
|
1059
|
Statistical Analysis of a Loop Designed Microarray Experiment Data. KOREAN JOURNAL OF APPLIED STATISTICS 2004. [DOI: 10.5351/kjas.2004.17.3.419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
1060
|
Ji Y, Tsui KW, Kim K. A novel means of using gene clusters in a two-step empirical Bayes method for predicting classes of samples. Bioinformatics 2004; 21:1055-61. [PMID: 15514000 DOI: 10.1093/bioinformatics/bti092] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022]
Abstract
MOTIVATION The classification of samples using gene expression profiles is an important application in areas such as cancer research and environmental health studies. However, the classification is usually based on a small number of samples, and each sample is a long vector of thousands of gene expression levels. An important issue in parametric modeling for so many gene expression levels is the control of the number of nuisance parameters in the model. Large models often lead to intensive or even intractable computation, while small models may be inadequate for complex data. METHODOLOGY We propose a two-step empirical Bayes classification method as a solution to this issue. At the first step, we use the model-based cluster algorithm with a non-traditional purpose of assigning gene expression levels to form abundance groups. At the second step, by assuming the same variance for all the genes in the same group, we substantially reduce the number of nuisance parameters in our statistical model. RESULTS The proposed model is more parsimonious, which leads to efficient computation under an empirical Bayes estimation procedure. We consider two real examples and simulate data using our method. Desired low classification error rates are obtained even when a large number of genes are pre-selected for class prediction.
Affiliation(s)
- Yuan Ji
- Department of Biostatistics and Applied Mathematics, The University of Texas M.D. Anderson Cancer Center Houston, TX 77030, USA.
|
1061
|
Huang Y, Cai J, Ji L, Li Y. Classifying G-protein coupled receptors with bagging classification tree. Comput Biol Chem 2004; 28:275-80. [PMID: 15548454 DOI: 10.1016/j.compbiolchem.2004.08.001] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.1] [Received: 03/30/2004] [Revised: 08/05/2004] [Accepted: 08/06/2004] [Indexed: 11/17/2022]
Abstract
G-protein coupled receptors (GPCRs) play a key role in different biological processes, such as regulation of growth, death and metabolism of cells. They are major therapeutic targets of numerous prescribed drugs. However, the ligand specificity of many receptors is unknown and there is little structural information available. Bioinformatics may offer one approach to bridge the gap between sequence data and functional knowledge of a receptor. In this paper, we use a bagging classification tree algorithm to predict the type of the receptor based on its amino acid composition. The prediction is performed for GPCRs at the sub-family and sub-sub-family level. In a cross-validation test, we achieved an overall predictive accuracy of 91.1% for GPCR sub-family classification, and 82.4% for sub-sub-family classification. These results demonstrate the applicability of this relatively simple method and its potential for improving prediction accuracy.
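The input representation named above, amino acid composition, can be sketched as a 20-dimensional vector of residue frequencies. This is an illustrative assumption about the feature encoding (the abstract does not give the exact definition); the function name and example sequence are hypothetical, and the bagging classifier itself is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Characters outside the 20 standard residues (e.g. 'X') are ignored.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    total = 0
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
            total += 1
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical short peptide, for illustration only.
features = amino_acid_composition("MKTAYIAKQR")
print(len(features))  # 20
```

Each receptor sequence then becomes one fixed-length feature vector, regardless of sequence length, which is what makes a tree-based classifier directly applicable.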
Affiliation(s)
- Ying Huang
- Department of Automation, MOE Key Laboratory of Bioinformatics, Institute of Bioinformatics, Tsinghua University, Beijing 10084, China.
|
1062
|
Hong H, Tong W, Perkins R, Fang H, Xie Q, Shi L. Multiclass Decision Forest—A Novel Pattern Recognition Method for Multiclass Classification in Microarray Data Analysis. DNA Cell Biol 2004; 23:685-94. [PMID: 15585126 DOI: 10.1089/dna.2004.23.685] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.6] [Indexed: 01/06/2023]
Abstract
The wealth of knowledge embedded in gene expression data from DNA microarrays portends rapid advances in both research and clinic. Turning the prodigious and noisy data into knowledge is a challenge to the field of bioinformatics, and development of classifiers using supervised learning techniques is the primary methodological approach for clinical application using gene expression data. In this paper, we present a novel classification method, multiclass Decision Forest (DF), that is the direct extension of the two-class DF previously developed in our lab. Central to DF is the synergistic combining of multiple heterogenic but comparable decision trees to reach a more accurate and robust classification model. The computationally inexpensive multiclass DF algorithm integrates gene selection and model development, and thus eliminates the bias of gene preselection in crossvalidation. Importantly, the method provides several statistical means for assessment of prediction accuracy, prediction confidence, and diagnostic capability. We demonstrate the method by application to gene expression data for 83 small round blue-cell tumor (SRBCT) samples, each belonging to one of four different classes. Based on 500 runs of 10-fold crossvalidation, tumor prediction accuracy was approximately 97%, sensitivity was approximately 95%, diagnostic sensitivity was approximately 91%, and diagnostic accuracy was approximately 99.5%. Among 25 genes selected to distinguish tumor class, 12 have functional information in the literature implicating their involvement in cancer. The four types of SRBCT samples are also distinguishable in a clustering analysis based on the expression profiles of these 25 genes. The results demonstrated that the multiclass DF is an effective classification method for analysis of gene expression data for the purpose of molecular diagnostics.
Affiliation(s)
- Huixiao Hong
- Bioinformatics Laboratory, National Center for Toxicological Research, FDA, Jefferson, Arkansas 72079, USA
|
1063
|
Tsai CA, Chen CH, Lee TC, Ho IC, Yang UC, Chen JJ. Gene Selection for Sample Classifications in Microarray Experiments. DNA Cell Biol 2004; 23:607-14. [PMID: 15585118 DOI: 10.1089/dna.2004.23.607] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
DNA microarray technology provides useful tools for profiling global gene expression patterns in different cell/tissue samples. One major challenge is the large number of genes relative to the number of samples. The use of all genes can suppress or reduce the performance of a classification rule due to the noise of nondiscriminatory genes. Selection of an optimal subset from the original gene set becomes an important prestep in sample classification. In this study, we propose a family-wise error (FWE) rate approach to selection of discriminatory genes for two-sample or multiple-sample classification. The FWE approach controls the probability of one or more false positives at a prespecified level. A public colon cancer data set is used to evaluate the performance of the proposed approach for two classification methods: k nearest neighbors (k-NN) and support vector machine (SVM). The selected gene sets from the proposed procedure appear to perform better than or comparably to several results reported in the literature using univariate analysis without performing a multivariate search. In addition, we apply the FWE approach to a toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV) for a total of 55 samples for a multisample classification. Two gene sets are considered: the gene set omegaF formed by the ANOVA F-test, and a gene set omegaT formed by the union of one-versus-all t-tests. The predicted accuracies are evaluated using internal and external crossvalidation. Using the SVM classification, the overall accuracies to predict 55 samples into one of the nine treatments are above 80% for internal crossvalidation. OmegaF has slightly higher accuracy rates than omegaT. The overall predicted accuracies are above 70% for the external crossvalidation; the two gene sets omegaT and omegaF performed equally well.
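To make the idea of family-wise error control concrete, the sketch below selects genes with the Bonferroni correction, the simplest FWE-controlling procedure. The abstract does not say which FWE procedure the authors used, so Bonferroni is swapped in here purely for illustration; the function name, gene identifiers, and p-values are hypothetical.

```python
def select_genes_fwe(p_values, alpha=0.05):
    """Select genes whose per-gene p-value passes the Bonferroni threshold.

    p_values: dict mapping gene ID to the p-value of its univariate test.
    alpha: target family-wise error rate, i.e. the probability of one or
        more false positives among all tested genes.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # Bonferroni: split alpha evenly over m tests
    return [gene for gene, p in p_values.items() if p <= threshold]


# Hypothetical p-values; with 4 genes the threshold is 0.05 / 4 = 0.0125.
p_vals = {"g1": 0.0001, "g2": 0.02, "g3": 0.004, "g4": 0.5}
selected = select_genes_fwe(p_vals, alpha=0.05)
print(selected)  # ['g1', 'g3']
```

With thousands of genes the Bonferroni threshold becomes very strict, which is one reason less conservative FWE procedures (for example step-down or resampling-based ones) are often preferred in practice.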
Affiliation(s)
- Chen-An Tsai
- Division of Biometry and Risk Assessment, National Center for Toxicological Research, Food and Drug Administration, Jefferson, Arkansas 72079, USA
|
1064
|
|
1065
|
Ye J, Li T, Xiong T, Janardan R. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2004; 1:181-90. [PMID: 17051700 DOI: 10.1109/tcbb.2004.45] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.4] [Indexed: 05/12/2023]
Abstract
The classification of tissue samples based on gene expression data is an important problem in medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high (in the thousands) compared to the number of data samples (in the tens or low hundreds); that is, the data dimension is large compared to the number of data points (such data is said to be undersampled). To cope with performance and accuracy problems associated with high dimensionality, it is commonplace to apply a preprocessing step that transforms the data to a space of significantly lower dimension with limited loss of the information present in the original data. Linear Discriminant Analysis (LDA) is a well-known technique for dimension reduction and feature extraction, but it is not applicable for undersampled data due to singularity problems associated with the matrices in the underlying representation. This paper presents a dimension reduction and feature extraction scheme, called Uncorrelated Linear Discriminant Analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data. ULDA employs the Generalized Singular Value Decomposition method to handle undersampled data and the features that it produces in the transformed space are uncorrelated, which makes it attractive for gene expression data. The properties of ULDA are established rigorously and extensive experimental results on gene expression data are presented to illustrate its effectiveness in classifying tissue samples. These results provide a comparative study of various state-of-the-art classification methods on well-known gene expression data sets.
Affiliation(s)
- Jieping Ye
- Department of Computer Science and Engineering, University of Minnesota, Twin Cities, 4-192 EE/CSci Bldg., 200 Union Street S.E., Minneapolis, MN 55455, USA.
|
1066
|
Holzman T, Kolker E. Statistical analysis of global gene expression data: some practical considerations. Curr Opin Biotechnol 2004; 15:52-7. [PMID: 15102467 DOI: 10.1016/j.copbio.2003.12.004] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Indexed: 11/25/2022]
Abstract
Applying appropriate error models and conservative estimates to microarray data helps to reduce the number of false predictions and allows one to focus on biologically relevant observations. Several key conclusions have been drawn from the statistical analysis of global gene expression data: it is worth keeping core information for each experiment, including raw and processed data; biological and technical replicates are needed; careful experimental design makes the analysis simpler and more powerful; the choice of the similarity measure is nontrivial and depends on the goal of an experiment; array information must be complemented with other data; and gene expression studies are 'hypothesis generators'.
Affiliation(s)
- Ted Holzman
- BIATECH, 19310 North Creek Parkway, Suite 115, Bothell, WA 98011, USA
|
1067
|
Desper R, Khan J, Schäffer AA. Tumor classification using phylogenetic methods on expression data. J Theor Biol 2004; 228:477-96. [PMID: 15178197 DOI: 10.1016/j.jtbi.2004.02.021] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.1] [Received: 03/19/2003] [Revised: 02/03/2004] [Accepted: 02/20/2004] [Indexed: 10/26/2022]
Abstract
Tumor classification is a well-studied problem in the field of bioinformatics. Developments in the field of DNA chip design have now made it possible to measure the expression levels of thousands of genes in sample tissue from healthy cell lines or tumors. A number of studies have examined the problems of tumor classification: class discovery, the problem of defining a number of classes of tumors using the data from a DNA chip, and class prediction, the problem of accurately classifying an unknown tumor, given expression data from the unknown tumor and from a learning set. The current work has applied phylogenetic methods to both problems. To solve the class discovery problem, we impose a metric on a set of tumors as a function of their gene expression levels, and impose a tree structure on this metric, using standard tree fitting methods borrowed from the field of phylogenetics. Phylogenetic methods provide a simple way of imposing a clear hierarchical relationship on the data, with branch lengths in the classification tree representing the degree of separation witnessed. We tested our method for class discovery on two data sets: a data set of 87 tissues, comprised mostly of small, round, blue-cell tumors (SRBCTs), and a data set of 22 breast tumors. We fit the 87 samples of the first set to a classification tree, which neatly separated into four major clusters corresponding exactly to the four groups of tumors, namely neuroblastomas, rhabdomyosarcomas, Burkitt's lymphomas, and the Ewing's family of tumors. The classification tree built using the breast cancer data separated tumors with BRCA1 mutations from those with BRCA2 mutations, with sporadic tumors separated from both groups and from each other. We also demonstrate the flexibility of the class discovery method with regard to standard resampling methodology such as jackknifing and noise perturbation. 
To solve the class prediction problem, we built a classification tree on the learning set, and then sought the optimal placement of each test sample within the classification tree. We tested this method on the SRBCT data set, and classified each tumor successfully.
Affiliation(s)
- Richard Desper
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bldg. 38A, Room 8N805, 8600 Rockville Pike, Bethesda, MD 20894, USA.
|
1068
|
Pawitan Y, Bjöhle J, Wedren S, Humphreys K, Skoog L, Huang F, Amler L, Shaw P, Hall P, Bergh J. Gene expression profiling for prognosis using Cox regression. Stat Med 2004; 23:1767-80. [PMID: 15160407 DOI: 10.1002/sim.1769] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.9] [Indexed: 11/11/2022]
Abstract
Given the promise of rich biological information in microarray data we will expect an increasing demand for a robust, practical and well-tested methodology to provide patient prognosis based on gene expression data. In standard settings, with few clinical predictors, such a methodology has been provided by the Cox proportional hazard model, but no corresponding methodology is available to deal with the full set of genes in microarray data. Furthermore, we want the procedure to be able to deal with the general survival data that include censored information. Conceptually such a procedure can be constructed quite easily, but its implementation will never be straightforward due to computational problems. We have developed an approach that relies on an extension of the Cox proportional likelihood that allows random effects parameters. In this approach, we use the full set of genes in the analysis and deal with survival data in the most general way. We describe the development of the model and the steps in the implementation, including a fast computational formula based on a subsampling of the risk set and the singular value decomposition. Finally, we illustrate the methodology using a data set obtained from a cohort of breast cancer patients.
Affiliation(s)
- Y Pawitan
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden.
|
1069
|
Liu H, Wong L. Data mining tools for biological sequences. J Bioinform Comput Biol 2004; 1:139-67. [PMID: 15290785 DOI: 10.1142/s0219720003000216] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.6] [Received: 04/04/2002] [Revised: 04/07/2003] [Accepted: 04/07/2003] [Indexed: 11/18/2022]
Abstract
We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
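Step (a) above, generating candidate features from k-grams, can be sketched as counting every length-k substring of a sequence. This is a minimal illustration of the general k-gram idea, not the authors' exact feature definitions; the function name and the example sequence are hypothetical.

```python
from collections import Counter


def kgram_counts(sequence, k):
    """Count all overlapping substrings of length k in a sequence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # range() is empty when the sequence is shorter than k, giving no k-grams.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical DNA fragment; its 3-grams are ATG, TGA, GAT, ATG, TGC.
counts = kgram_counts("ATGATGC", 3)
print(counts["ATG"])  # 2
```

The resulting counts (or their frequencies) would then feed the feature selection step (b), for example ranking k-grams by a signal-to-noise or entropy score between true and false initiation sites.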
Affiliation(s)
- Huiqing Liu
- Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore.
|
1070
|
Soukup M, Lee JK. Developing optimal prediction models for cancer classification using gene expression data. J Bioinform Comput Biol 2004; 1:681-94. [PMID: 15290759 DOI: 10.1142/s0219720004000351] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Received: 03/03/2003] [Revised: 08/05/2003] [Accepted: 08/07/2003] [Indexed: 11/18/2022]
Abstract
Microarrays can provide genome-wide expression patterns for various cancers, especially for tumor sub-types that may exhibit substantially different patient prognosis. Using such gene expression data, several approaches have been proposed to classify tumor sub-types accurately. These classification methods are not robust and are often dependent on a particular training sample for modelling, which raises issues in utilizing these methods to administer proper treatment for a future patient. We propose to construct an optimal, robust prediction model for classifying cancer sub-types using gene expression data. Our model is constructed in a step-wise fashion implementing cross-validated quadratic discriminant analysis. At each step, all identified models are validated by an independent sample of patients to develop a robust model for future data. We apply the proposed methods to two microarray data sets of cancer: the acute leukemia data by Golub et al. and the colon cancer data by Alon et al. We have found that the dimensionality of our optimal prediction models is relatively small for these cases and that our prediction models with one or two gene factors outperform, or perform competitively with, other methods based on 50 or more predictive gene factors, especially for independent samples. The methodology is implemented using procedures in R and Splus. The source code can be obtained at http://hesweb1.med.virginia.edu/bioinformatics.
Affiliation(s)
- Mat Soukup
- Department of Statistics, University of Virginia, Halsey Hall, Charlottesville, VA 22904-4135, USA.
|
1071
|
Guan Z, Zhao H. A semiparametric approach for marker gene selection based on gene expression data. Bioinformatics 2004; 21:529-36. [PMID: 15374863 DOI: 10.1093/bioinformatics/bti032] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: Identification of differentially expressed genes is a major issue in gene expression data analysis, and selection of marker genes is critical in tumor classification using gene expression data. In this paper, we propose a semiparametric two-sample test both to identify differentially expressed genes and to select marker genes for sample classification. RESULTS: A simulation study shows that the proposed method is more robust and powerful than generally used methods, such as t-tests and non-parametric rank-sum tests, when the sample size is small. Cross-validation shows that sample classification based on genes selected using this semiparametric method has lower misclassification rates. CONTACT: hongyu.zhao@yale.edu.
Affiliation(s)
- Zhong Guan
- Department of Mathematical Sciences, Indiana University South Bend South Bend, IN 46634, USA
|
1072
|
Cho JH, Lee D, Park JH, Lee IB. Gene selection and classification from microarray data using kernel machine. FEBS Lett 2004; 571:93-8. [PMID: 15280023 DOI: 10.1016/j.febslet.2004.05.087] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.0] [Received: 02/05/2004] [Revised: 05/18/2004] [Accepted: 05/18/2004] [Indexed: 11/22/2022]
Abstract
The discrimination of cancer patients (including subtypes) based on gene expression data is a critical problem with clinical ramifications. Central to solving this problem is the issue of how to extract the most relevant genes from the several thousand genes on a typical microarray. Here, we propose a methodology that can effectively select an informative subset of genes and classify the subtypes (or patients) of disease using the selected genes. We employ a kernel machine, kernel Fisher discriminant analysis (KFDA), for discrimination and use the derivatives of the kernel function to perform gene selection. Using a modified form of KFDA in the minimum squared error (MSE) sense and the gradients of the kernel functions, we construct an effective gene selection criterion. We assess the performance of the proposed methodology by applying it to three gene expression datasets: leukemia dataset, breast cancer dataset and colon cancer dataset. Using a few informative genes, the proposed method accurately and reliably classified cancer subtypes (or patients). Also, through a comparison study, we verify the reliability of the gene selection and discrimination results.
Affiliation(s)
- Ji-Hoon Cho
- Department of Chemical Engineering, Pohang University of Science and Technology, San 31 Hyoja-Dong, Pohang 790-784, Republic of Korea
|
1073
|
Li W, Sun F, Grosse I. Extreme value distribution based gene selection criteria for discriminant microarray data analysis using logistic regression. J Comput Biol 2004; 11:215-26. [PMID: 15285889 DOI: 10.1089/1066527041410445] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022] Open
Abstract
One important issue commonly encountered in the analysis of microarray data is to decide which and how many genes should be selected for further studies. For discriminant microarray data analyses based on statistical models, such as logistic regression models, gene selection can be accomplished by comparing the maximum likelihood of the model given the real data, L(D|M), with the expected maximum likelihood of the model given an ensemble of surrogate data with randomly permuted labels, L(D(0)|M). Typically, the computational burden for obtaining L(D(0)|M) is immense, often exceeding the limits of available computing resources by orders of magnitude. Here, we propose an approach that circumvents such heavy computations by mapping the simulation problem to an extreme-value problem. We present the derivation of an asymptotic distribution of the extreme value as well as its mean, median, and variance. Using this distribution, we propose two gene selection criteria, and we apply them to two microarray datasets and three classification tasks for illustration.
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, North Shore LIJ Research Institute, 350 Community Drive, Manhasset, NY 11030, USA.
|
1074
|
Nørsett KG, Laegreid A, Midelfart H, Yadetie F, Erlandsen SE, Falkmer S, Grønbech JE, Waldum HL, Komorowski J, Sandvik AK. Gene expression based classification of gastric carcinoma. Cancer Lett 2004; 210:227-37. [PMID: 15183539 DOI: 10.1016/j.canlet.2004.01.022] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.0] [Received: 11/27/2003] [Revised: 01/25/2004] [Accepted: 01/28/2004] [Indexed: 12/13/2022]
Abstract
The aim of the present work is to identify molecular markers that allow classification of gastric carcinoma with respect to important clinicopathological parameters. Gastric adenocarcinomas were subjected to cDNA microarray analysis with a 2,504-gene probe set. Using the Rosetta rough-set based learning system, good classifiers were generated for gene-expression based prediction of intestinal or diffuse growth pattern according to Laurén's classification and of the presence of lymph node metastases. To our knowledge, this is the first study on gastric carcinoma in which molecular classification has been achieved for more than one clinicopathological parameter based on microarray gene expression profiles.
Affiliation(s)
- Kristin G Nørsett
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, NTNU, N-7489 Trondheim, Norway
|
1075
|
Rank sum method for related gene selection and its application to tumor diagnosis. ACTA ACUST UNITED AC 2004. [DOI: 10.1007/bf03184138] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 10/19/2022]
|
1076
|
|
1077
|
|
1078
|
Brabender J, Marjoram P, Salonga D, Metzger R, Schneider PM, Park JM, Schneider S, Hölscher AH, Yin J, Meltzer SJ, Danenberg KD, Danenberg PV, Lord RV. A multigene expression panel for the molecular diagnosis of Barrett's esophagus and Barrett's adenocarcinoma of the esophagus. Oncogene 2004; 23:4780-8. [PMID: 15107828 DOI: 10.1038/sj.onc.1207663] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.2] [Indexed: 01/03/2023]
Abstract
In order to identify genes or combinations of genes that have the power to discriminate between premalignant Barrett's esophagus and Barrett's associated adenocarcinoma, we analysed a panel of 23 genes using quantitative real-time RT-PCR (qRT-PCR, Taqman) and bioinformatic tools. The genes chosen were either known to be associated with Barrett's carcinogenesis or were filtered from a previous cDNA microarray study on Barrett's adenocarcinoma. A total of 98 tissues, obtained from 19 patients with Barrett's esophagus (BE group) and 20 patients with Barrett's associated esophageal adenocarcinoma (EA group), were studied. Triplicate analysis for the full panel of 23 genes of interest, and analysis of an internal control gene, was performed for all samples, for a total of more than 9016 single PCR reactions. We found distinct classes of gene expression patterns in the different types of tissues. The most informative genes clustered in six different classes and had significantly different expression levels in Barrett's esophagus tissues compared to adenocarcinoma tissues. Linear discriminant analysis (LDA) distinguished four genetically different groups. The normal squamous esophagus tissues from patients with BE or EA were not distinguishable from one another, but Barrett's esophagus tissues could be distinguished from adenocarcinoma tissues. Using the most informative genes, obtained from a logistic regression analysis, we were able to completely distinguish between benign Barrett's and Barrett's adenocarcinomas. This study provides the first non-array parallel mRNA quantitation analysis of a panel of genes in the Barrett's esophagus model of multistage carcinogenesis. Our results suggest that mRNA expression quantitation of a panel of genes can discriminate between premalignant and malignant Barrett's disease. Logistic regression and LDA can be used to further identify, from the complete panel, gene subsets with the power to make these diagnostic distinctions. Expression analysis of a limited number of highly selected genes may have clinical usefulness for the treatment of patients with this disease.
Affiliation(s)
- Jan Brabender
- Department of Visceral and Vascular Surgery, University of Cologne, 50931 Germany.
|
1079
|
Abstract
This article describes the theoretical and practical issues in experimental design for gene expression microarrays. Specifically, this article 1) discusses the basic principles of design (randomization, replication, and blocking) as they pertain to microarrays, and 2) provides some general guidelines for statisticians designing microarray studies.
Affiliation(s)
- M Kathleen Kerr
- Department of Biostatistics, University of Washington, Box 357232, Seattle, Washington, USA.
|
1080
|
Abstract
Due to the advent of high-throughput microarray technology, it has become possible to develop molecular classification systems for various types of cancer. In this article, we propose a methodology using regularized regression models for the classification of tumors in microarray experiments. The performances of principal components, partial least squares, and ridge regression models are studied; these regression procedures are adapted to the classification setting using the optimal scoring algorithm. We also develop a procedure for ranking genes based on the fitted regression models. The proposed methodologies are applied to two microarray studies in cancer.
Affiliation(s)
- Debashis Ghosh
- Department of Biostatistics, University of Michigan, 1420 Washington Heights, Ann Arbor, Michigan 48105, USA.
|
1081
|
On partial least squares dimension reduction for microarray-based classification: a simulation study. Comput Stat Data Anal 2004. [DOI: 10.1016/j.csda.2003.08.001] [Citation(s) in RCA: 60] [Impact Index Per Article: 2.9] [Indexed: 11/20/2022]
|
1082
|
Jain S, Watson MA, DeBenedetti MK, Hiraki Y, Moley JF, Milbrandt J. Expression Profiles Provide Insights into Early Malignant Potential and Skeletal Abnormalities in Multiple Endocrine Neoplasia Type 2B Syndrome Tumors. Cancer Res 2004; 64:3907-13. [PMID: 15173001 DOI: 10.1158/0008-5472.can-03-3801] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.7] [Indexed: 11/16/2022]
Abstract
Identifying the molecular basis for genotype-phenotype correlations in human diseases has direct implications for understanding the disease process and hence for the identification of potential therapeutic targets. To this end, we performed microarray expression analysis on benign (pheochromocytomas) and malignant (medullary thyroid carcinomas, MTCs) tumors from patients with multiple endocrine neoplasia (MEN) type 2A or 2B, related syndromes that result from distinctive mutations in the RET receptor tyrosine kinase. Comparisons of MEN 2B and MEN 2A MTCs revealed that genes involved in the process of epithelial to mesenchymal transition, many associated with the tumor growth factor beta pathway, were up-regulated in MEN 2B MTCs. This MEN 2B MTC profile may explain the early onset of malignancy in MEN 2B compared with MEN 2A patients. Furthermore, chondromodulin-1, a known regulator of cartilage and bone growth, was expressed at high levels specifically in MEN 2B MTCs. Chondromodulin-1 mRNA and protein expression was localized to the malignant C cells, and its high expression was directly associated with the presence of skeletal abnormalities in MEN 2B patients. These findings provide molecular evidence that helps explain the previously unexplained skeletal abnormalities and early malignancy in MEN 2B compared with MEN 2A syndrome.
Affiliation(s)
- Sanjay Jain
- Department of Pathology, Washington University School of Medicine, St Louis, Missouri 63110, USA
|
1083
|
Wolfe P, Murphy J, McGinley J, Zhu Z, Jiang W, Gottschall EB, Thompson HJ. Using Nuclear Morphometry to Discriminate the Tumorigenic Potential of Cells: A Comparison of Statistical Methods. Cancer Epidemiol Biomarkers Prev 2004. [DOI: 10.1158/1055-9965.976.13.6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022] Open
Abstract
Despite interest in the use of nuclear morphometry for cancer diagnosis and prognosis as well as to monitor changes in cancer risk, no generally accepted statistical method has emerged for the analysis of these data. To evaluate different statistical approaches, Feulgen-stained nuclei from a human lung epithelial cell line, BEAS-2B, and a human lung adenocarcinoma (non-small cell) cancer cell line, NCI-H522, were subjected to morphometric analysis using a CAS-200 imaging system. The morphometric characteristics of these two cell lines differed significantly. Therefore, we proceeded to address the question of which statistical approach was most effective in classifying individual cells into the cell lines from which they were derived. The statistical techniques evaluated ranged from simple, traditional, parametric approaches to newer machine learning techniques. The multivariate techniques were compared based on a systematic cross-validation approach using 10 fixed partitions of the data to compute the misclassification rate for each method. For comparisons across cell lines at the level of each morphometric feature, we found little to distinguish nonparametric from parametric approaches. Among the linear models applied, logistic regression had the highest percentage of correct classifications; among the nonlinear and nonparametric methods applied, the Classification and Regression Trees model provided the highest percentage of correct classifications. Classification and Regression Trees has appealing characteristics: there are no assumptions about the distribution of the variables to be used, there is no need to specify which interactions to test, and there is no difficulty in handling complex, high-dimensional data sets containing mixed data types.
Affiliation(s)
- Pamela Wolfe
- Cancer Prevention Laboratory, Colorado State University, Fort Collins, Colorado
- James Murphy
- Cancer Prevention Laboratory, Colorado State University, Fort Collins, Colorado
- John McGinley
- Departments of Biometrics and Occupational Medicine, National Jewish Medical and Research Center, Denver, Colorado
- Zongjian Zhu
- Cancer Prevention Laboratory, Colorado State University, Fort Collins, Colorado
- Weiqin Jiang
- Cancer Prevention Laboratory, Colorado State University, Fort Collins, Colorado
- E. Brigitte Gottschall
- Departments of Biometrics and Occupational Medicine, National Jewish Medical and Research Center, Denver, Colorado
- Henry J. Thompson
- Cancer Prevention Laboratory, Colorado State University, Fort Collins, Colorado
|
1084
|
Inza I, Larrañaga P, Blanco R, Cerrolaza AJ. Filter versus wrapper gene selection approaches in DNA microarray domains. Artif Intell Med 2004; 31:91-103. [PMID: 15219288 DOI: 10.1016/j.artmed.2004.01.007] [Citation(s) in RCA: 171] [Impact Index Per Article: 8.1] [Received: 02/24/2003] [Revised: 07/22/2003] [Accepted: 01/16/2004] [Indexed: 11/18/2022]
Abstract
DNA microarray experiments, which generate thousands of gene expression measurements, are used to collect information from tissue and cell samples regarding gene expression differences that could be useful for disease diagnosis, distinction of the specific tumor type, etc. One important application of gene expression microarray data is the classification of samples into known categories. Because DNA microarray technology measures gene expression en masse, the resulting data have a number of features (genes) far exceeding the number of samples. As the predictive accuracy of supervised classifiers that try to discriminate between the classes of the problem decays in the presence of irrelevant and redundant features, a dimensionality reduction process is essential. We propose the application of a gene selection process, which also enables the biology researcher to focus on promising gene candidates that actively contribute to classification in these large-scale microarrays. Two basic approaches for feature selection appear in the machine learning and pattern recognition literature: the filter and wrapper techniques. Filter procedures are used in most of the works in the area of DNA microarrays. In this work, a comparison between a group of different filter metrics and a wrapper sequential search procedure is carried out. The comparison is performed on two well-known DNA microarray datasets using four classic supervised classifiers. The study is carried out over the original continuous and the three-interval discretized gene expression data. While two well-known filter metrics are proposed for continuous data, four classic filter measures are used over discretized data. The same wrapper approach is used for both continuous and discretized data.
The application of filter and wrapper gene selection procedures leads to considerably better accuracy than the approach without gene selection, coupled with notable dimensionality reductions. Although the wrapper approach is mostly more accurate than the filter metrics, this improvement comes at a considerable computational cost. We note that most of the genes selected by the proposed filter and wrapper procedures in discrete and continuous microarray data appear in the lists of relevant, informative genes detected by previous studies over these datasets. The aim of this work is to contribute to the field of gene selection in DNA microarray datasets. Through an extensive comparison with the more popular filter techniques, we would like to contribute to the expansion and study of the wrapper approach in this type of domain.
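The distinction between the two families of gene selection can be illustrated with a minimal sketch. Everything in it is an assumption for illustration rather than the authors' setup: the filter score is a simple signal-to-noise-style ratio, the wrapper is a sequential forward search scored by leave-one-out accuracy of a nearest-centroid classifier, and the toy matrix `X` (samples by genes) and labels `y` are made up.

```python
from statistics import mean, pstdev

def filter_rank(X, y):
    """Filter: score each gene on its own, without any classifier."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, c in zip(X, y) if c == 0]
        b = [row[g] for row, c in zip(X, y) if c == 1]
        spread = (pstdev(a) + pstdev(b)) or 1e-12  # avoid division by zero
        scores.append(abs(mean(a) - mean(b)) / spread)
    return sorted(range(len(scores)), key=lambda g: -scores[g])

def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on `genes`."""
    correct = 0
    for i in range(len(X)):
        best_label, best_dist = None, float("inf")
        for label in sorted(set(y)):
            members = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            if not members:
                continue
            dist = sum((X[i][g] - mean(m[g] for m in members)) ** 2 for g in genes)
            if dist < best_dist:
                best_label, best_dist = label, dist
        if best_label == y[i]:
            correct += 1
    return correct / len(X)

def wrapper_forward(X, y, max_genes):
    """Wrapper: add the gene that most improves classifier accuracy."""
    selected, best = [], 0.0
    while len(selected) < max_genes:
        candidates = [g for g in range(len(X[0])) if g not in selected]
        acc, g = max(((loo_accuracy(X, y, selected + [c]), c) for c in candidates),
                     key=lambda s: (s[0], -s[1]))
        if acc <= best:  # stop when no candidate improves accuracy
            break
        selected.append(g)
        best = acc
    return selected, best

# Hypothetical toy data: 6 samples, 3 genes; gene 1 separates the classes.
X = [[5.0, 1.0, 2.0], [3.0, 1.2, 2.5], [4.0, 0.9, 3.9],
     [4.5, 3.0, 4.0], [3.5, 3.2, 4.2], [5.5, 2.9, 2.2]]
y = [0, 0, 0, 1, 1, 1]

ranking = filter_rank(X, y)
selected, accuracy = wrapper_forward(X, y, max_genes=2)
print(ranking, selected, accuracy)
```

The filter needs one pass over the genes, whereas the wrapper retrains and re-evaluates the classifier for every candidate gene at every step, which is where the extra computational cost mentioned in the abstract comes from.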
Affiliation(s)
- Iñaki Inza
- Department of Computer Science and Artificial Intelligence, University of the Basque Country, P.O. Box 649, E-20080 Donostia-San Sebastián, Basque Country, Spain.
|
1085
|
Xu T, Shu CT, Purdom E, Dang D, Ilsley D, Guo Y, Weber J, Holmes SP, Lee PP. Microarray Analysis Reveals Differences in Gene Expression of Circulating CD8+ T Cells in Melanoma Patients and Healthy Donors. Cancer Res 2004; 64:3661-7. [PMID: 15150126 DOI: 10.1158/0008-5472.can-03-3396] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
Circulating T cells from many cancer patients are known to be dysfunctional and undergo spontaneous apoptosis. We used microarray technology to determine whether gene expression differences exist in T cells from melanoma patients versus healthy subjects, which may underlie these abnormalities. To maximize the resolution of our data, we sort-purified CD8+ subsets and amplified the extracted RNA for microarray analysis. These analyses show subtle but statistically significant expression differences for 10 genes in T cells from melanoma patients versus healthy controls, which were additionally confirmed by quantitative real-time PCR analysis. Whereas none of these genes are members of the classical apoptosis pathways, several may be linked to apoptosis. To further investigate the significance of these 10 genes, we combined them into a classifier and found that they provide much better discrimination between melanoma and healthy T cells than a classifier built solely from classical apoptosis-related genes. These results suggest the possible engagement of an alternative apoptosis pathway in circulating T cells from cancer patients.
Affiliation(s)
- Tong Xu
- Division of Hematology, Stanford University School of Medicine, Stanford, CA 94305, USA
|
1086
|
Ellis M, Ballman K. Trawling for genes that predict response to breast cancer adjuvant therapy. J Clin Oncol 2004; 22:2267-9. [PMID: 15136594 DOI: 10.1200/jco.2004.03.950] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 11/20/2022] Open
|
1087
|
Chung CH, Parker JS, Karaca G, Wu J, Funkhouser WK, Moore D, Butterfoss D, Xiang D, Zanation A, Yin X, Shockley WW, Weissler MC, Dressler LG, Shores CG, Yarbrough WG, Perou CM. Molecular classification of head and neck squamous cell carcinomas using patterns of gene expression. Cancer Cell 2004; 5:489-500. [PMID: 15144956 DOI: 10.1016/s1535-6108(04)00112-6] [Citation(s) in RCA: 473] [Impact Index Per Article: 22.5] [Received: 11/04/2003] [Revised: 02/02/2004] [Accepted: 03/09/2004] [Indexed: 12/15/2022]
Abstract
The prognostication of head and neck squamous cell carcinoma (HNSCC) is largely based upon the tumor size and location and the presence of lymph node metastases. Here we show that gene expression patterns from 60 HNSCC samples assayed on cDNA microarrays allowed categorization of these tumors into four distinct subtypes. These subtypes showed statistically significant differences in recurrence-free survival and included a subtype with a possible EGFR-pathway signature, a mesenchymal-enriched subtype, a normal epithelium-like subtype, and a subtype with high levels of antioxidant enzymes. Supervised analyses to predict lymph node metastasis status were approximately 80% accurate when tumor subsite and pathological node status were considered simultaneously. This work represents an important step toward the identification of clinically significant biomarkers for HNSCC.
Affiliation(s)
- Christine H Chung
- Division of Hematology/Oncology, Department of Medicine, Vanderbilt-Ingram Cancer Center, Vanderbilt University School of Medicine, Nashville, Tennessee 37232, USA
|
1088
|
Meireles SI, Cristo EB, Carvalho AF, Hirata R, Pelosof A, Gomes LI, Martins WK, Begnami MD, Zitron C, Montagnini AL, Soares FA, Neves EJ, Reis LFL. Molecular classifiers for gastric cancer and nonmalignant diseases of the gastric mucosa. Cancer Res 2004; 64:1255-65. [PMID: 14973074 DOI: 10.1158/0008-5472.can-03-1850] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.1] [Indexed: 11/16/2022]
Abstract
The high incidence of gastric cancer-related death is mainly due to diagnosis at an advanced stage in addition to the lack of adequate neoadjuvant therapy. Hence, new tools aimed at early diagnosis would have a positive impact on the outcome of the disease. Using cDNA arrays having 376 genes either identified previously as altered in gastric tumors or known to be altered in human cancer, we determined the expression signatures of 99 tissue fragments representing normal gastric mucosa, gastritis, intestinal metaplasia, and adenocarcinomas. We first validated the array by identifying molecular markers that are associated with intestinal metaplasia, considered a transition stage of gastric adenocarcinomas of the intestinal type, as well as markers that are associated with the diffuse type of gastric adenocarcinomas. Next, we applied Fisher's linear discriminant analysis in an exhaustive search of trios of genes that could be used to build classifiers for class distinction. Many classifiers could distinguish between normal and tumor samples, whereas, for the distinction of gastritis from tumor and of metaplasia from tumor, fewer classifiers were identified. Statistical validations showed that trios that discriminate between normal and tumor samples are powerful classifiers to distinguish between tumor and nontumor samples. More relevant, it was possible to identify samples of intestinal metaplasia that have an expression signature resembling that of an adenocarcinoma and can now be used for follow-up of patients to determine their potential as a prognostic test for malignant transformation.
Affiliation(s)
- Sibele I Meireles
- Ludwig Institute for Cancer Research, Rua Prof. Antonio Prudente 109, São Paulo, SP 01509-010, Brazil
|
1089
|
Kim C. Statistical Methods for Gene Expression Data. COMMUNICATIONS FOR STATISTICAL APPLICATIONS AND METHODS 2004. [DOI: 10.5351/ckss.2004.11.1.059] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 11/11/2022]
|
1090
|
Abstract
Predicted survival probability functions of censored event free survival are improved by bagging survival trees. We suggest a new method to aggregate survival trees in order to obtain better predictions for breast cancer and lymphoma patients. A set of survival trees based on B bootstrap samples is computed. We define the aggregated Kaplan-Meier curve of a new observation by the Kaplan-Meier curve of all observations identified by the B leaves containing the new observation. The integrated Brier score is used for the evaluation of predictive models. We analyse data of a large trial on node positive breast cancer patients conducted by the German Breast Cancer Study Group and a smaller 'pilot' study on diffuse large B-cell lymphoma, where prognostic factors are derived from microarray expression values. In addition, simulation experiments underline the predictive power of our proposal.
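The aggregation step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the survival trees themselves are not built here, the `leaves` input (one list of `(time, event)` pairs per bootstrap tree, holding the observations that share a leaf with the new observation) is hypothetical, and observations are pooled with repetition across the B leaves.

```python
from collections import Counter

def kaplan_meier(observations):
    """Return [(time, survival)] at each distinct event time.

    observations: iterable of (time, event), with event=1 for an observed
    event and event=0 for a censored time.
    """
    obs = list(observations)
    n_at_time = Counter(t for t, _ in obs)
    d_at_time = Counter(t for t, e in obs if e == 1)
    at_risk = len(obs)
    survival = 1.0
    curve = []
    for t in sorted(n_at_time):
        d = d_at_time[t]
        if d > 0:
            # Product-limit update: multiply by the fraction surviving at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= n_at_time[t]  # events and censored cases leave the risk set
    return curve

def aggregated_kaplan_meier(leaves):
    """Pool the observations from all B leaves, then fit one KM curve."""
    pooled = [ob for leaf in leaves for ob in leaf]
    return kaplan_meier(pooled)

# Hypothetical leaves from B = 2 bootstrap trees for one new observation.
leaves = [
    [(2, 1), (5, 0), (7, 1)],
    [(2, 1), (3, 1), (7, 1)],
]
curve = aggregated_kaplan_meier(leaves)
print(curve)
```

Pooling before estimating means that leaves with more observations contribute more to the aggregated curve than small leaves do.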
Affiliation(s)
- Torsten Hothorn
- Department of Medical Informatics, Biometry and Epidemiology, Friedrich-Alexander-University, Erlangen-Nuremberg, Waldstrasse 6, D-91054 Erlangen, Germany
|
1091
|
Abstract
DNA microarrays are a potentially powerful technology for improving diagnostic classification, treatment selection and prognostic assessment. There are, however, many potential pitfalls in the use of microarrays that result in false leads and erroneous conclusions. Effective use of this technology requires new levels of interdisciplinary collaboration with statistical and computational scientists. This paper provides a review of the key features to be observed in developing diagnostic and prognostic classification systems based upon gene expression profiling. It also attempts to outline some of the steps needed to develop initial microarray research findings into classification systems suitable for broad clinical application.
Affiliation(s)
- Richard Simon
- Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, 9000 Rockville Pike, MSC #7434, Bethesda, MD 20892, USA.
|
1092
|
Bueno R, Loughlin KR, Powell MH, Gordon GJ. A diagnostic test for prostate cancer from gene expression profiling data. J Urol 2004; 171:903-6. [PMID: 14713850 DOI: 10.1097/01.ju.0000095446.10443.52] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.4] [Indexed: 01/03/2023]
Abstract
PURPOSE: Multiple recent studies show excellent classification accuracy using bioinformatics tools applied to expression profiling data on various tumors. However, the clinical applicability of these techniques remains unfulfilled because of difficulty in translating complex multigene mathematical algorithms into reproducible, platform independent tests. We recently developed a broadly applicable platform independent method based on simple ratios of gene expression to diagnose and predict outcome in cancer. In the current study we applied this technique to the diagnosis of prostate cancer. MATERIALS AND METHODS: We developed a ratio based predictive model using a training set of 32 samples with previously published gene profiling data. We then tested and refined the model using additional independent samples with previously published microarray data from another source (that is, the test set of 34 samples). Finally, the optimal ratio based test was examined with quantitative reverse transcriptase-polymerase chain reaction for data acquisition in a third cohort of samples consisting of 10 frozen normal and 10 tumor prostate tissues. RESULTS: A 3-ratio test using 4 genes was 90% accurate (18 of 20 samples) for distinguishing normal prostate and prostate cancer samples obtained at surgery (Fisher's exact test p = 0.0007). This test did not result in any false-negative findings. CONCLUSIONS: We describe and validate a new gene ratio based test for the diagnosis of prostate cancer, which was developed from the analysis of extensive gene profiling data. This test can be easily adapted to the clinical arena without the need for complex computer software or hardware. We anticipate that the gene ratio based diagnosis of prostate cancer using fine needle aspirations could serve as a useful adjunct to standard histopathological techniques.
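The reported accuracy and p-value can be checked with a short calculation. The sketch below assumes a 2x2 table that the abstract implies but does not state: all 10 tumor samples called tumor (no false negatives) and 8 of the 10 normal samples called normal. The function name is illustrative, and the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # The small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Assumed table: rows are normal/tumor, columns are called normal/called tumor.
p_value = fisher_exact_two_sided(8, 2, 0, 10)
accuracy = (8 + 10) / 20
print(round(p_value, 4), accuracy)  # prints: 0.0007 0.9
```

Under this assumed table the calculation reproduces both figures quoted in the abstract, 18 of 20 correct and p = 0.0007.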
Affiliation(s)
- Raphael Bueno
- Thoracic Surgery Oncology Laboratory and Division of Thoracic Surgery, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts 02115, USA.
|
1093
|
Peng S, Xu Q, Ling XB, Peng X, Du W, Chen L. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Lett 2004; 555:358-62. [PMID: 14644442 DOI: 10.1016/s0014-5793(03)01275-4] [Citation(s) in RCA: 82] [Impact Index Per Article: 3.9] [Indexed: 12/12/2022]
Abstract
Simultaneous multiclass classification of tumor types is essential for future clinical implementations of microarray-based cancer diagnosis. In this study, we have combined genetic algorithms (GAs) and all paired support vector machines (SVMs) for multiclass cancer identification. The predictive features have been selected through iterative SVMs/GAs, and recursive feature elimination post-processing steps, leading to a very compact cancer-related predictive gene set. Leave-one-out cross-validations yielded accuracies of 87.93% for the eight-class and 85.19% for the fourteen-class cancer classifications, outperforming the results derived from previously published methods.
Affiliation(s)
- Sihua Peng
- National Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, PR China
|
1094
|
Yoshida R, Higuchi T, Imoto S. A mixed factors model for dimension reduction and extraction of a group structure in gene expression data. Proceedings. IEEE Computational Systems Bioinformatics Conference 2004:161-72. [PMID: 16448010 DOI: 10.1109/csb.2004.1332429] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 05/06/2023]
Abstract
When we cluster tissue samples on the basis of genes, the number of observations to be grouped is much smaller than the dimension of the feature vector. In such a case, the applicability of conventional model-based clustering is limited, since the high dimensionality of the feature vector leads to overfitting during the density estimation process. To overcome this difficulty, we attempt a methodological extension of factor analysis. Our approach enables us not only to prevent overfitting, but also to handle the issues of clustering, data compression, and extraction of a set of genes relevant to explaining the group structure. Its potential usefulness is demonstrated with an application to the leukemia dataset.
Affiliation(s)
- Ryo Yoshida
- The Graduate University for Advanced Studies, Minato-ku, Tokyo, Japan.
|
1095
|
Affiliation(s)
- Helen Kim
- Department of Pharmacology and Toxicology, University of Alabama at Birmingham, Birmingham, Alabama 35294, USA.
|
1096
|
Valentini G, Muselli M, Ruffino F. Cancer recognition with bagged ensembles of support vector machines. Neurocomputing 2004. [DOI: 10.1016/j.neucom.2003.09.001] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.7] [Indexed: 12/30/2022]
|
1097
|
Simon R. Diagnostic and prognostic prediction using gene expression profiles in high-dimensional microarray data. Br J Cancer 2003; 89:1599-604. [PMID: 14583755 PMCID: PMC2394420 DOI: 10.1038/sj.bjc.6601326] [Citation(s) in RCA: 98] [Impact Index Per Article: 4.5] [Indexed: 12/02/2022]
Abstract
DNA microarrays are a potentially powerful technology for improving diagnostic classification, treatment selection and therapeutics development. There are, however, many potential pitfalls in the use of microarrays that result in false leads and erroneous conclusions. This paper provides a review of the key features to be observed in developing diagnostic and prognostic classification systems based on gene expression profiling and some of the pitfalls to be aware of in reading reports of microarray-based studies.
Affiliation(s)
- R Simon
- Biometric Research Branch, Division of Cancer Treatment & Diagnosis, National Cancer Institute, National Institutes of Health, 9000 Rockville Pike, MSC #7434, Bethesda, MD 20892, USA.
|
1098
|
Shedden KA, Taylor JMG, Giordano TJ, Kuick R, Misek DE, Rennert G, Schwartz DR, Gruber SB, Logsdon C, Simeone D, Kardia SLR, Greenson JK, Cho KR, Beer DG, Fearon ER, Hanash S. Accurate molecular classification of human cancers based on gene expression using a simple classifier with a pathological tree-based framework. The American Journal of Pathology 2003; 163:1985-95. [PMID: 14578198 DOI: 10.1016/s0002-9440(10)63557-2] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.8] [Indexed: 10/18/2022]
Abstract
Recent studies suggest accurate prediction of tissue of origin for human cancers can be achieved by applying sophisticated statistical learning procedures to gene expression data obtained from DNA microarrays. We have pursued the hypothesis that a more straightforward and equally accurate strategy for classifying human tumors is to use a simple algorithm that considers gene expression levels within a tree-based framework that encodes limited information about pathology and tissue ontogeny. By considering gene expression data within this framework, we found only a small number of genes were required to achieve a relatively high accuracy level in tumor classification. Using as few as 45 genes we were able to classify 157 of 190 human malignant tumors correctly, which is comparable to previous results obtained with sophisticated classifiers using thousands of genes. Our simple classifier accurately predicted the origin of metastatic tumors even when the classifier was trained using only primary tumors, and the classifier produced accurate predictions when trained and tested on expression data from different labs, and from different microarray platforms. Our findings suggest that accurate and robust cancer diagnosis from gene expression profiles can be achieved by mimicking the classification strategies routinely used by surgical pathologists.
Affiliation(s)
- Kerby A Shedden
- Department of Statistics, University of Michigan, Ann Arbor, MI 48109-1027, USA.
|
1099
|
Abstract
A variety of new procedures have been devised to handle the two-sample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays. Such new methods are required in part because of some defining characteristics of microarray-based studies: (i) the very large number of genes contributing expression measures which far exceeds the number of samples (observations) available and (ii) the fact that by virtue of pathway/network relationships, the gene expression measures tend to be highly correlated. These concerns are exacerbated in the regression setting, where the objective is to relate gene expression, simultaneously for multiple genes, to some external outcome or phenotype. Correspondingly, several methods have been recently proposed for addressing these issues. We briefly critique some of these methods prior to a detailed evaluation of gene harvesting. This reveals that gene harvesting, without additional constraints, can yield artifactual solutions. Results obtained employing such constraints motivate the use of regularized regression procedures such as the lasso, least angle regression, and support vector machines. Model selection and solution multiplicity issues are also discussed. The methods are evaluated using a microarray-based study of cardiomyopathy in transgenic mice.
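As an illustration of the regularized regression mentioned above, here is a minimal lasso sketch using cyclic coordinate descent with soft-thresholding. It is not the authors' implementation. The objective (1/(2n))·||y − Xb||² + lam·||b||₁, the function names, and the toy data are all assumptions made for the example. The toy design has more features than samples (p = 3, n = 2) and two identical columns, a stand-in for highly correlated genes.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            if scale == 0.0:
                continue  # an all-zero column carries no information
            rho = 0.0
            for i in range(n):
                # Residual with feature j's own contribution left out.
                partial = y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * partial
            beta[j] = soft_threshold(rho / n, lam) / scale
    return beta


# Hypothetical toy data: columns 0 and 2 are identical.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
y = [3.0, 0.5]
beta_hat = lasso_coordinate_descent(X, y, lam=0.5)
print(beta_hat)  # prints [2.0, 0.0, 0.0]
```

The weak second feature is shrunk exactly to zero, and all weight for the duplicated feature lands on whichever copy is updated first. Any nonnegative split of that weight between columns 0 and 2 gives the same objective value, which is a small instance of the solution multiplicity the abstract refers to.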
Affiliation(s)
- Mark R Segal
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA 94143-0560, USA.
|
1100
|
Simon R. Supervised analysis when the number of candidate features (p) greatly exceeds the number of cases (n). ACM SIGKDD Explorations Newsletter 2003. [DOI: 10.1145/980972.980978] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.4] [Indexed: 10/23/2022]
Abstract
New genomic and proteomic technologies provide measurements of thousands of features for each case. This provides a context for enhanced discovery and false discovery. Most statistical and machine learning procedures were not developed for the p>>n setting, and the literature of DNA microarray studies contains many examples of misuse of analytic and computational methods such as cross-validation. This paper highlights some of the key aspects of p>>n problems for identifying informative features and developing accurate classifiers.
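The abstract does not say which misuse of cross-validation it has in mind. One commonly discussed form in the p>>n setting is selecting features on the full data set before cross-validating, and the sketch below illustrates that form only. All data are simulated pure noise, so an honest accuracy estimate should be near chance. The nearest-centroid classifier, the mean-difference feature score, and all names and sizes are assumptions chosen for brevity.

```python
import random


def top_features(X, y, k):
    """Indices of the k features with the largest absolute class-mean difference."""
    idx1 = [i for i, label in enumerate(y) if label == 1]
    idx0 = [i for i, label in enumerate(y) if label == 0]

    def score(j):
        mean1 = sum(X[i][j] for i in idx1) / len(idx1)
        mean0 = sum(X[i][j] for i in idx0) / len(idx0)
        return abs(mean1 - mean0)

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def nearest_centroid(X, y, feats, x_new):
    """Predict the class whose centroid, on feats only, is closest to x_new."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [X[i] for i in range(len(X)) if y[i] == label]
        dist = sum(
            (x_new[j] - sum(r[j] for r in rows) / len(rows)) ** 2 for j in feats
        )
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, k, select_inside):
    """Leave-one-out accuracy, with feature selection inside or outside the loop."""
    fixed = None if select_inside else top_features(X, y, k)
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = top_features(X_train, y_train, k) if select_inside else fixed
        if nearest_centroid(X_train, y_train, feats, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


rng = random.Random(0)
n, p, k = 20, 1000, 10  # far more features than cases
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [0] * 10 + [1] * 10  # labels are unrelated to the features
biased = loocv_accuracy(X, y, k, select_inside=False)
honest = loocv_accuracy(X, y, k, select_inside=True)
print(f"selection outside the loop: {biased:.2f}")
print(f"selection inside the loop:  {honest:.2f}")
```

Because the held-out case has already influenced which features were chosen, the first estimate is optimistically biased even though there is no signal to find. Repeating the selection within each fold removes that leak.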
|