851
|
Park T, Kim K, Yi SG, Kim JH, Lee YS, Lee S. Spot intensity ratio statistics in two-channel microarray experiments. J Bioinform Comput Biol 2007; 5:865-73. [PMID: 17787060 DOI: 10.1142/s0219720007002928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2006] [Revised: 04/02/2007] [Accepted: 04/11/2007] [Indexed: 11/18/2022]
Abstract
In two-channel microarray experiments, the image analysis extracts red and green fluorescence intensities. The ratio of the two fluorescence intensities represents the relative abundance of the corresponding DNA sequence. The subsequent analysis is performed by taking a log-transformation of this ratio. Therefore, the statistical analyses depend on the accuracy of the ratios calculated from the image analysis. However, few studies have proposed more reliable ratio statistics. In this paper, we consider a new type of log-transformed ratio statistic. We compare the new ratio statistic with the conventional ratio statistic commonly used in two-channel microarray experiments. First, under a specific log-normal distributional assumption, we compare the new statistic analytically with the conventional ratio statistic. Second, we compare the two ratio statistics using two-channel microarray data obtained by hybridizing a mixture of mouse RNA and yeast in vitro transcript (IVT). Both comparisons show that the proposed ratio statistic performs better than the conventional one.
Affiliation(s)
- Taesung Park
- Department of Statistics, Seoul National University, Seoul, Korea.
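For orientation, the conventional log-transformed ratio mentioned in the abstract can be sketched as follows. This is a minimal illustration only: the function name `log_ratio` and the spot intensities are hypothetical, base 2 is an assumed (though common) choice, and the new statistic proposed by the authors is not reproduced here because the abstract does not define it.

```python
import math

def log_ratio(red, green):
    """Conventional log2 ratio of red to green intensity for one spot."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical (red, green) intensities for three spots.
spots = [(1200.0, 300.0), (500.0, 500.0), (150.0, 600.0)]
m_values = [log_ratio(r, g) for r, g in spots]
```

A positive value indicates higher signal in the red channel, zero indicates equal signal, and a negative value indicates higher signal in the green channel.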
|
852
|
Parker BJ, Günter S, Bedo J. Stratification bias in low signal microarray studies. BMC Bioinformatics 2007; 8:326. [PMID: 17764577 PMCID: PMC2211509 DOI: 10.1186/1471-2105-8-326] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.7] [Received: 02/06/2007] [Accepted: 09/02/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. 
As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
Affiliation(s)
- Brian J Parker
- Statistical Machine Learning Group, NICTA, Canberra, Australia
- Life Sciences Group, NICTA, Melbourne, Australia
- Research School of Information Sciences and Engineering, Australian National University, Canberra, Australia
- Simon Günter
- Statistical Machine Learning Group, NICTA, Canberra, Australia
- Research School of Information Sciences and Engineering, Australian National University, Canberra, Australia
- Justin Bedo
- Statistical Machine Learning Group, NICTA, Canberra, Australia
- Life Sciences Group, NICTA, Melbourne, Australia
- Research School of Information Sciences and Engineering, Australian National University, Canberra, Australia
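The negative dependency between training and test class proportions described in the abstract can be seen in a toy case. The sketch below is not the authors' simulation: it assumes a hypothetical feature-free "classifier" that scores each held-out sample by the fraction of positives in its training set, runs unbalanced leave-one-out cross-validation, and pools the scores before computing the AUC. The function names are illustrative.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def loo_pooled_auc(labels):
    """Leave-one-out CV with a prior-only scorer; scores are pooled."""
    n = len(labels)
    total_pos = sum(labels)
    scores = []
    for y in labels:
        # Removing the held-out sample shifts the training class
        # proportion away from the held-out sample's own class.
        train_pos = total_pos - y
        scores.append(train_pos / (n - 1))
    return auc(scores, labels)

balanced = loo_pooled_auc([1] * 10 + [0] * 10)
imbalanced = loo_pooled_auc([1] * 5 + [0] * 15)
```

In both cases every held-out positive receives a lower score than every held-out negative, so the pooled AUC is 0 even though the scorer uses no features at all, rather than the 0.5 one might expect from an uninformative model.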
|
853
|
Zhang C, Fu H, Jiang Y, Yu T. High-dimensional pseudo-logistic regression and classification with applications to gene expression data. Comput Stat Data Anal 2007. [DOI: 10.1016/j.csda.2006.12.033] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/25/2022]
|
854
|
|
855
|
Class prediction and gene selection for DNA microarrays using regularized sliced inverse regression. Comput Stat Data Anal 2007. [DOI: 10.1016/j.csda.2007.02.005] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 11/21/2022]
|
856
|
Forrester S, Hung KE, Kuick R, Kucherlapati R, Haab BB. Low-volume, high-throughput sandwich immunoassays for profiling plasma proteins in mice: identification of early-stage systemic inflammation in a mouse model of intestinal cancer. Mol Oncol 2007; 1:216-25. [PMID: 19305640 PMCID: PMC2658882 DOI: 10.1016/j.molonc.2007.06.001] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Received: 04/10/2007] [Revised: 06/01/2007] [Accepted: 06/01/2007] [Indexed: 12/20/2022] Open
Abstract
Mouse models of human cancers may provide a valuable resource for the discovery of cancer biomarkers. We have developed a practical strategy for profiling specific proteins in mouse plasma using low-volume sandwich-immunoassays. We used this method to profile the levels of 14 different cytokines, acute-phase reactants, and other cancer markers in plasma from a mouse model of intestinal tumors and their wild-type littermates, using as little as 1.5 microliters of diluted plasma per assay. Many of the proteins were significantly and consistently up-regulated in the mutant mice. The mutant mice could be distinguished nearly perfectly from the wild-type mice based on the combined levels of as few as three markers. Many of the proteins were up-regulated even in the mutant mice with few or no tumors, suggesting the presence of a systemic host response at an early stage of cancer development. These results have implications for the study of host responses in mouse models of cancers and demonstrate the value of a new low-volume, high-throughput sandwich-immunoassay method for sensitively profiling protein levels in cancer.
Affiliation(s)
- Sara Forrester
- Van Andel Research Institute, 333 Bostwick, Grand Rapids, MI 49503, USA
- Kenneth E. Hung
- Partners Healthcare Center for Genetics and Genomics, Harvard Medical School, Boston, MA 02115, USA
- Rork Kuick
- University of Michigan Cancer Center Biostatistics Cores, University of Michigan, Ann Arbor, MI 48109, USA
- Raju Kucherlapati
- Partners Healthcare Center for Genetics and Genomics, Harvard Medical School, Boston, MA 02115, USA
- Brian B. Haab
- Van Andel Research Institute, 333 Bostwick, Grand Rapids, MI 49503, USA
|
857
|
Li J, Su H, Chen H, Futscher BW. Optimal search-based gene subset selection for gene array cancer classification. IEEE Trans Inf Technol Biomed 2007; 11:398-405. [PMID: 17674622 DOI: 10.1109/titb.2007.892693] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.9] [Indexed: 11/10/2022]
Abstract
High dimensionality has been a major problem for gene array-based cancer classification. It is critical to identify marker genes for cancer diagnoses. We developed a framework of gene selection methods based on previous studies. This paper focuses on optimal search-based subset selection methods because they evaluate the group performance of genes and help to pinpoint a globally optimal set of marker genes. Notably, this paper is the first to introduce tabu search (TS) to gene selection from high-dimensional gene array data. Our comparative study of gene selection methods demonstrated the effectiveness of optimal search-based gene subset selection in identifying cancer marker genes. TS was shown to be a promising tool for gene subset selection.
Affiliation(s)
- Jiexun Li
- Artificial Intelligence Laboratory, Department of Management Information Systems, Eller College of Management, University of Arizona, Tucson, AZ 85721, USA.
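The general idea of tabu search over gene subsets can be sketched as below. This is a generic textbook-style version, not the authors' implementation: the single-gene flip neighbourhood, the fixed tabu tenure, the aspiration rule, and the toy scoring function are all assumptions made for illustration. In practice the score would more plausibly be something like the cross-validated accuracy of a classifier trained on the subset.

```python
from collections import deque

def tabu_search(n_features, score, n_iter=50, tenure=3):
    """Search for a high-scoring feature subset with a short-term tabu list."""
    current = frozenset()
    best, best_score = current, score(current)
    tabu = deque(maxlen=tenure)  # recently flipped features
    for _ in range(n_iter):
        candidates = []
        for f in range(n_features):
            neighbor = current ^ {f}  # add f if absent, remove it if present
            s = score(neighbor)
            # Aspiration: a tabu move is allowed if it beats the best so far.
            if f not in tabu or s > best_score:
                candidates.append((s, f, neighbor))
        if not candidates:
            break
        # Take the best admissible move even if it worsens the score,
        # which lets the search leave local optima.
        s, f, current = max(candidates, key=lambda c: (c[0], -c[1]))
        tabu.append(f)
        if s > best_score:
            best, best_score = current, s
    return best, best_score

# Hypothetical toy score: reward three "informative" genes, penalise the rest.
informative = {1, 4, 7}

def toy_score(subset):
    return 2 * len(subset & informative) - len(subset - informative)

best, best_score = tabu_search(10, toy_score)
```

Because the subset is evaluated as a whole, the score reflects the group performance of the selected genes rather than a per-gene ranking.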
|
858
|
Sundaresh S, Randall A, Unal B, Petersen JM, Belisle JT, Hartley MG, Duffield M, Titball RW, Davies DH, Felgner PL, Baldi P. From protein microarrays to diagnostic antigen discovery: a study of the pathogen Francisella tularensis. Bioinformatics 2007; 23:i508-18. [PMID: 17646338 DOI: 10.1093/bioinformatics/btm207] [Citation(s) in RCA: 70] [Impact Index Per Article: 3.9] [Indexed: 11/13/2022]
Abstract
MOTIVATION An important application of protein microarray data analysis is identifying a serodiagnostic antigen set that can reliably detect patterns and classify antigen expression profiles. This work addresses this problem using antibody responses to protein markers measured by a novel high-throughput microarray technology. The findings from this study have direct relevance to rapid, broad-based diagnostic and vaccine development. RESULTS Protein microarray chips are probed with sera from individuals infected with the bacterium Francisella tularensis, a category A biodefense pathogen. A two-step approach to the diagnostic process is presented: (1) feature (antigen) selection and (2) classification using antigen response measurements obtained from F. tularensis microarrays (244 antigens, 46 infected and 54 healthy human sera measurements). To select antigens, a ranking scheme based on the identification of significant immune responses and differential expression analysis is described. Classification methods including k-nearest neighbors, support vector machines (SVM) and k-Means clustering are applied to training data using selected antigen sets of various sizes. SVM-based models yield prediction accuracy rates of approximately 90% on validation data when antigen set sizes are between 25 and 50. These results strongly indicate that the top-ranked antigens can be considered high-priority candidates for diagnostic development. AVAILABILITY All software programs are written in R and available at http://www.igb.uci.edu/index.php?page=tools and at http://www.r-project.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Suman Sundaresh
- School of Information and Computer Sciences, University of California, Irvine, CA, USA
|
859
|
Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007; 23:2507-17. [PMID: 17720704 DOI: 10.1093/bioinformatics/btm344] [Citation(s) in RCA: 2012] [Impact Index Per Article: 111.8] [Indexed: 12/20/2022] Open
Abstract
Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques. In this article, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques, and discussing their use, variety and potential in a number of both common as well as upcoming bioinformatics applications.
Affiliation(s)
- Yvan Saeys
- Department of Plant Systems Biology, VIB, B-9052 Ghent, Belgium.
|
860
|
Moon H, Ahn H, Kodell RL, Baek S, Lin CJ, Chen JJ. Ensemble methods for classification of patients for personalized medicine with high-dimensional data. Artif Intell Med 2007; 41:197-207. [PMID: 17719213 DOI: 10.1016/j.artmed.2007.07.003] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.6] [Received: 11/21/2006] [Revised: 06/18/2007] [Accepted: 07/06/2007] [Indexed: 10/22/2022]
Abstract
OBJECTIVE Personalized medicine is defined by the use of genomic signatures of patients in a target population for assignment of more effective therapies as well as better diagnosis and earlier interventions that might prevent or delay disease. An objective is to find a novel classification algorithm that can be used for prediction of response to therapy in order to help individualize clinical assignment of treatment. METHODS AND MATERIALS Classification algorithms are required to be highly accurate for optimal treatment on each patient. Typically, there are numerous genomic and clinical variables over a relatively small number of patients, which presents challenges for most traditional classification algorithms to avoid over-fitting the data. We developed a robust classification algorithm for high-dimensional data based on ensembles of classifiers built from the optimal number of random partitions of the feature space. The software is available on request from the authors. RESULTS The proposed algorithm is applied to genomic data sets on lymphoma patients and lung cancer patients to distinguish disease subtypes for optimal treatment and to genomic data on breast cancer patients to identify patients most likely to benefit from adjuvant chemotherapy after surgery. The performance of the proposed algorithm is consistently ranked highly compared to the other classification algorithms. CONCLUSION The statistical classification method for individualized treatment of diseases developed in this study is expected to play a critical role in developing safer and more effective therapies that replace one-size-fits-all drugs with treatments that focus on specific patient needs.
MESH Headings
- Adenocarcinoma/diagnosis
- Adenocarcinoma/genetics
- Adenocarcinoma/therapy
- Algorithms
- Breast Neoplasms/diagnosis
- Breast Neoplasms/genetics
- Breast Neoplasms/therapy
- Chemotherapy, Adjuvant
- Diagnosis, Computer-Assisted
- Female
- Gene Expression Regulation, Neoplastic
- Humans
- Lung Neoplasms/diagnosis
- Lung Neoplasms/genetics
- Lung Neoplasms/therapy
- Lymphoma, Large B-Cell, Diffuse/diagnosis
- Lymphoma, Large B-Cell, Diffuse/genetics
- Lymphoma, Large B-Cell, Diffuse/therapy
- Male
- Mesothelioma/diagnosis
- Mesothelioma/genetics
- Mesothelioma/therapy
- Models, Statistical
- Neoplasms/diagnosis
- Neoplasms/drug therapy
- Neoplasms/genetics
- Neoplasms/surgery
- Neoplasms/therapy
- Patient Selection
- Pleural Neoplasms/diagnosis
- Pleural Neoplasms/genetics
- Pleural Neoplasms/therapy
- Reproducibility of Results
- Software
- Treatment Outcome
Affiliation(s)
- Hojin Moon
- Department of Mathematics and Statistics, California State University-Long Beach, 1250 Bellflower Blvd., Long Beach, CA 90840, USA.
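The idea of building an ensemble from a random partition of the feature space can be sketched roughly as follows. This is only an illustration under stated assumptions: the nearest-centroid base learner, the fixed number of partitions, the single random partition, and the majority vote are all choices made here for brevity, and the abstract does not confirm that the authors' algorithm uses any of them (in particular, it selects an optimal number of partitions, which this sketch does not attempt). All function names are hypothetical.

```python
import random

def random_partition(n_features, n_parts, rng):
    """Split feature indices into n_parts disjoint random subsets."""
    idx = list(range(n_features))
    rng.shuffle(idx)
    return [idx[i::n_parts] for i in range(n_parts)]

def fit_centroids(X, y, feats):
    """Nearest-centroid base learner restricted to the features in feats."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    return centroids

def predict_centroids(centroids, feats, x):
    """Return the label whose centroid is closest to x on the given features."""
    def dist(c):
        return sum((x[f] - cj) ** 2 for f, cj in zip(feats, c))
    return min(sorted(centroids), key=lambda lab: dist(centroids[lab]))

def ensemble_predict(X, y, x_new, n_parts=3, seed=0):
    """Train one base learner per feature subset and take a majority vote."""
    rng = random.Random(seed)
    parts = random_partition(len(X[0]), n_parts, rng)
    votes = [predict_centroids(fit_centroids(X, y, p), p, x_new) for p in parts]
    return max(sorted(set(votes)), key=votes.count)

# Hypothetical toy data: 4 samples, 6 features, two well-separated classes.
X = [[0.0] * 6, [0.2] * 6, [1.0] * 6, [0.8] * 6]
y = [0, 0, 1, 1]
pred_high = ensemble_predict(X, y, [0.9] * 6)
pred_low = ensemble_predict(X, y, [0.1] * 6)
```

Each base learner sees only a small block of features, which is one way to limit over-fitting when there are many more variables than patients.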
|
861
|
Farcomeni A. A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion. Stat Methods Med Res 2007; 17:347-88. [PMID: 17698936 DOI: 10.1177/0962280206079046] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.7] [Indexed: 11/17/2022]
Abstract
In the last decade a growing amount of statistical research has been devoted to multiple testing, motivated by a variety of applications in medicine, bioinformatics, genomics, brain imaging, etc. Research in this area is focused on developing powerful procedures even when the number of tests is very large. This paper attempts to review research in modern multiple hypothesis testing with particular attention to the false discovery proportion, loosely defined as the number of false rejections divided by the number of rejections. We review the main ideas, stepwise and augmentation procedures, and resampling-based testing. We also discuss the problem of dependence among the test statistics. Simulations compare the procedures with one another and with Bayesian methods. We illustrate the procedures in applications in DNA microarray data analysis. Finally, a few possibilities for further research are highlighted.
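The false discovery proportion defined above can be computed directly once the rejections and the true nulls are known. The sketch below pairs it with the Benjamini-Hochberg step-up procedure, chosen here only as one familiar example of a stepwise procedure; the abstract does not say which procedures the review covers in detail. The p-values and the set of true nulls are made up for illustration, and the convention that the proportion is 0 when nothing is rejected is an assumption.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the sorted indices rejected by the BH step-up procedure at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its threshold.
        if pvalues[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])

def false_discovery_proportion(rejected, true_nulls):
    """Number of false rejections divided by the number of rejections."""
    if not rejected:
        return 0.0  # assumed convention when nothing is rejected
    false_rejections = sum(1 for i in rejected if i in true_nulls)
    return false_rejections / len(rejected)

# Hypothetical p-values; hypotheses 3, 4 and 5 are assumed to be true nulls.
pvalues = [0.001, 0.012, 0.02, 0.03, 0.5, 0.8]
true_nulls = {3, 4, 5}
rejected = benjamini_hochberg(pvalues, q=0.05)
fdp = false_discovery_proportion(rejected, true_nulls)
```

Here four hypotheses are rejected and one of them is a true null, giving a proportion of 0.25. In real data the set of true nulls is unknown, so the proportion is an unobserved random quantity that procedures aim to control or estimate.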
|
862
|
Ando T, Konishi S. Nonlinear logistic discrimination via regularized radial basis functions for classifying high-dimensional data. ANN I STAT MATH 2007. [DOI: 10.1007/s10463-007-0143-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/30/2022]
|
863
|
Banaś K, Jasiński A, Banaś AM, Gajda M, Dyduch G, Pawlicki B, Kwiatek WM. Application of Linear Discriminant Analysis in Prostate Cancer Research by Synchrotron Radiation-Induced X-Ray Emission. Anal Chem 2007; 79:6670-4. [PMID: 17672524 DOI: 10.1021/ac070931u] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
Abstract
The ability to visualize an object of interest is one of the cornerstones of advancement in science. For this reason, synchrotron radiation-induced X-ray emission (micro-SRIXE) holds special promise as an imaging technique in structural biology, biochemistry, and medicine. It makes it possible to image the concentration of most of the elements in a sample at high spatial resolution. Statistical analysis of data obtained for samples of prostate tissues in an experiment at L-beam line HASYLAB (Hamburg, Germany) is presented in this paper. The regions for the measurements were selected according to the histological view of the sample. By histological examination, samples were divided into five groups (from healthy to Gleason4, the most advanced stage of cancerogenesis). Data obtained in micro-SRIXE experiments on prostate cancer samples provide information about concentrations of certain elements in these groups. The resulting problem is to determine which elements' concentrations allow the researcher to discriminate between the different groups mentioned above. Linear discriminant analysis, a basic technique for feature extraction, was used in statistical analysis of the data. Our results indicate that the use of synchrotron radiation and discriminant analysis in the study of prostate cancer tissues provides information that can be key to better understanding of biomolecular functions.
Affiliation(s)
- Krzysztof Banaś
- Institute of Nuclear Physics PAN, Radzikowskiego 152, 31-342, Kraków, Poland.
|
864
|
Kosorok MR, Ma S. Marginal asymptotics for the “large $p$, small $n$” paradigm: With applications to microarray data. Ann Stat 2007. [DOI: 10.1214/009053606000001433] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.5] [Indexed: 11/19/2022]
|
865
|
Schott JR. A test for the equality of covariance matrices when the dimension is large relative to the sample sizes. Comput Stat Data Anal 2007. [DOI: 10.1016/j.csda.2007.03.004] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.6] [Indexed: 10/23/2022]
|
866
|
Ahn H, Moon H, Fazzari MJ, Lim N, Chen JJ, Kodell RL. Classification by ensembles from random partitions of high-dimensional data. Comput Stat Data Anal 2007. [DOI: 10.1016/j.csda.2006.12.043] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.6] [Indexed: 10/23/2022]
|
867
|
Comparison of Functions for Filtering Time Course Gene Expression Data with Flat Patterns. KOREAN JOURNAL OF APPLIED STATISTICS 2007. [DOI: 10.5351/kjas.2007.20.2.409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
868
|
Fan J, Niu Y. Selection and validation of normalization methods for c-DNA microarrays using within-array replications. Bioinformatics 2007; 23:2391-8. [PMID: 17660210 DOI: 10.1093/bioinformatics/btm361] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Normalization of microarray data is essential for multiple-array analyses. Several normalization protocols have been proposed based on different biological or statistical assumptions. A fundamental question is whether they have effectively normalized the arrays. In addition, for a given array, the question arises of how to choose the method that most effectively normalizes the microarray data. RESULTS We propose several techniques to compare the effectiveness of different normalization methods. We approach the problem by constructing statistics to test whether there are any systematic biases in the expression profiles among duplicated spots within an array. The test statistics involve estimating the genewise variances. This is accomplished by using several novel methods, including empirical Bayes methods for moderating the genewise variances and smoothing methods for aggregating variance information. P-values are estimated based on a normal or chi-square approximation. With estimated P-values, we can choose the most appropriate method to normalize a specific array and assess the extent to which the systematic biases due to the variations of experimental conditions have been removed. The effectiveness and validity of the proposed methods are convincingly illustrated by a carefully designed simulation study. The method is further illustrated by an application to human placenta cDNAs comprising a large number of clones with replications, a customized microarray experiment carrying just a few hundred genes on the study of the molecular roles of Interferons on tumor, and the Agilent microarrays carrying tens of thousands of total RNA samples in the MAQC project on the study of reproducibility, sensitivity and specificity of the data. AVAILABILITY Code to implement the method in the statistical package R is available from the authors.
Affiliation(s)
- Jianqing Fan
- Department of Operations Research and Financial Engineering Princeton University, Princeton, NJ 08544, USA.
|
869
|
Huang J, Gusnanto A, O'Sullivan K, Staaf J, Borg A, Pawitan Y. Robust smooth segmentation approach for array CGH data analysis. Bioinformatics 2007; 23:2463-9. [PMID: 17660206 DOI: 10.1093/bioinformatics/btm359] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Array comparative genomic hybridization (aCGH) provides a genome-wide technique to screen for copy number alteration. The existing segmentation approaches for analyzing aCGH data are based on modeling data as a series of discrete segments with unknown boundaries and unknown heights. Although the biological process of copy number alteration is discrete, in reality a variety of biological and experimental factors can cause the signal to deviate from a stepwise function. To take this into account, we propose a smooth segmentation (smoothseg) approach. METHODS To achieve a robust segmentation, we use a doubly heavy-tailed random-effect model. The first heavy-tailed structure on the errors deals with outliers in the observations, and the second deals with possible jumps in the underlying pattern associated with different segments. We develop a fast and reliable computational procedure based on the iterative weighted least-squares algorithm with band-limited matrix inversion. RESULTS Using simulated and real data sets, we demonstrate how smoothseg can aid in identification of regions with genomic alteration and in classification of samples. For the real data sets, smoothseg leads to smaller false discovery rate and classification error rate than the circular binary segmentation (CBS) algorithm. In a realistic simulation setting, smoothseg is better than wavelet smoothing and CBS in identification of regions with genomic alterations and better than CBS in classification of samples. For comparative analyses, we demonstrate that segmenting the t-statistics performs better than segmenting the data. AVAILABILITY The R package smoothseg to perform smooth segmentation is available from http://www.meb.ki.se/~yudpaw.
Affiliation(s)
- Jian Huang
- Statistical Laboratory, Department of Statistics, University College Cork, Ireland
|
870
|
Abstract
In high-dimensional data analysis, sliced inverse regression (SIR) has proven to be an effective dimension reduction tool and has enjoyed wide applications. The usual SIR, however, cannot work with problems where the number of predictors, p, exceeds the sample size, n, and can suffer when there is high collinearity among the predictors. In addition, the reduced dimensional space consists of linear combinations of all the original predictors and no variable selection is achieved. In this article, we propose a regularized SIR approach based on the least-squares formulation of SIR. The L2 regularization is introduced, and an alternating least-squares algorithm is developed, to enable SIR to work with n < p and highly correlated predictors. The L1 regularization is further introduced to achieve simultaneous reduction estimation and predictor selection. Both simulations and the analysis of a microarray expression data set demonstrate the usefulness of the proposed method.
Affiliation(s)
- Lexin Li
- Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
|
871
|
Modlich O, Munnes M. Statistical framework for gene expression data analysis. Methods Mol Biol 2007; 377:111-30. [PMID: 17634612 DOI: 10.1007/978-1-59745-390-5_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2023]
Abstract
DNA (mRNA) microarray, a highly promising technique with a variety of applications, can yield a wealth of data about each sample, well beyond the reach of every individual's comprehension. A need exists for statistical approaches that reliably eliminate insufficient and uninformative genes (probe sets) from further analysis while keeping all essentially important genes. This procedure does call for in-depth knowledge of the biological system to analyze. We conduct a comparative study of several statistical approaches on our own breast cancer Affymetrix microarray datasets. The strategy is designed primarily as a filter to select subsets of genes relevant for classification. We outline a general framework based on different statistical algorithms for determining a high-performing multigene predictor of response to the preoperative treatment of patients. We hope that our approach will provide straightforward and useful practical guidance for identification of genes, which can discriminate between biologically relevant classes in microarray datasets.
Affiliation(s)
- Olga Modlich
- Institute of Chemical Oncology, University of Düsseldorf, Düsseldorf, Germany
|
872
|
Park J, Wilbur JD, Ghosh JK, Nakatsu CH, Ackerman C. Selection of Binary Variables and Classification by Boosting. COMMUN STAT-SIMUL C 2007. [DOI: 10.1080/03610910701419729] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
873
|
Yip AM, Ng MK, Wu EH, Chan TF. Strategies for identifying statistically significant dense regions in microarray data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2007; 4:415-29. [PMID: 17666761 DOI: 10.1109/tcbb.2007.1022] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/16/2023]
Abstract
We propose and study the notion of dense regions for the analysis of categorized gene expression data and present some searching algorithms for discovering them. The algorithms can be applied to any categorical data matrices derived from gene expression level matrices. We demonstrate that dense regions are simple but useful and statistically significant patterns that can be used to 1) identify genes and/or samples of interest and 2) eliminate genes and/or samples corresponding to outliers, noise, or abnormalities. Some theoretical studies on the properties of the dense regions are presented which allow us to characterize dense regions into several classes and to derive tailor-made algorithms for different classes of regions. Moreover, an empirical simulation study on the distribution of the size of dense regions is carried out which is then used to assess the significance of dense regions and to derive effective pruning methods to speed up the searching algorithms. Real microarray data sets are employed to test our methods. Comparisons with six other well-known clustering algorithms using synthetic and real data are also conducted which confirm the superiority of our methods in discovering dense regions. The DRIFT code and a tutorial are available as supplemental material, which can be found on the Computer Society Digital Library at http://computer.org/tcbb/archives.htm.
Affiliation(s)
- Andy M Yip
- Department of Mathematics, National University of Singapore, 2, Science Drive 2, Singapore 117543, Singapore.
|
874
|
Huang D, Chow T. Effective gene selection method with small sample sets using gradient-based and point injection techniques. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2007; 4:467-475. [PMID: 17666766 DOI: 10.1109/tcbb.2007.1021] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 05/16/2023]
Abstract
Microarray gene expression data usually consist of a large number of genes. Among these genes, only a small fraction is informative for cancer diagnostic tests. This paper focuses on effective identification of informative genes. We analyze gene selection models from the perspective of optimization theory. As a result, a new strategy is designed to modify conventional search engines. Also, as overfitting is likely to occur in microarray data because of their small sample set, a point injection technique is developed to address the problem of overfitting. The proposed strategies have been evaluated on three kinds of cancer diagnosis. Our results show that the proposed strategies can improve the performance of gene selection substantially. The experimental results also indicate that the proposed methods are very robust under all the investigated cases.
|
875
|
Zhang R, Huang GB, Sundararajan N, Saratchandran P. Multi-category classification using an Extreme Learning Machine for microarray gene expression cancer diagnosis. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2007; 4:485-495. [PMID: 17666768 DOI: 10.1109/tcbb.2007.1012] [Citation(s) in RCA: 86] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/16/2023]
Abstract
In this paper, the recently developed Extreme Learning Machine (ELM) is used for direct multicategory classification problems in the cancer diagnosis area. ELM avoids problems such as local minima, improper learning rate and overfitting that are commonly faced by iterative learning methods, and it completes training very quickly. We have evaluated the multi-category classification performance of ELM on three benchmark microarray datasets for cancer diagnosis, namely, the GCM dataset, the Lung dataset and the Lymphoma dataset. The results indicate that ELM produces comparable or better classification accuracies with reduced training time and implementation complexity compared to artificial neural network methods like conventional back-propagation ANN and Linder's SANN, and Support Vector Machine methods like SVM-OVO and Ramaswamy's SVM-OVA. ELM also achieves better accuracies for classification of individual categories.
|
876
|
Mullins M, Perreard L, Quackenbush JF, Gauthier N, Bayer S, Ellis M, Parker J, Perou CM, Szabo A, Bernard PS. Agreement in Breast Cancer Classification between Microarray and Quantitative Reverse Transcription PCR from Fresh-Frozen and Formalin-Fixed, Paraffin-Embedded Tissues. Clin Chem 2007; 53:1273-9. [PMID: 17525107 DOI: 10.1373/clinchem.2006.083725] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Background: Microarray studies have identified different molecular subtypes of breast cancer with prognostic significance. To transition these classifications into the clinical laboratory, we have developed a real-time quantitative reverse transcription (qRT)-PCR assay to diagnose the biological subtypes of breast cancer from fresh-frozen (FF) and formalin-fixed, paraffin-embedded (FFPE) tissues.
Methods: We used microarray data from 124 breast samples as a training set for classifying tumors into 4 previously defined molecular subtypes: Luminal, HER2+/ER−, basal-like, and normal-like. We used the training set data in 2 different centroid-based algorithms to predict sample class on 35 breast tumors (test set) procured as FF and FFPE tissues (70 samples). We classified samples on the basis of large and minimized gene sets. We used the minimized gene set in a real-time qRT-PCR assay to predict sample subtype from the FF and FFPE tissues. We evaluated primer set performance between procurement methods by use of several measures of agreement.
Results: The centroid-based algorithms were in complete agreement in classification from FFPE tissues by use of qRT-PCR and the minimized “intrinsic” gene set (40 classifiers). There was 94% (33 of 35) concordance between the diagnostic algorithms when comparing subtype classification from FF tissue by use of microarray (large and minimized gene set) and qRT-PCR data. We found that the ratio of the diagonal SD to the dynamic range was the best method for assessing agreement on a gene-by-gene basis.
Conclusions: Centroid-based algorithms are robust classifiers for breast cancer subtype assignment across platforms and procurement conditions.
Affiliation(s)
- Michael Mullins
- Department of Pathology, University of Utah School of Medicine, Salt Lake City, UT, USA
|
877
|
Kobayashi H, Akitomi J, Fujii N, Kobayashi K, Altaf-Ul-Amin M, Kurokawa K, Ogasawara N, Kanaya S. The entire organization of transcription units on the Bacillus subtilis genome. BMC Genomics 2007; 8:197. [PMID: 17598888 PMCID: PMC1925097 DOI: 10.1186/1471-2164-8-197] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2007] [Accepted: 06/28/2007] [Indexed: 11/17/2022] Open
Abstract
Background In the post-genomic era, comprehension of cellular processes and systems requires global and non-targeted approaches to handle vast amounts of biological information. Results The present study predicts transcription units (TUs) in Bacillus subtilis, based on an integrated approach involving DNA sequence and transcriptome analyses. First, co-expressed gene clusters are predicted by calculating the Pearson correlation coefficients of adjacent genes for all the genes in a series that are transcribed in the same direction with no intervening gene transcribed in the opposite direction. Transcription factor (TF) binding sites are then predicted by detecting statistically significant TF binding sequences on the genome using a position weight matrix. This matrix is a convenient way to identify sites that are more highly conserved than others in the entire genome because any sequence that differs from a consensus sequence has a lower score. We identify genes regulated by each of the TFs by comparing gene expression between wild-type and TF mutants using a one-sided test. By applying the integrated approach to 11 σ factors and 17 TFs of B. subtilis, we are able to identify fewer candidates for genes regulated by the TFs than were identified using any single approach, and also detect the known TUs efficiently. Conclusion This integrated approach is, therefore, an efficient tool for narrowing searches for candidate genes regulated by TFs, identifying TUs, and estimating roles of the σ factors and TFs in cellular processes and functions of genes composing the TUs.
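The position weight matrix step described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy binding sites, the pseudocount, the uniform background frequency of 0.25, and the function names `build_pwm`, `score` and `scan`. The sketch only shows why a window that matches the consensus receives the highest log-odds score; it does not attempt the statistical significance assessment the abstract mentions.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites."""
    pwm = []
    for pos in range(len(sites[0])):
        # Start every base at the pseudocount so unseen bases do not give log(0).
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2(counts[base] / total / background)
                    for base in BASES})
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds for one window of the same width as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence):
    """Score every window of the sequence; returns (start position, score) pairs."""
    width = len(pwm)
    return [(i, score(pwm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]


# Hypothetical aligned binding sites with consensus TATAAT.
known_sites = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]
pwm = build_pwm(known_sites)
hits = scan(pwm, "GGGTATAATGGG")
best_position, best_score = max(hits, key=lambda hit: hit[1])
```

In this toy scan the best-scoring window is the one that spells out the consensus, and any window that departs from it scores lower, which is the property the abstract relies on.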
Affiliation(s)
- Hirokazu Kobayashi
- Department of Bioinformatics and Genomes, Graduate School of Information Sciences, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara 630-0192, Japan
| | - Joe Akitomi
- Department of Bioinformatics and Genomes, Graduate School of Information Sciences, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara 630-0192, Japan
| | - Nobuyuki Fujii
- Department of Bioinformatics and Genomes, Graduate School of Information Sciences, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara 630-0192, Japan
| | - Kazuo Kobayashi
- Department of Bioinformatics and Genomes, Graduate School of Information Sciences, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara 630-0192, Japan
| | - Md Altaf-Ul-Amin
- Department of Bioinformatics and Genomes, Graduate School of Information Sciences, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara 630-0192, Japan
| | - Ken Kurokawa
- Department of Bioinformatics and Genomes, Graduate School of Information Sciences, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara 630-0192, Japan
| | - Naotake Ogasawara
- Department of Bioinformatics and Genomes, Graduate School of Information Sciences, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara 630-0192, Japan
| | - Shigehiko Kanaya
- Department of Bioinformatics and Genomes, Graduate School of Information Sciences, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara 630-0192, Japan
|
878
|
Hardin J, Mitani A, Hicks L, VanKoten B. A robust measure of correlation between two genes on a microarray. BMC Bioinformatics 2007; 8:220. [PMID: 17592643 PMCID: PMC1929126 DOI: 10.1186/1471-2105-8-220] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2007] [Accepted: 06/25/2007] [Indexed: 11/14/2022] Open
Abstract
Background The underlying goal of microarray experiments is to identify gene expression patterns across different experimental conditions. Genes that are contained in a particular pathway or that respond similarly to experimental conditions could be co-expressed and show similar patterns of expression on a microarray. Using any of a variety of clustering methods or gene network analyses we can partition genes of interest into groups, clusters, or modules based on measures of similarity. Typically, Pearson correlation is used to measure distance (or similarity) before implementing a clustering algorithm. Pearson correlation is, however, quite susceptible to outliers, an unfortunate characteristic when dealing with microarray data (well known to be typically quite noisy). Results We propose a resistant similarity metric based on Tukey's biweight estimate of multivariate scale and location. The resistant metric is simply the correlation obtained from a resistant covariance matrix of scale. We give results which demonstrate that our correlation metric is much more resistant than the Pearson correlation while being more efficient than other nonparametric measures of correlation (e.g., Spearman correlation). Additionally, our method gives a systematic gene flagging procedure which is useful when dealing with large amounts of noisy data. Conclusion When dealing with microarray data, which are known to be quite noisy, robust methods should be used. Specifically, robust distances, including the biweight correlation, should be used in clustering and gene network analysis.
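To make the contrast with Pearson correlation concrete, here is a minimal sketch of the biweight midcorrelation, a simpler pairwise relative of the approach in the abstract. It is not the authors' estimator, which is based on Tukey's biweight estimate of multivariate scale and location. The tuning constant 9, the function names and the toy expression vectors are assumptions chosen for illustration.

```python
import math
from statistics import median


def _biweight_deviations(values):
    """Median-centred deviations, down-weighted by Tukey's biweight function."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; biweight weights are undefined")
    deviations = []
    for v in values:
        u = (v - med) / (9 * mad)
        # Points more than 9 MADs from the median get weight zero.
        weight = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        deviations.append((v - med) * weight)
    return deviations


def biweight_midcorrelation(x, y):
    a = _biweight_deviations(x)
    b = _biweight_deviations(y)
    numerator = sum(ai * bi for ai, bi in zip(a, b))
    denominator = math.sqrt(sum(ai * ai for ai in a) * sum(bi * bi for bi in b))
    return numerator / denominator


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    denominator = math.sqrt(sum((xi - mean_x) ** 2 for xi in x)
                            * sum((yi - mean_y) ** 2 for yi in y))
    return numerator / denominator


# Two hypothetical genes that agree on seven arrays; the eighth is an outlier.
gene_a = [1, 2, 3, 4, 5, 6, 7, 8]
gene_b = [1, 2, 3, 4, 5, 6, 7, -50]
```

On these toy vectors the single outlier drags the Pearson correlation below zero, while the biweight estimate stays clearly positive because the outlying array receives zero weight.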
Affiliation(s)
- Johanna Hardin
- Department of Mathematics, Pomona College, Claremont, CA 91711, USA
| | - Aya Mitani
- Department of Mathematics, Pitzer College, Claremont, CA 91711, USA
| | - Leanne Hicks
- Department of Statistics, University of Nebraska, Lincoln, NE 68588, USA
| | - Brian VanKoten
- Department of Mathematics, Lewis and Clark College, Portland, OR 97219, USA
|
879
|
Selecting dissimilar genes for multi-class classification, an application in cancer subtyping. BMC Bioinformatics 2007; 8:206. [PMID: 17573973 PMCID: PMC1914361 DOI: 10.1186/1471-2105-8-206] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2007] [Accepted: 06/16/2007] [Indexed: 11/10/2022] Open
Abstract
Background Gene expression microarray is a powerful technology for the genetic profiling of diseases and their associated treatments. Such a process involves a key step of identifying biomarkers, which are expected to be closely related to the disease. A most important use of these identified genes is to construct a classifier which can effectively diagnose disease and even recognize the disease subtypes. Binary classification, for example, diseased or healthy, in microarray data analysis has been successful, while multi-class classification, such as cancer subtyping, remains challenging. Results We target the challenging multi-class classification problem in microarray data analysis, especially cancer subtyping using gene expression microarray. We present a novel class discrimination strength vector to represent individual genes and introduce a new measurement to quantify the class discrimination strength difference between two genes. This new distance measure is employed in gene clustering, and subsequently the gene cluster information is exploited to select a set of genes which can be used to construct a sample classifier. We tested our method on four real cancer microarray datasets, each containing multiple subtypes of cancer patients. The experimental results show that the constructed classifiers all achieved a higher classification accuracy than the previously best classification results obtained on these four datasets. Additional tests show that the genes selected by our method are less correlated and that they all contribute statistically significantly to the more accurate cancer subtyping. Conclusion The proposed novel class discrimination strength vector is a better representation than the gene expression vector, in the sense that it can be used to effectively eliminate highly correlated but redundant genes for classifier construction.
Such a method can build a classifier that achieves a higher classification accuracy, which is demonstrated via cancer subtyping.
|
880
|
Abstract
Standard clustering algorithms when applied to DNA microarray data often tend to produce erroneous clusters. A major contributor to this divergence is the feature characteristic of microarray data sets that the number of predictors (genes) in such data far exceeds the number of samples by many orders of magnitude, with only a small percentage of predictors being truly informative with regards to the clustering while the rest merely add noise. An additional complication is that the predictors exhibit an unknown complex correlational configuration embedded in a small subspace of the entire predictor space. Under these conditions, standard clustering algorithms fail to find the true clusters even when applied in tandem with some sort of gene filtering or dimension reduction to reduce the number of predictors. We propose, as an alternative, a novel method for unsupervised classification of DNA microarray data. The method, which is based on the idea of aggregating results obtained from an ensemble of randomly resampled data (where both samples and genes are resampled), introduces a way of tilting the procedure so that the ensemble includes minimal representation from less important areas of the gene predictor space. The method produces a measure of dissimilarity between each pair of samples that can be used in conjunction with (a) a method like Ward's procedure to generate a cluster analysis and (b) multidimensional scaling to generate useful visualizations of the data. We call the dissimilarity measures ABC dissimilarities since they are obtained by aggregating bundles of clusters. An extensive comparison of several clustering methods using actual DNA microarray data convincingly demonstrates that classification using ABC dissimilarities offers significantly superior performance.
Affiliation(s)
- Dhammika Amaratunga
- Johnson & Johnson Pharmaceutical Research & Development LLC, Raritan, NJ 08869-0602, USA.
|
881
|
Zhou X, Tuck DP. MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data. Bioinformatics 2007; 23:1106-14. [PMID: 17494773 DOI: 10.1093/bioinformatics/btm036] [Citation(s) in RCA: 115] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
MOTIVATION Given the thousands of genes and the small number of samples, gene selection has emerged as an important research problem in microarray data analysis. Support Vector Machine-Recursive Feature Elimination (SVM-RFE) is one of a group of recently described algorithms which represent the state-of-the-art for gene selection. Just like SVM itself, SVM-RFE was originally designed to solve binary gene selection problems. Several groups have extended SVM-RFE to solve multiclass problems using one-versus-all techniques. However, the genes selected from one binary gene selection problem may reduce the classification performance in other binary problems. RESULTS In the present study, we propose a family of four extensions to SVM-RFE (called MSVM-RFE) to solve the multiclass gene selection problem, based on different frameworks of multiclass SVMs. By simultaneously considering all classes during the gene selection stages, our proposed extensions identify genes leading to more accurate classification.
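For readers unfamiliar with the binary SVM-RFE baseline that this abstract builds on, the following is a rough sketch of the elimination loop. It is not the multiclass MSVM-RFE method proposed in the paper. As an assumption made to keep the example dependency-free, the linear SVM is fitted by plain subgradient descent on the regularized hinge loss rather than by a proper SVM solver; the function names, hyperparameters and toy data are all hypothetical.

```python
def train_linear_svm(X, y, features, epochs=50, lr=0.1, lam=0.01):
    """Fit a linear SVM on the given feature subset by hinge-loss subgradient descent."""
    w = [0.0] * len(features)
    b = 0.0
    for _ in range(epochs):
        for row, label in zip(X, y):
            margin = label * (sum(wj * row[f] for wj, f in zip(w, features)) + b)
            violated = margin < 1
            for j, f in enumerate(features):
                # L2 shrinkage always applies; the data term only on margin violations.
                grad = lam * w[j] - (label * row[f] if violated else 0.0)
                w[j] -= lr * grad
            if violated:
                b += lr * label
    return w


def svm_rfe(X, y, n_keep=1):
    """Repeatedly retrain and drop the feature with the smallest squared weight."""
    features = list(range(len(X[0])))
    eliminated = []
    while len(features) > n_keep:
        w = train_linear_svm(X, y, features)
        worst = min(range(len(features)), key=lambda j: w[j] ** 2)
        eliminated.append(features.pop(worst))
    return features, eliminated


# Toy data: gene 0 is strongly informative, gene 1 is constant, gene 2 is weak.
X = [[2.0, 0.0, 0.5], [2.5, 0.0, 0.4], [-2.0, 0.0, -0.5], [-2.5, 0.0, -0.3]]
y = [1, 1, -1, -1]
kept, dropped = svm_rfe(X, y, n_keep=1)
```

The key design point is that the classifier is retrained after every removal, so each gene's weight is re-evaluated in the context of the genes that remain. Real implementations usually remove several genes per iteration to keep the cost manageable on thousands of genes.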
Affiliation(s)
- Xin Zhou
- Department of Pathology, Yale University School of Medicine, New Haven, Connecticut 06510, USA
|
882
|
Phan JH, Quo CF, Wang MD. Comparative study of microarray data for cancer research. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2007; 2004:2960-3. [PMID: 17270899 DOI: 10.1109/iembs.2004.1403840] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
In comparison to traditional "single-gene" study methods such as reverse transcriptase-polymerase chain reaction (RT-PCR), microarray technology can produce high-throughput gene expression data simultaneously. The advancement of this technology also presents a big challenge. In cancer research, the issue is how to identify the signature genes, or biomarkers, associated with a particular cancer in order to perform precise, objective and systematic cancer diagnosis and treatment. More specifically, the goal is to accurately analyze and interpret the resulting large amount of gene expression data with a relatively small patient sample size. As such, we have been developing a novel multischeme system that can derive an optimal decision based on the best utilization of gene expression data features and of clinical and biological knowledge. In this paper, we report the results of the first phase of development of our novel system, which uses unsupervised clustering methods to discover gene relationships and knowledge-based supervised classification to obtain highly accurate predictions in cancer diagnosis and prognosis studies. This work lays the foundation for our next step, a drug target study.
Affiliation(s)
- John H Phan
- Wallace H. Coulter Dept. of Biomed. Eng., Emory Univ., Atlanta, GA, USA
|
883
|
Ooi CH, Chetty M, Teng SW. Characteristics of predictor sets found using differential prioritization. Algorithms Mol Biol 2007; 2:7. [PMID: 17547742 PMCID: PMC1920513 DOI: 10.1186/1748-7188-2-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2006] [Accepted: 06/04/2007] [Indexed: 12/31/2022] Open
Abstract
Background Feature selection plays an undeniably important role in classification problems involving high dimensional datasets such as microarray datasets. For filter-based feature selection, two well-known criteria used in forming predictor sets are relevance and redundancy. However, there is a third criterion which is at least as important as the other two in affecting the efficacy of the resulting predictor sets. This criterion is the degree of differential prioritization (DDP), which varies the emphases on relevance and redundancy depending on the value of the DDP. Previous empirical works on publicly available microarray datasets have confirmed the effectiveness of the DDP in molecular classification. We now propose to establish the fundamental strengths and merits of the DDP-based feature selection technique. This is to be done through a simulation study which involves vigorous analyses of the characteristics of predictor sets found using different values of the DDP from toy datasets designed to mimic real-life microarray datasets. Results A simulation study employing analytical measures such as the distance between classes before and after transformation using principal component analysis is implemented on toy datasets. From these analyses, the necessity of adjusting the differential prioritization based on the dataset of interest is established. This conclusion is supported by comparisons against both simplistic rank-based selection and state-of-the-art equal-priorities scoring methods, which demonstrates the superiority of the DDP-based feature selection technique. Reapplying similar analyses to real-life multiclass microarray datasets provides further confirmation of our findings and of the significance of the DDP for practical applications. 
Conclusion The findings have been achieved based on analytical evaluations, not empirical evaluation involving classifiers, thus providing further basis for the usefulness of the DDP and validating the need for unequal priorities on relevance and redundancy during feature selection for microarray datasets, especially highly multiclass datasets.
Affiliation(s)
- Chia Huey Ooi
- Gippsland School of Information Technology, Monash University, Churchill, VIC 3842, Australia
| | - Madhu Chetty
- Gippsland School of Information Technology, Monash University, Churchill, VIC 3842, Australia
| | - Shyh Wei Teng
- Gippsland School of Information Technology, Monash University, Churchill, VIC 3842, Australia
|
884
|
Liao JG, Chin KV. Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics 2007; 23:1945-51. [PMID: 17540680 DOI: 10.1093/bioinformatics/btm287] [Citation(s) in RCA: 86] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Logistic regression is a standard method for building prediction models for a binary outcome and has been extended for disease classification with microarray data by many authors. A feature (gene) selection step, however, must be added to penalized logistic modeling due to a large number of genes and a small number of subjects. Model selection for this two-step approach requires new statistical tools because prediction error estimation ignoring the feature selection step can be severely downward biased. Generic methods such as cross-validation and non-parametric bootstrap can be very ineffective due to the big variability in the prediction error estimate. RESULTS We propose a parametric bootstrap model for more accurate estimation of the prediction error that is tailored to the microarray data by borrowing from the extensive research in identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually helps little in further reducing the prediction error. Application to Golub's leukemia data and our own cervical cancer data leads to highly accurate prediction models. AVAILABILITY R library GeneLogit at http://geocities.com/jg_liao
Affiliation(s)
- J G Liao
- Drexel University School of Public Health, Philadelphia, PA 19102, USA.
|
885
|
Asyali MH. Gene expression profile class prediction using linear Bayesian classifiers. Comput Biol Med 2007; 37:1690-9. [PMID: 17517385 DOI: 10.1016/j.compbiomed.2007.04.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2006] [Revised: 03/23/2007] [Accepted: 04/09/2007] [Indexed: 11/16/2022]
Abstract
Owing to recent advances in DNA microarray technology, the diagnostic category of tissue samples can be predicted with high accuracy from gene expression profiles. In this study, we discuss shortcomings of some existing gene expression profile classification methods and propose a new approach based on linear Bayesian classifiers. In our approach, we first construct gene-level linear classifiers to identify genes that provide high class-prediction accuracies, i.e., low error rates. After this screening phase, starting with the gene that offers the lowest error rate, we construct a multi-dimensional linear classifier by incorporating the next best-performing genes until the prediction error reaches its minimum (0, if possible). When we compared the classification performance of our approach against prediction analysis of microarrays (PAM) and support vector machine (SVM) based approaches, we found that our method outperforms PAM and produces results comparable to SVM. In addition, we observed that the gene selection scheme of PAM could be misleading. Although SVM achieves relatively higher prediction performance, it has two major disadvantages: complexity and lack of insight into important genes. Our intuitive approach offers competitive performance and also an efficient means for finding important genes.
Affiliation(s)
- Musa H Asyali
- Department of Computer Engineering, Yasar University, Kazim Dirik Mah. 364 Sok. No: 5, Bornova 35500, Izmir, Turkey.
|
886
|
Li H, Zhang K, Jiang T. Robust and accurate cancer classification with gene expression profiling. PROCEEDINGS. IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE 2007:310-21. [PMID: 16447988 DOI: 10.1109/csb.2005.49] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable us to diagnose tumors precisely and systematically. However, the classification task in this context is very challenging because of the curse of dimensionality and the small sample size problem. In this paper, we propose a novel method to solve these two problems. Our method is able to map gene expression data into a very low dimensional space and thus meets the recommended samples to features per class ratio. As a result, it can be used to classify new samples robustly with low and trustable (estimated) error rates. The method is based on linear discriminant analysis (LDA). However, the conventional LDA requires that the within-class scatter matrix S(w) be nonsingular. Unfortunately, S(w) is always singular in the case of cancer classification due to the small sample size problem. To overcome this problem, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution to optimize Fisher's criterion. GLDA is mathematically well-founded and coincides with the conventional LDA when S(w) is nonsingular. Different from the conventional LDA, GLDA does not assume the nonsingularity of S(w), and thus naturally solves the small sample size problem. To accommodate the high dimensionality of scatter matrices, a fast algorithm of GLDA is also developed. Our extensive experiments on seven public cancer datasets show that the method performs well. Especially on some difficult instances that have very small samples to genes per class ratios, our method achieves much higher accuracies than widely used classification methods such as support vector machines, random forests, etc.
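As background to the singularity problem this abstract describes, the standard LDA quantities can be written out as follows. This is a sketch in common textbook notation, not the paper's own formulation: S_w here corresponds to the abstract's S(w), and the symbols c (number of classes), n (number of samples), n_k (samples in class k), p (number of genes), \mu_k (class mean) and \mu (overall mean) are assumptions introduced for illustration. The trace form of Fisher's criterion shown is one of several equivalent variants in common use.

```latex
% Within-class and between-class scatter matrices
S_w = \sum_{k=1}^{c} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^{\top},
\qquad
S_b = \sum_{k=1}^{c} n_k \, (\mu_k - \mu)(\mu_k - \mu)^{\top}

% Fisher's criterion for a projection matrix W (one common form)
J(W) = \operatorname{tr}\!\left( (W^{\top} S_w W)^{-1} \, W^{\top} S_b W \right)

% Why conventional LDA breaks down on microarray data
\operatorname{rank}(S_w) \le n - c < p \quad \text{whenever } p > n - c
```

Because S_w is a p-by-p matrix built from at most n - c linearly independent deviations, it cannot be inverted when genes far outnumber samples, which is the situation the generalized method is designed to handle.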
Affiliation(s)
- Haifeng Li
- Dept. of Computer Science, University of California at Riverside, Riverside, CA 92521, USA.
|
887
|
Recursive cluster elimination (RCE) for classification and feature selection from gene expression data. BMC Bioinformatics 2007; 8:144. [PMID: 17474999 PMCID: PMC1877816 DOI: 10.1186/1471-2105-8-144] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2006] [Accepted: 05/02/2007] [Indexed: 11/10/2022] Open
Abstract
Background Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. The selection of those genes that are important for distinguishing the different sample classes being compared, poses a challenging problem in high dimensional data analysis. We describe a new procedure for selecting significant genes as recursive cluster elimination (RCE) rather than recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE. Results We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Utilization of gene clusters, rather than individual genes, enhances the supervised classification accuracy of the same data as compared to the accuracy when either SVM or Penalized Discriminant Analysis (PDA) with recursive feature elimination (SVM-RFE and PDA-RFE) are used to remove genes based on their individual discriminant weights. Conclusion SVM-RCE provides improved classification accuracy with complex microarray data sets when it is compared to the classification accuracy of the same datasets using either SVM-RFE or PDA-RFE. 
SVM-RCE identifies clusters of correlated genes that when considered together provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups. Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.
|
888
|
Zhang X, Wei D, Yap Y, Li L, Guo S, Chen F. Mass spectrometry-based "omics" technologies in cancer diagnostics. MASS SPECTROMETRY REVIEWS 2007; 26:403-31. [PMID: 17405143 DOI: 10.1002/mas.20132] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/14/2023]
Abstract
Many "omics" techniques have been developed for one goal: biomarker discovery and early diagnosis of human cancers. A comprehensive review of mass spectrometry-based "omics" approaches performed on various biological samples for molecular diagnosis of human cancers is presented in this article. Furthermore, the existing and potential problems/solutions (both de facto experimental and bioinformatic challenges) and future prospects are extensively discussed. Although the use of present omic methods as diagnostic tools is still in its infancy and consequently not ready for immediate clinical use, it can be envisaged that "omics"-based cancer diagnostics will gradually enter the clinic in the next 10 years as an important supplement to current clinical diagnostics.
Affiliation(s)
- Xuewu Zhang
- College of Light Industry and Food Sciences, South China University of Technology, Guangzhou, China.
|
889
|
Zhou X, Yu T, Cole SW, Wong DTW. Advancement in characterization of genomic alterations for improved diagnosis, treatment and prognostics in cancer. Expert Rev Mol Diagn 2007; 6:39-50. [PMID: 16359266 DOI: 10.1586/14737159.6.1.39] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 01/09/2023]
Abstract
Most human cancers are characterized by genetic instabilities. These instabilities manifest themselves as a series of genetic alterations, including discrete mutations and chromosomal aberrations. With the human genome deciphered, high-throughput technologies are rapidly advancing the field to generate genome-wide gene expression and mutation profiles that are highly correlative of biologic and disease phenotypes. While recent advancement in comprehensive genomic characterization presents an unprecedented opportunity for advancing the treatment of cancer, there are still many challenges that need to be overcome before we can fully utilize genomic markers and targets for cancer prediction, diagnostics, treatment and prognostics. This review describes recent advances in comprehensive genomic characterization at the DNA level, and considers some of the challenges that remain for defining the precise genomic portrait of tumors. Potential solutions that may help overcome these challenges are also offered.
Affiliation(s)
- Xiaofeng Zhou
- Dental Research Institute, School of Dentistry & Jonsson Comprehensive Cancer Center, University of California at Los Angeles, Los Angeles, CA, USA.
|
890
|
Lu Y, Tian Q, Liu F, Sanchez M, Wang Y. Interactive semisupervised learning for microarray analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2007; 4:190-203. [PMID: 17473313 DOI: 10.1109/tcbb.2007.070206] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 05/15/2023]
Abstract
Microarray technology has generated vast amounts of gene expression data with distinct patterns. Based on the premise that genes of correlated functions tend to exhibit similar expression patterns, various machine learning methods have been applied to capture these specific patterns in microarray data. However, the discrepancy between the rich expression profiles and the limited knowledge of gene functions has been a major hurdle to the understanding of cellular networks. To bridge this gap so as to properly comprehend and interpret expression data, we introduce Relevance Feedback to microarray analysis and propose an interactive learning framework to incorporate the expert knowledge into the decision module. In order to find a good learning method and solve two intrinsic problems in microarray data, high dimensionality and small sample size, we also propose a semisupervised learning algorithm: Kernel Discriminant-EM (KDEM). This algorithm efficiently utilizes a large set of unlabeled data to compensate for the insufficiency of a small set of labeled data and it extends the linear algorithm in Discriminant-EM (DEM) to a kernel algorithm to handle nonlinearly separable data in a lower dimensional space. The Relevance Feedback technique and KDEM together construct an efficient and effective interactive semisupervised learning framework for microarray analysis. Extensive experiments on the yeast cell cycle regulation data set and Plasmodium falciparum red blood cell cycle data set show the promise of this approach.
Affiliation(s)
- Yijuan Lu
- Department of Computer Science, University of Texas at San Antonio, Texas 78249-1644, USA.
|
891
|
Bontempi G. A blocking strategy to improve gene selection for classification of gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2007; 4:293-300. [PMID: 17473321 DOI: 10.1109/tcbb.2007.1014] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 05/15/2023]
Abstract
Because of high dimensionality, machine learning algorithms typically rely on feature selection techniques in order to perform effective classification in microarray gene expression data sets. However, the large number of features compared to the number of samples makes the task of feature selection computationally hard and prone to errors. This paper interprets feature selection as a task of stochastic optimization, where the goal is to select among an exponential number of alternative gene subsets the one expected to return the highest generalization in classification. Blocking is an experimental design strategy which produces similar experimental conditions to compare alternative stochastic configurations in order to be confident that observed differences in accuracy are due to actual differences rather than to fluctuations and noise effects. We propose an original blocking strategy for improving feature selection which aggregates in a paired way the validation outcomes of several learning algorithms to assess a gene subset and compare it to others. This is a novelty with respect to conventional wrappers, which commonly adopt a sole learning algorithm to evaluate the relevance of a given set of variables. The rationale of the approach is that, by increasing the amount of experimental conditions under which we validate a feature subset, we can lessen the problems related to the scarcity of samples and consequently come up with a better selection. The paper shows that the blocking strategy significantly improves the performance of a conventional forward selection for a set of 16 publicly available cancer expression data sets. The experiments involve six different classifiers and show that improvements take place independent of the classification algorithm used after the selection step. 
Two further validations based on available biological annotation support the claim that blocking strategies in feature selection may improve the accuracy and the quality of the solution. The first validation is based on retrieving PubMed abstracts associated with the selected genes and matching them to regular expressions describing the biological phenomenon underlying the expression data sets. The second is based on the Bioconductor package GOstats, which is used to perform Gene Ontology statistical analysis.
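The benefit of comparing gene subsets under matched conditions can be illustrated with a small sketch. It contrasts a paired statistic, computed on per-learner differences, with an unpaired one that ignores the matching. The error rates are invented for illustration, and the two statistics are generic textbook forms rather than the aggregation rule used in the paper.

```python
import math
from statistics import mean, stdev


def paired_t(errors_a, errors_b):
    """t statistic on per-condition differences (the blocked comparison)."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def unpaired_t(errors_a, errors_b):
    """Welch-style t statistic that ignores the pairing between conditions."""
    var_a = stdev(errors_a) ** 2 / len(errors_a)
    var_b = stdev(errors_b) ** 2 / len(errors_b)
    return (mean(errors_a) - mean(errors_b)) / math.sqrt(var_a + var_b)


# Hypothetical validation error rates of two gene subsets,
# each assessed by the same six learning algorithms.
subset_a = [0.30, 0.22, 0.41, 0.18, 0.35, 0.27]
subset_b = [0.27, 0.20, 0.37, 0.16, 0.31, 0.25]

print("paired:  ", round(paired_t(subset_a, subset_b), 2))
print("unpaired:", round(unpaired_t(subset_a, subset_b), 2))
```

Because most of the variation in these made-up numbers comes from the learner rather than from the gene subset, pairing removes it and the small but consistent advantage of the second subset becomes much easier to detect.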
Affiliation(s)
- Gianluca Bontempi
- Département d'Informatique, Université Libre de Bruxelles, Bruxelles, Belgium.
|
892
|
Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics 2007; 8:111. [PMID: 17397530 PMCID: PMC1858704 DOI: 10.1186/1471-2105-8-111] [Citation(s) in RCA: 79] [Impact Index Per Article: 4.4] [Received: 08/27/2006] [Accepted: 03/30/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The definition of a distance measure plays a key role in the evaluation of different clustering solutions of gene expression profiles. In this empirical study we compare different clustering solutions when using the Mutual Information (MI) measure versus the well-known Euclidean distance and Pearson correlation coefficient. RESULTS Relying on several public gene expression datasets, we evaluate the homogeneity and separation scores of different clustering solutions. It was found that the use of the MI measure yields a more significant differentiation among erroneous clustering solutions. The proposed measure was also used to analyze the performance of several known clustering algorithms. A comparative study of these algorithms reveals that their "best solutions" are ranked almost oppositely when using different distance measures, despite the correspondence found between these measures when analysing the averaged scores of groups of solutions. CONCLUSION In view of the results, further attention should be paid to the selection of a proper distance measure for analyzing the clustering of gene expression data.
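As a rough illustration of an MI-based distance between two expression profiles, the sketch below bins each profile into equal-width intervals, estimates the mutual information from the bin counts, and normalizes by the joint entropy. The binning scheme, the number of bins and the normalization are assumptions made for this example; the abstract does not state which estimator the authors used.

```python
import math
from collections import Counter


def discretize(values, bins=3):
    """Map continuous expression values to equal-width bin indices."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete label sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def mi_distance(profile_a, profile_b, bins=3):
    """Distance in [0, 1]: 1 - I(A;B) / H(A,B); 0 means fully dependent."""
    a = discretize(profile_a, bins)
    b = discretize(profile_b, bins)
    joint = entropy(list(zip(a, b)))
    if joint == 0:
        return 0.0
    return 1.0 - mutual_information(a, b) / joint


# An anti-correlated pair is as "close" as an identical pair under this measure.
print(mi_distance([1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]))
```

Unlike Euclidean distance, this measure treats any one-to-one relationship between the binned profiles as maximal similarity, which is one reason rankings of clustering solutions can differ across distance measures.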
|
893
|
Abstract
MOTIVATION The nearest shrunken centroid (NSC) method has been successfully applied in many DNA-microarray classification problems. The NSC uses 'shrunken' centroids as prototypes for each class and identifies subsets of genes that best characterize each class. Classification is then made to the nearest (shrunken) centroid. The NSC is very easy to implement and very easy to interpret; however, it has drawbacks. RESULTS We show that the NSC method can be interpreted in the framework of LASSO regression. Based on that, we consider two new methods, adaptive L(infinity)-norm penalized NSC (ALP-NSC) and adaptive hierarchically penalized NSC (AHP-NSC), with two different penalty functions for microarray classification, which improve over the NSC. Unlike the L(1)-norm penalty used in LASSO, the penalty terms that we consider make use of the fact that parameters belonging to one gene should be treated as a natural group. Numerical results indicate that the two new methods tend to remove irrelevant genes more effectively and provide better classification results than the L(1)-norm approach. AVAILABILITY R code for the ALP-NSC and the AHP-NSC algorithms is available from the authors upon request.
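The basic shrinkage idea behind the NSC can be sketched as follows. This is a deliberately simplified version: each class centroid is pulled toward the overall centroid by soft-thresholding, but the per-gene standardization and class priors of the full method are omitted, and the ALP-NSC and AHP-NSC penalties proposed in the paper are not implemented. The function names and toy data are made up for illustration.

```python
def soft_threshold(value, delta):
    """Shrink value toward zero by delta; values within delta become zero."""
    if value > delta:
        return value - delta
    if value < -delta:
        return value + delta
    return 0.0


def shrunken_centroids(samples, labels, delta):
    """Return {class: centroid} with class offsets from the overall mean shrunk."""
    n_genes = len(samples[0])
    overall = [sum(s[g] for s in samples) / len(samples) for g in range(n_genes)]
    centroids = {}
    for cls in set(labels):
        members = [s for s, l in zip(samples, labels) if l == cls]
        centroids[cls] = [
            overall[g]
            + soft_threshold(sum(m[g] for m in members) / len(members) - overall[g], delta)
            for g in range(n_genes)
        ]
    return centroids


def classify(sample, centroids):
    """Assign the sample to the class with the nearest (shrunken) centroid."""
    return min(
        centroids,
        key=lambda c: sum((x - m) ** 2 for x, m in zip(sample, centroids[c])),
    )


# Toy data: gene 0 is informative, gene 1 is not.
samples = [[5.0, 1.0], [5.2, 1.1], [1.0, 1.0], [0.8, 0.9]]
labels = ["A", "A", "B", "B"]
centroids = shrunken_centroids(samples, labels, delta=0.5)
print(classify([4.5, 1.0], centroids))
```

With this threshold, the offsets for gene 1 shrink to zero in both classes, so that gene no longer influences the classification, which is how the method performs gene selection.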
Affiliation(s)
- Sijian Wang
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
|
894
|
Ma S, Song X, Huang J. Supervised group Lasso with applications to microarray data analysis. BMC Bioinformatics 2007; 8:60. [PMID: 17316436 PMCID: PMC1821041 DOI: 10.1186/1471-2105-8-60] [Citation(s) in RCA: 103] [Impact Index Per Article: 5.7] [Received: 10/16/2006] [Accepted: 02/22/2007] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND A tremendous amount of effort has been devoted to identifying genes for diagnosis and prognosis of diseases using microarray gene expression data. It has been demonstrated that gene expression data have cluster structure, where the clusters consist of co-regulated genes which tend to have coordinated functions. However, most available statistical methods for gene selection do not take into consideration the cluster structure. RESULTS We propose a supervised group Lasso approach that takes into account the cluster structure in gene expression data for gene selection and predictive model building. For gene expression data without biological cluster information, we first divide genes into clusters using the K-means approach and determine the optimal number of clusters using the Gap method. The supervised group Lasso consists of two steps. In the first step, we identify important genes within each cluster using the Lasso method. In the second step, we select important clusters using the group Lasso. Tuning parameters are determined using V-fold cross validation at both steps to allow for further flexibility. Prediction performance is evaluated using leave-one-out cross validation. We apply the proposed method to disease classification and survival analysis with microarray data. CONCLUSION We analyze four microarray data sets using the proposed approach: two cancer data sets with binary cancer occurrence as outcomes and two lymphoma data sets with survival outcomes. The results show that the proposed approach is capable of identifying a small number of influential gene clusters and important genes within those clusters, and has better prediction performance than existing methods.
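What lets a group Lasso select whole clusters is that its penalty acts on the Euclidean norm of each cluster's coefficient vector. A minimal sketch of the resulting block soft-thresholding step is given below. It assumes the common convention of weighting each group's penalty by the square root of its size, and it shows a single shrinkage step only, not the authors' two-step procedure or their tuning by cross validation. The names `group_soft_threshold`, `groups` and `lam` are hypothetical.

```python
import math


def group_soft_threshold(coefs, groups, lam):
    """Shrink each cluster's coefficient vector; weak clusters become exactly zero.

    coefs:  list of coefficients, one per gene
    groups: {cluster_id: [gene indices]}
    lam:    penalty strength (assumed weighting: lam * sqrt(cluster size))
    """
    result = list(coefs)
    for members in groups.values():
        norm = math.sqrt(sum(coefs[i] ** 2 for i in members))
        threshold = lam * math.sqrt(len(members))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        for i in members:
            result[i] = scale * coefs[i]
    return result


coefs = [3.0, 4.0, 0.1, -0.1]
groups = {"cluster_1": [0, 1], "cluster_2": [2, 3]}
shrunk = group_soft_threshold(coefs, groups, lam=1.0)
print(shrunk)
```

Here the second cluster is removed as a unit because its overall norm falls below the threshold, while the first cluster is kept with all of its coefficients shrunk by the same factor.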
Affiliation(s)
- Shuangge Ma
- Department of Epidemiology and Public Health, Yale University, New Haven, CT 06520, USA
- Xiao Song
- Department of Health Administration, Biostatistics and Epidemiology, University of Georgia, Athens, GA 30602, USA
- Jian Huang
- Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242, USA
|
895
|
Zhang X, Li L, Wei D, Yap Y, Chen F. Moving cancer diagnostics from bench to bedside. Trends Biotechnol 2007; 25:166-73. [PMID: 17316853 DOI: 10.1016/j.tibtech.2007.02.006] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.3] [Received: 09/25/2006] [Revised: 01/11/2007] [Accepted: 02/08/2007] [Indexed: 12/27/2022]
Abstract
To improve treatment and reduce mortality from cancer, a key task is to detect the disease as early as possible. To achieve this, many new technologies have been developed for biomarker discovery and validation. This review provides an overview of omics technologies in biomarker discovery and cancer detection, and highlights recent applications and future trends in cancer diagnostics. Although the present omic methods are not ready for immediate clinical use as diagnostic tools, it can be envisaged that simple, fast, robust, portable and cost-effective clinical diagnosis systems could be available in the near future for home and bedside use.
Affiliation(s)
- Xuewu Zhang
- College of Light Industry and Food Sciences, South China University of Technology, 381 Wushan Road, Guangzhou 510640, China.
|
896
|
Pinsky PF. Scaling of True and Apparent ROC AUC with Number of Observations and Number of Variables. COMMUN STAT-SIMUL C 2007. [DOI: 10.1081/sac-200068366] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/03/2022]
Affiliation(s)
- Paul F. Pinsky
- Division of Cancer Prevention, National Cancer Institute, Bethesda, Maryland, USA
|
897
|
Sander O, Sing T, Sommer I, Low AJ, Cheung PK, Harrigan PR, Lengauer T, Domingues FS. Structural descriptors of gp120 V3 loop for the prediction of HIV-1 coreceptor usage. PLoS Comput Biol 2007; 3:e58. [PMID: 17397254 PMCID: PMC1848001 DOI: 10.1371/journal.pcbi.0030058] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.4] [Received: 09/15/2006] [Accepted: 02/08/2007] [Indexed: 12/12/2022] Open
Abstract
HIV-1 cell entry commonly uses, in addition to CD4, one of the chemokine receptors CCR5 or CXCR4 as coreceptor. Knowledge of coreceptor usage is critical for monitoring disease progression as well as for supporting therapy with the novel drug class of coreceptor antagonists. Predictive methods for inferring coreceptor usage based on the third hypervariable (V3) loop region of the viral gene coding for the envelope protein gp120 can provide us with these monitoring facilities while avoiding expensive phenotypic tests. All simple heuristics (such as the 11/25 rule) as well as statistical learning methods proposed to date predict coreceptor usage based on sequence features of the V3 loop exclusively. Here, we show, based on a recently resolved structure of gp120 with an untruncated V3 loop, that using structural information on the V3 loop in combination with sequence features of V3 variants improves prediction of coreceptor usage. In particular, we propose a distance-based descriptor of the spatial arrangement of physicochemical properties that increases discriminative performance. For a fixed specificity of 0.95, a sensitivity of 0.77 was achieved, improving further to 0.80 when combined with a sequence-based representation using amino acid indicators. This compares favorably with the sensitivities of 0.62 for the traditional 11/25 rule and 0.73 for a prediction based on sequence information as input to a support vector machine and constitutes a statistically significant improvement. A detailed analysis and interpretation of structural features important for classification shows the relevance of several specific hydrogen-bond donor sites and aliphatic side chains to coreceptor specificity towards CCR5 or CXCR4. Furthermore, an analysis of side chain orientation of the specificity-determining residues suggests a major role of one side of the V3 loop in the selection of the coreceptor. 
The proposed method constitutes the first approach to an improved prediction of coreceptor usage based on an original integration of structural bioinformatics methods with statistical learning.
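The abstract reports sensitivities at a fixed specificity of 0.95. A minimal sketch of how such a figure can be computed from classifier scores is shown below. The scores are invented, the choice of which coreceptor class counts as "positive" is an assumption, and the function name `sensitivity_at_specificity` is hypothetical.

```python
import math


def sensitivity_at_specificity(scores, labels, target_specificity=0.95):
    """Pick the lowest threshold reaching the target specificity, then report sensitivity.

    labels: 1 = positive class, 0 = negative class; a sample is called
    positive when its score is strictly above the threshold.
    """
    negatives = sorted(s for s, y in zip(scores, labels) if y == 0)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    # Number of negatives that must fall at or below the threshold
    # (small tolerance guards against floating-point round-up).
    k = max(1, math.ceil(target_specificity * len(negatives) - 1e-9))
    threshold = negatives[k - 1]
    specificity = sum(s <= threshold for s in negatives) / len(negatives)
    sensitivity = sum(s > threshold for s in positives) / len(positives)
    return threshold, specificity, sensitivity


# Made-up scores: 20 negatives spread over [0, 0.95], 4 positives.
neg_scores = [i / 20 for i in range(20)]
pos_scores = [0.5, 0.92, 0.97, 0.99]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

threshold, specificity, sensitivity = sensitivity_at_specificity(scores, labels)
print(threshold, specificity, sensitivity)
```

Fixing specificity first makes the sensitivities of different predictors directly comparable, since each is evaluated at the same false-positive rate.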
Affiliation(s)
- Oliver Sander
- Max-Planck-Institute for Informatics, Saarbrücken, Germany.
|
898
|
Ooi CH, Chetty M, Teng SW. Differential prioritization in feature selection and classifier aggregation for multiclass microarray datasets. Data Min Knowl Discov 2007. [DOI: 10.1007/s10618-006-0055-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
|
899
|
Nicolau M, Tibshirani R, Børresen-Dale AL, Jeffrey SS. Disease-specific genomic analysis: identifying the signature of pathologic biology. Bioinformatics 2007; 23:957-65. [PMID: 17277331 DOI: 10.1093/bioinformatics/btm033] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.9] [Indexed: 01/29/2023]
Abstract
MOTIVATION Genomic high-throughput technology generates massive data, providing opportunities to understand countless facets of the functioning genome. It also raises profound issues in identifying data relevant to the biology being studied. RESULTS We introduce a method for the analysis of pathologic biology that unravels the disease characteristics of high dimensional data. The method, disease-specific genomic analysis (DSGA), is intended to precede standard techniques like clustering or class prediction, and enhance their performance and ability to detect disease. DSGA measures the extent to which the disease deviates from a continuous range of normal phenotypes, and isolates the aberrant component of data. In several microarray cancer datasets, we show that DSGA outperforms standard methods. We then use DSGA to highlight a novel subdivision of an important class of genes in breast cancer, the estrogen receptor (ER) cluster. We also identify new markers distinguishing ductal and lobular breast cancers. Although our examples focus on microarrays, DSGA generalizes to any high dimensional genomic/proteomic data.
Affiliation(s)
- Monica Nicolau
- Department of Surgery, Stanford University School of Medicine, Stanford University, Stanford, CA, USA
|
900
|
Abstract
The intent of this article is to discuss some of the complexities of toxicogenomics data and the statistical design and analysis issues that arise in the course of conducting a toxicogenomics study. We also describe a procedure for classifying compounds into various hepatotoxicity classes based on gene expression data. The methodology involves first classifying a compound as toxic or nontoxic and subsequently classifying the toxic compounds into the hepatotoxicity classes, based on votes by binary classifiers. The binary classifiers are constructed by using genes selected to best elicit differences between the two classes. We show that the gene selection strategy improves the misclassification error rates and also delivers gene pathways that exhibit biological relevance.
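The two-stage scheme described here, a toxic versus nontoxic decision followed by voting among binary classifiers, can be sketched as below. The threshold rules stand in for trained classifiers, and the gene names, class names and the function `classify_compound` are all hypothetical. Ties are broken by whichever class was voted for first, which is an arbitrary choice made for this example.

```python
from collections import Counter


def classify_compound(profile, is_toxic, binary_classifiers):
    """Stage 1: toxic vs nontoxic. Stage 2: majority vote over binary classifiers."""
    if not is_toxic(profile):
        return "nontoxic"
    votes = Counter(clf(profile) for clf in binary_classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical threshold rules standing in for trained classifiers.
def is_toxic(profile):
    return profile["gene_x"] > 1.0


def class_1_vs_2(profile):
    return "class_1" if profile["gene_y"] > 0.5 else "class_2"


def class_1_vs_3(profile):
    return "class_1" if profile["gene_z"] > 0.5 else "class_3"


def class_2_vs_3(profile):
    return "class_2" if profile["gene_y"] < 0.2 else "class_3"


classifiers = [class_1_vs_2, class_1_vs_3, class_2_vs_3]

toxic_profile = {"gene_x": 2.0, "gene_y": 0.9, "gene_z": 0.8}
safe_profile = {"gene_x": 0.3, "gene_y": 0.9, "gene_z": 0.8}
print(classify_compound(toxic_profile, is_toxic, classifiers))
print(classify_compound(safe_profile, is_toxic, classifiers))
```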
Affiliation(s)
- Nandini Raghavan
- Department of Non-Clinical Biostatistics, Johnson and Johnson Pharmaceutical Research and Development, LLC, Raritan, New Jersey 08869, USA.
|