1001
Liang Y, Kelemen A. Associating phenotypes with molecular events: recent statistical advances and challenges underpinning microarray experiments. Funct Integr Genomics 2005; 6:1-13. [PMID: 16292543 DOI: 10.1007/s10142-005-0006-z] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Received: 04/18/2005] [Revised: 06/22/2005] [Accepted: 08/16/2005] [Indexed: 10/25/2022]
Abstract
Progress in mapping the genome and developments in array technologies have provided large amounts of information for delineating the roles of genes involved in complex diseases and quantitative traits. Since complex phenotypes are determined by a network of interrelated biological traits typically involving multiple inter-correlated genetic and environmental factors that interact in a hierarchical fashion, microarrays hold tremendous latent information. The analysis of microarray data is, however, still a bottleneck. In this paper, we review the recent advances in statistical analyses for associating phenotypes with molecular events underpinning microarray experiments. Classical statistical procedures to analyze phenotypes in genetics are reviewed first, followed by descriptions of the statistical procedures for linking molecular events to measured gene expression phenotypes (microarray-based gene expression) and observed phenotypes such as disease status. These statistical procedures include (1) prior analysis, such as data quality control and normalization, for minimizing the effects of experimental artifacts and random noise; (2) gene selection and differentiation procedures based on inferential statistics for class comparisons; (3) analysis of dynamic temporal patterns through exploratory statistics such as unsupervised clustering and supervised classification and prediction; (4) assessment of the reliability of microarray studies using real-time PCR, and of reproducibility across many studies and multiple platforms. In addition, post-analysis steps that associate the discovered patterns of gene expression with pathway and functional analysis for selected genes are also considered in order to increase our understanding of interconnected gene processes.
Affiliation(s)
- Yulan Liang
- Department of Biostatistics, The State University of New York at Buffalo, Buffalo, NY 14214, USA.
1002
Segal MR. Microarray gene expression data with linked survival phenotypes: diffuse large-B-cell lymphoma revisited. Biostatistics 2005; 7:268-85. [PMID: 16284340 DOI: 10.1093/biostatistics/kxj006] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.1] [Indexed: 11/15/2022]
Abstract
Diffuse large-B-cell lymphoma (DLBCL) is an aggressive malignancy of mature B lymphocytes and is the most common type of lymphoma in adults. While treatment advances have been substantial in what was formerly a fatal disease, fewer than 50% of patients achieve lasting remission. In an effort to predict treatment success and explain disease heterogeneity, clinical features have been employed for prognostic purposes, but have yielded only modest predictive performance. This has spawned a series of high-profile microarray-based gene expression studies of DLBCL, in the hope that molecular-level information could be used to refine prognosis. The intent of this paper is to reevaluate these microarray-based prognostic assessments, and extend the statistical methodology that has been used in this context. Methodological challenges arise in using patients' gene expression profiles to predict survival endpoints on account of the large number of genes and their complex interdependence. We initially focus on the Lymphochip data and analysis of Rosenwald et al. (2002). After describing relationships between the analyses performed and gene harvesting (Hastie et al., 2001a), we argue for the utility of penalized approaches, in particular least angle regression-least absolute shrinkage and selection operator (Efron et al., 2004). While these techniques have been extended to the proportional hazards/partial likelihood framework, the resultant algorithms are computationally burdensome. We develop residual-based approximations that eliminate this burden yet perform similarly. Comparisons of predictive accuracy across both methods and studies are effected using time-dependent receiver operating characteristic curves. These indicate that gene expression data, in turn, deliver only modest predictions of posttherapy DLBCL survival. We conclude by outlining possibilities for further work.
Affiliation(s)
- Mark R Segal
- Division of Biostatistics, University of California, San Francisco, CA 94143-0560, USA.
1003
Niijima S, Kuhara S. Multiclass molecular cancer classification by kernel subspace methods with effective kernel parameter selection. J Bioinform Comput Biol 2005; 3:1071-88. [PMID: 16278948 DOI: 10.1142/s0219720005001491] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 11/26/2004] [Revised: 05/31/2005] [Accepted: 06/04/2005] [Indexed: 11/18/2022]
Abstract
Microarray techniques provide new insights into molecular classification of cancer types, which is critical for cancer treatments and diagnosis. Recently, an increasing number of supervised machine learning methods have been applied to cancer classification problems using gene expression data. Support vector machines (SVMs), in particular, have become one of the most effective and leading methods. However, there exist few studies on the application of other kernel methods in the literature. We apply a kernel subspace (KS) method to multiclass cancer classification problems, and assess its validity by comparing it with multiclass SVMs. Our comparative study using seven multiclass cancer datasets demonstrates that the KS method has high performance that is comparable to multiclass SVMs. Furthermore, we propose an effective criterion for kernel parameter selection, which is shown to be useful for the computation of the KS method.
Affiliation(s)
- Satoshi Niijima
- Department of Bioinformatics, Graduate School of Systems Life Sciences, Kyushu University, Hakozaki 6-10-1, Higashi-ku, Fukuoka 812-8581, Japan.
1004
Abstract
SUMMARY ClaNC (classification to nearest centroids) is a simple and accurate method for classifying microarrays. This document introduces a point-and-click interface to the ClaNC methodology. The software is available as an R package. AVAILABILITY ClaNC is freely available from http://students.washington.edu/adabney/clanc
Affiliation(s)
- Alan R Dabney
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.
1005
Zhang W, Rekaya R, Bertrand K. A method for predicting disease subtypes in presence of misclassification among training samples using gene expression: application to human breast cancer. Bioinformatics 2005; 22:317-25. [PMID: 16267079 DOI: 10.1093/bioinformatics/bti738] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022]
Abstract
MOTIVATION Accurate diagnosis and prediction cannot be achieved unless the disease subtype status of every training sample used in the supervised learning step is accurately known. Such an assumption requires a perfect tool for disease diagnosis and classification, which is seldom available. Thus, the supervised learning step has to be conducted with a statistical model that anticipates and handles potential mislabeling in the input data. RESULTS A procedure for handling potential mislabeling among training samples in the prediction of disease subtypes using gene expression data was proposed. A simulation study based on real data on the estrogen receptor status (ER+/ER-) of breast cancer patients was conducted. The results demonstrated that when 1-4 training samples (N = 30) were artificially mislabeled, the proposed method was able not only to correct the ER status of the mislabeled training samples but also, more importantly, to predict the ER status of validation samples as well as when 'true' training data were used.
Affiliation(s)
- Wensheng Zhang
- Department of Animal and Dairy Science, University of Georgia, Athens, GA 30602, USA
1006
Woll K, Borsuk LA, Stransky H, Nettleton D, Schnable PS, Hochholdinger F. Isolation, characterization, and pericycle-specific transcriptome analyses of the novel maize lateral and seminal root initiation mutant rum1. Plant Physiol 2005; 139:1255-67. [PMID: 16215225 PMCID: PMC1283763 DOI: 10.1104/pp.105.067330] [Citation(s) in RCA: 127] [Impact Index Per Article: 6.4] [Indexed: 05/04/2023]
Abstract
The monogenic recessive maize (Zea mays) mutant rootless with undetectable meristems 1 (rum1) is deficient in the initiation of the embryonic seminal roots and the postembryonic lateral roots at the primary root. Lateral root initiation at the shoot-borne roots and development of the aerial parts of the mutant rum1 are not affected. The mutant rum1 displays severely reduced auxin transport in the primary root and a delayed gravitropic response. Exogenously applied auxin does not induce lateral roots in the primary root of rum1. Lateral roots are initiated in a specific cell type, the pericycle. Cell-type-specific transcriptome profiling of the primary root pericycle 64 h after germination, thus before lateral root initiation, via a combination of laser capture microdissection and subsequent microarray analyses of 12k maize microarray chips revealed 90 genes preferentially expressed in the wild-type pericycle and 73 genes preferentially expressed in the rum1 pericycle (fold change >2; P-value <0.01; estimated false discovery rate of 13.8%). Among the 51 annotated genes predominately expressed in the wild-type pericycle, 19 genes are involved in signal transduction, transcription, and the cell cycle. This analysis defines an array of genes that is active before lateral root initiation and will contribute to the identification of checkpoints involved in lateral root formation downstream of rum1.
Affiliation(s)
- Katrin Woll
- Center for Plant Molecular Biology, Department of General Genetics, Eberhard Karls University, 72076 Tuebingen, Germany
1007
Booth EO, Van Driessche N, Zhuchenko O, Kuspa A, Shaulsky G. Microarray phenotyping in Dictyostelium reveals a regulon of chemotaxis genes. Bioinformatics 2005; 21:4371-7. [PMID: 16234315 DOI: 10.1093/bioinformatics/bti726] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022]
Abstract
MOTIVATION Coordinate regulation of gene expression can provide information on gene function. To begin a large-scale analysis of Dictyostelium gene function, we clustered genes based on their expression in wild-type and mutant strains and analyzed their functions. RESULTS We found 17 modes of wild-type gene expression and refined them into 57 submodes considering mutant data. Annotation analyses revealed correlations between co-expression and function and an unexpected correlation between expression and function of genes involved in various aspects of chemotaxis. Co-regulation of chemotaxis genes was also found in published data from neutrophils. To test the predictive power of the analysis, we examined the phenotypes of mutations in seven co-regulated genes that had no published role in chemotaxis. Six mutants exhibited chemotaxis defects, supporting the idea that function can be inferred from co-expression. The clustering and annotation analyses provide a public resource for Dictyostelium functional genomics.
Affiliation(s)
- Ezgi O Booth
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, TX 77030, USA
1008
Ma S, Huang J. Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics 2005; 21:4356-62. [PMID: 16234316 DOI: 10.1093/bioinformatics/bti724] [Citation(s) in RCA: 122] [Impact Index Per Article: 6.1] [Indexed: 11/12/2022]
Abstract
MOTIVATION An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease classification. Thus there is a need for developing statistical methods that can efficiently use such high-throughput genomic data, select biomarkers with discriminant power and construct classification rules. The ROC (receiver operator characteristic) technique has been widely used in disease classification with low-dimensional biomarkers because (1) it does not assume a parametric form of the class probability as required for example in the logistic regression method; (2) it accommodates case-control designs and (3) it allows treating false positives and false negatives differently. However, due to computational difficulties, the ROC-based classification has not been used with microarray data. Moreover, the standard ROC technique does not incorporate built-in biomarker selection. RESULTS We propose a novel method for biomarker selection and classification using the ROC technique for microarray data. The proposed method uses a sigmoid approximation to the area under the ROC curve as the objective function for classification and the threshold gradient descent regularization method for estimation and biomarker selection. Tuning parameter selection based on the V-fold cross validation and predictive performance evaluation are also investigated. The proposed approach is demonstrated with a simulation study, the Colon data and the Estrogen data. The proposed approach yields parsimonious models with excellent classification performance.
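As a rough illustration of the objective described in this abstract, the sketch below compares the empirical AUC of a linear score with a sigmoid-smoothed surrogate. The function names, the smoothing scale `sigma`, and the toy data are assumptions made for this example; it does not reproduce the authors' threshold gradient descent regularization or their tuning procedure.

```python
import math

def linear_score(beta, x):
    """Linear combination of biomarker values x with coefficients beta."""
    return sum(b * v for b, v in zip(beta, x))

def empirical_auc(beta, cases, controls):
    """Fraction of (case, control) pairs in which the case scores higher (ties count 0.5)."""
    wins = 0.0
    for c in cases:
        for d in controls:
            diff = linear_score(beta, c) - linear_score(beta, d)
            wins += 1.0 if diff > 0 else 0.5 if diff == 0 else 0.0
    return wins / (len(cases) * len(controls))

def sigmoid_auc(beta, cases, controls, sigma=0.1):
    """Smooth surrogate: the indicator 1[diff > 0] is replaced by a sigmoid.

    sigma is a hypothetical smoothing scale chosen for this sketch.
    """
    total = 0.0
    for c in cases:
        for d in controls:
            diff = linear_score(beta, c) - linear_score(beta, d)
            total += 1.0 / (1.0 + math.exp(-diff / sigma))
    return total / (len(cases) * len(controls))

# Toy data (hypothetical): two biomarkers per sample.
cases = [[2.0, 0.1], [1.5, 0.3], [1.8, -0.2]]
controls = [[0.5, 0.2], [0.9, 0.0], [1.6, 0.1]]
beta = [1.0, 0.0]

print(empirical_auc(beta, cases, controls))
print(sigmoid_auc(beta, cases, controls))
```

Because the smoothed version is differentiable in `beta`, it can in principle be optimized by gradient-based methods, which is the property the abstract relies on; the optimizer itself is left out here.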
Affiliation(s)
- Shuangge Ma
- Department of Biostatistics, University of Washington, Washington, USA
1009
Bhaskar H, Hoyle DC, Singh S. Machine learning in bioinformatics: a brief survey and recommendations for practitioners. Comput Biol Med 2005; 36:1104-25. [PMID: 16226240 DOI: 10.1016/j.compbiomed.2005.09.002] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.3] [Indexed: 10/25/2022]
Abstract
Machine learning is used in a large number of bioinformatics applications and studies. The application of machine learning techniques in other areas such as pattern recognition has resulted in accumulated experience as to correct and principled approaches for their use. The aim of this paper is to give an account of issues affecting the application of machine learning tools, focusing primarily on general aspects of feature and model parameter selection, rather than any single specific algorithm. These aspects are discussed in the context of published bioinformatics studies in leading journals over the last 5 years. We assess to what degree the experience gained by the pattern recognition research community pervades these bioinformatics studies. We finally discuss various critical issues relating to bioinformatic data sets and make a number of recommendations on the proper use of machine learning techniques for bioinformatics research based upon previously published research on machine learning.
Affiliation(s)
- Harish Bhaskar
- School of Engineering, Computer Science & Mathematics, University of Exeter, Exeter EX4 4QF, UK.
1010
Abstract
MOTIVATION Classification of biological samples by microarrays is a topic of much interest. A number of methods have been proposed and successfully applied to this problem. It has recently been shown that classification by nearest centroids provides an accurate predictor that may outperform much more complicated methods. The 'Prediction Analysis of Microarrays' (PAM) approach is one such example, which the authors strongly motivate by its simplicity and interpretability. In this spirit, I seek to assess the performance of classifiers simpler than even PAM. RESULTS I surprisingly show that the modified t-statistics and shrunken centroids employed by PAM tend to increase misclassification error when compared with their simpler counterparts. Based on these observations, I propose a classification method called 'Classification to Nearest Centroids' (ClaNC). ClaNC ranks genes by standard t-statistics, does not shrink centroids and uses a class-specific gene-selection procedure. Because of these modifications, ClaNC is arguably simpler and easier to interpret than PAM, and it can be viewed as a traditional nearest centroid classifier that uses specially selected genes. I demonstrate that ClaNC error rates tend to be significantly less than those for PAM, for a given number of active genes. AVAILABILITY Point-and-click software is freely available at http://students.washington.edu/adabney/clanc.
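The general idea in this abstract (rank genes by standard t-statistics, keep unshrunken centroids, assign a sample to the nearest centroid) can be sketched as follows. This is a simplified illustration and not the ClaNC software: it assumes two classes, one shared gene list rather than class-specific selection, and plain squared Euclidean distance. All function names and the toy data are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def t_statistic(a, b):
    """Standard two-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled_var = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))

def fit_nearest_centroid(samples, labels, n_genes):
    """Keep the n_genes with the largest |t| and store unshrunken class centroids."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles two classes only"
    groups = {c: [s for s, l in zip(samples, labels) if l == c] for c in classes}
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s in groups[classes[0]]]
        b = [s[g] for s in groups[classes[1]]]
        scores.append((abs(t_statistic(a, b)), g))
    selected = [g for _, g in sorted(scores, reverse=True)[:n_genes]]
    centroids = {c: [mean([s[g] for s in groups[c]]) for g in selected]
                 for c in classes}
    return selected, centroids

def predict(model, sample):
    """Assign the sample to the class whose centroid is closest on the selected genes."""
    selected, centroids = model
    def dist(c):
        return sum((sample[g] - m) ** 2 for g, m in zip(selected, centroids[c]))
    return min(centroids, key=dist)

# Toy expression matrix (hypothetical): rows are samples, columns are genes.
samples = [
    [5.0, 1.0, 0.3], [5.5, 0.8, 0.1], [4.8, 1.2, 0.2],
    [1.0, 1.1, 0.2], [1.4, 0.9, 0.4], [0.9, 1.0, 0.1],
]
labels = ["A", "A", "A", "B", "B", "B"]
model = fit_nearest_centroid(samples, labels, n_genes=1)
print(model[0], predict(model, [4.9, 1.0, 0.2]), predict(model, [1.2, 1.0, 0.2]))
```

A pooled variance of zero would make the t-statistic undefined, so real data would need a guard for constant genes; the toy data avoid that case.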
Affiliation(s)
- Alan R Dabney
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.
1011
Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 2005; 21:4084-91. [PMID: 16159913 DOI: 10.1093/bioinformatics/bti677] [Citation(s) in RCA: 207] [Impact Index Per Article: 10.4] [Indexed: 11/13/2022]
Abstract
MOTIVATION Array comparative genomic hybridization (CGH) allows detection and mapping of copy number of DNA segments. A challenge is to make inferences about the copy number structure of the genome. Several statistical methods have been proposed to determine genomic segments with different copy number levels. However, to date, no comprehensive comparison of various characteristics of these methods exists. Moreover, the segmentation results have not been utilized in downstream analyses. RESULTS We describe a comparison of three popular and publicly available methods for the analysis of array CGH data and we demonstrate how segmentation results may be utilized in the downstream analyses such as testing and classification, yielding higher power and prediction accuracy. Since the methods operate on individual chromosomes, we also propose a novel procedure for merging segments across the genome, which results in an interpretable set of copy number levels, and thus facilitate identification of copy number alterations in each genome. AVAILABILITY http://www.bioconductor.org
Affiliation(s)
- Hanni Willenbrock
- Center for Biological Sequence Analysis, Department of Biotechnology, Technical University of Denmark, Kgs. Lyngby
1012
Simon R. Roadmap for developing and validating therapeutically relevant genomic classifiers. J Clin Oncol 2005; 23:7332-41. [PMID: 16145063 DOI: 10.1200/jco.2005.02.8712] [Citation(s) in RCA: 312] [Impact Index Per Article: 15.6] [Indexed: 02/04/2023]
Abstract
Oncologists need improved tools for selecting treatments for individual patients. The development of therapeutically relevant prognostic markers has traditionally been slowed by poor study design, inconsistent findings, and lack of proper validation studies. Microarray expression profiling provides an exciting new technology for relating tumor gene expression to patient outcome, but it also provides increased challenges for translating initial research findings into robust diagnostics that benefit patients and physicians in therapeutic decision making. This article attempts to clarify some of the misconceptions about the development and validation of multigene expression signature classifiers and highlights the steps needed to move genomic signatures into clinical application as therapeutically relevant and robust diagnostics.
Affiliation(s)
- Richard Simon
- National Cancer Institute, 9000 Rockville Pike, MSC 7434, Bethesda, MD 20892, USA.
1013
Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A 2005; 102:13550-5. [PMID: 16141321 PMCID: PMC1197273 DOI: 10.1073/pnas.0506230102] [Citation(s) in RCA: 935] [Impact Index Per Article: 46.8] [Indexed: 01/10/2023]
Abstract
Perturbations of the p53 pathway are associated with more aggressive and therapeutically refractory tumors. However, molecular assessments of p53 status by sequence analysis and immunohistochemistry are incomplete assessors of p53 functional effects. We posited that the transcriptional fingerprint is a more definitive downstream indicator of p53 function. Herein, we analyzed transcript profiles of 251 p53-sequenced primary breast tumors and identified a clinically embedded 32-gene expression signature that distinguishes p53-mutant and wild-type tumors of different histologies and outperforms sequence-based assessments of p53 in predicting prognosis and therapeutic response. Moreover, the p53 signature identified a subset of aggressive tumors absent of sequence mutations in p53 yet exhibiting expression characteristics consistent with p53 deficiency because of attenuated p53 transcript levels. Our results show the primary importance of p53 functional status in predicting clinical breast cancer behavior.
Affiliation(s)
- Lance D Miller
- Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672.
1014
Wang HQ, Huang DS. Non-linear cancer classification using a modified radial basis function classification algorithm. J Biomed Sci 2005; 12:819-26. [PMID: 16132112 DOI: 10.1007/s11373-005-9007-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 04/10/2005] [Accepted: 07/01/2005] [Indexed: 10/25/2022]
Abstract
This paper proposes a modified radial basis function classification algorithm for non-linear cancer classification. In the algorithm, a modified simulated annealing method is developed and combined with the linear least square and gradient paradigms to optimize the structure of the radial basis function (RBF) classifier. The proposed algorithm can be adopted to perform non-linear cancer classification based on gene expression profiles and applied to two microarray data sets involving various human tumor classes: (1) Normal versus colon tumor; (2) acute myeloid leukemia (AML) versus acute lymphoblastic leukemia (ALL). Finally, accuracy and stability for the proposed algorithm are further demonstrated by comparing with the other cancer classification algorithms.
Affiliation(s)
- Hong-Qiang Wang
- Intelligent Computation Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Science, P.O. Box 1130, Hefei, Anhui, 230031, China.
1015
Lottaz C, Spang R. stam--a Bioconductor compliant R package for structured analysis of microarray data. BMC Bioinformatics 2005; 6:211. [PMID: 16122395 PMCID: PMC1208856 DOI: 10.1186/1471-2105-6-211] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 03/16/2005] [Accepted: 08/25/2005] [Indexed: 11/25/2022]
Abstract
BACKGROUND Genome wide microarray studies have the potential to unveil novel disease entities. Clinically homogeneous groups of patients can have diverse gene expression profiles. The definition of novel subclasses based on gene expression is a difficult problem not addressed systematically by currently available software tools. RESULTS We present a computational tool for semi-supervised molecular disease entity detection. It automatically discovers molecular heterogeneities in phenotypically defined disease entities and suggests alternative molecular sub-entities of clinical phenotypes. This is done using both gene expression data and functional gene annotations. We provide stam, a Bioconductor compliant software package for the statistical programming environment R. We demonstrate that our tool detects gene expression patterns, which are characteristic for only a subset of patients from an established disease entity. We call such expression patterns molecular symptoms. Furthermore, stam finds novel sub-group stratifications of patients according to the absence or presence of molecular symptoms. CONCLUSION Our software is easy to install and can be applied to a wide range of datasets. It provides the potential to reveal so far indistinguishable patient sub-groups of clinical relevance.
Affiliation(s)
- Claudio Lottaz
- Max Planck Institute for Molecular Genetics and Berlin Center for Genome Based Bioinformatics, Ihnestr. 73, D-14195 Berlin, Germany
- Rainer Spang
- Max Planck Institute for Molecular Genetics and Berlin Center for Genome Based Bioinformatics, Ihnestr. 73, D-14195 Berlin, Germany
1016
Huang X, Pan W, Grindle S, Han X, Chen Y, Park SJ, Miller LW, Hall J. A comparative study of discriminating human heart failure etiology using gene expression profiles. BMC Bioinformatics 2005; 6:205. [PMID: 16120216 PMCID: PMC1224853 DOI: 10.1186/1471-2105-6-205] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Received: 04/13/2005] [Accepted: 08/24/2005] [Indexed: 11/23/2022]
Abstract
Background Human heart failure is a complex disease that manifests from multiple genetic and environmental factors. Although ischemic and non-ischemic heart disease present clinically with many similar decreases in ventricular function, emerging work suggests that they are distinct diseases with different responses to therapy. The ability to distinguish between ischemic and non-ischemic heart failure may be essential to guide appropriate therapy and determine prognosis for successful treatment. In this paper we consider discriminating the etiologies of heart failure using gene expression libraries from two separate institutions. Results We apply five new statistical methods, including partial least squares, penalized partial least squares, LASSO, nearest shrunken centroids and random forest, to two real datasets and compare their performance for multiclass classification. It is found that the five statistical methods perform similarly on each of the two datasets: it is difficult to correctly distinguish the etiologies of heart failure in one dataset whereas it is easy for the other one. In a simulation study, it is confirmed that the five methods tend to have close performance, though the random forest seems to have a slight edge. Conclusions For some gene expression data, several recently developed discriminant methods may perform similarly. More importantly, one must remain cautious when assessing the discriminating performance using gene expression profiles based on a small dataset; our analysis suggests the importance of utilizing multiple or larger datasets.
Affiliation(s)
- Xiaohong Huang
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
- Suzanne Grindle
- Cardiovascular Division, Department of Medicine, Medical School, University of Minnesota, Minneapolis, MN 55455, USA
- Xinqiang Han
- Cardiovascular Division, Department of Medicine, Medical School, University of Minnesota, Minneapolis, MN 55455, USA
- Yingjie Chen
- Cardiovascular Division, Department of Medicine, Medical School, University of Minnesota, Minneapolis, MN 55455, USA
- Soon J Park
- Cardiovascular Division, Department of Medicine, Medical School, University of Minnesota, Minneapolis, MN 55455, USA
- Leslie W Miller
- Cardiovascular Division, Department of Medicine, Medical School, University of Minnesota, Minneapolis, MN 55455, USA
- Jennifer Hall
- Cardiovascular Division, Department of Medicine, Medical School, University of Minnesota, Minneapolis, MN 55455, USA
1017
Gao WM, Kuick R, Orchekowski RP, Misek DE, Qiu J, Greenberg AK, Rom WN, Brenner DE, Omenn GS, Haab BB, Hanash SM. Distinctive serum protein profiles involving abundant proteins in lung cancer patients based upon antibody microarray analysis. BMC Cancer 2005; 5:110. [PMID: 16117833 PMCID: PMC1198221 DOI: 10.1186/1471-2407-5-110] [Citation(s) in RCA: 132] [Impact Index Per Article: 6.6] [Received: 06/15/2005] [Accepted: 08/23/2005] [Indexed: 01/29/2023]
Abstract
Background Cancer serum protein profiling by mass spectrometry has uncovered mass profiles that are potentially diagnostic for several common types of cancer. However, direct mass spectrometric profiling has a limited dynamic range and difficulties in providing the identification of the distinctive proteins. We hypothesized that distinctive profiles may result from the differential expression of relatively abundant serum proteins associated with the host response. Methods Eighty-four antibodies, targeting a wide range of serum proteins, were spotted onto nitrocellulose-coated microscope slides. The abundances of the corresponding proteins were measured in 80 serum samples, from 24 newly diagnosed subjects with lung cancer, 24 healthy controls, and 32 subjects with chronic obstructive pulmonary disease (COPD). Two-color rolling-circle amplification was used to measure protein abundance. Results Seven of the 84 antibodies gave a significant difference (p < 0.01) for the lung cancer patients as compared to healthy controls, as well as compared to COPD patients. Proteins that exhibited higher abundances in the lung cancer samples relative to the control samples included C-reactive protein (CRP; a 13.3 fold increase), serum amyloid A (SAA; a 2.0 fold increase), mucin 1 and α-1-antitrypsin (1.4 fold increases). The increased expression levels of CRP and SAA were validated by Western blot analysis. Leave-one-out cross-validation was used to construct Diagonal Linear Discriminant Analysis (DLDA) classifiers. At a cutoff where all 56 of the non-tumor samples were correctly classified, 15/24 lung tumor patient sera were correctly classified. Conclusion Our results suggest that a distinctive serum protein profile involving abundant proteins may be observed in lung cancer patients relative to healthy subjects or patients with chronic disease and may have utility as part of strategies for detecting lung cancer.
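The evaluation scheme named in this abstract, leave-one-out cross-validation of a diagonal linear discriminant analysis (DLDA) classifier, can be sketched as below. This is a minimal illustration under stated assumptions: equal class priors, no feature selection inside the loop, and no tuned decision cutoff such as the one the abstract reports. The function names and the toy abundance values are hypothetical and are not taken from the study.

```python
def fit_dlda(samples, labels):
    """Estimate class means and per-feature variances pooled across classes."""
    classes = sorted(set(labels))
    n_features = len(samples[0])
    means = {}
    pooled_ss = [0.0] * n_features
    for c in classes:
        rows = [s for s, l in zip(samples, labels) if l == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        for r in rows:
            for j in range(n_features):
                pooled_ss[j] += (r[j] - means[c][j]) ** 2
    dof = len(samples) - len(classes)
    variances = [ss / dof for ss in pooled_ss]
    return means, variances

def predict_dlda(model, sample):
    """Pick the class with the smallest variance-scaled squared distance."""
    means, variances = model
    def score(c):
        return sum((x - m) ** 2 / v for x, m, v in zip(sample, means[c], variances))
    return min(means, key=score)

def loocv_accuracy(samples, labels):
    """Hold out each sample in turn, refit on the rest, and score the prediction."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit_dlda(train_x, train_y)
        correct += predict_dlda(model, samples[i]) == labels[i]
    return correct / len(samples)

# Toy data (hypothetical): two protein abundances per serum sample.
samples = [
    [3.1, 1.0], [2.9, 1.2], [3.4, 0.9], [3.0, 1.1],
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [1.1, 1.2],
]
labels = ["tumor"] * 4 + ["control"] * 4
print(loocv_accuracy(samples, labels))
```

Refitting the model inside the loop matters: estimating means and variances on all samples before holding one out would leak information into the held-out prediction.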
Affiliation(s)
- Wei-Min Gao
- Department of Internal Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Department of Critical Care Medicine, Safar Center for Resuscitation Research, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Rork Kuick
- Department of Pediatrics, University of Michigan, Ann Arbor, MI 48109, USA
- David E Misek
- Department of Pediatrics, University of Michigan, Ann Arbor, MI 48109, USA
- Ji Qiu
- Department of Pediatrics, University of Michigan, Ann Arbor, MI 48109, USA
- Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Alissa K Greenberg
- Division of Pulmonary and Critical Care Medicine, NYU Cancer Institute, NYU School of Medicine NY, NY 10016, USA
- William N Rom
- Division of Pulmonary and Critical Care Medicine, NYU Cancer Institute, NYU School of Medicine NY, NY 10016, USA
- Dean E Brenner
- Department of Internal Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Gilbert S Omenn
- Department of Internal Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Brian B Haab
- Van Andel Research Institute, Grand Rapids, MI 49503, USA
- Samir M Hanash
- Department of Pediatrics, University of Michigan, Ann Arbor, MI 48109, USA
- Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
|
1018
|
Berchuck A, Iversen ES, Lancaster JM, Pittman J, Luo J, Lee P, Murphy S, Dressman HK, Febbo PG, West M, Nevins JR, Marks JR. Patterns of gene expression that characterize long-term survival in advanced stage serous ovarian cancers. Clin Cancer Res 2005; 11:3686-96. [PMID: 15897565 DOI: 10.1158/1078-0432.ccr-04-2398] [Citation(s) in RCA: 204] [Impact Index Per Article: 10.2] [Indexed: 11/16/2022]
Abstract
PURPOSE A better understanding of the underlying biology of invasive serous ovarian cancer is critical for the development of early detection strategies and new therapeutics. The objective of this study was to define gene expression patterns associated with favorable survival. EXPERIMENTAL DESIGN RNA from 65 serous ovarian cancers was analyzed using Affymetrix U133A microarrays. This included 54 stage III/IV cases (30 short-term survivors who lived <3 years and 24 long-term survivors who lived >7 years) and 11 stage I/II cases. Genes were screened on the basis of their level of and variability in expression, leaving 7,821 for use in developing a predictive model for survival. A composite predictive model was developed that combines Bayesian classification tree and multivariate discriminant models. Leave-one-out cross-validation was used to select and evaluate models. RESULTS Patterns of genes were identified that distinguish short-term and long-term ovarian cancer survivors. The expression model developed for advanced stage disease classified all 11 early-stage ovarian cancers as long-term survivors. The MAL gene, which has been shown to confer resistance to cancer therapy, was most highly overexpressed in short-term survivors (3-fold compared with long-term survivors, and 29-fold compared with early-stage cases). These results suggest that gene expression patterns underlie differences in outcome, and an examination of the genes that provide this discrimination reveals that many are implicated in processes that define the malignant phenotype. CONCLUSIONS Differences in survival of advanced ovarian cancers are reflected by distinct patterns of gene expression. This biological distinction is further emphasized by the finding that early-stage cancers share expression patterns with the advanced stage long-term survivors, suggesting a shared favorable biology.
Affiliation(s)
- Andrew Berchuck
- Department of Obstetrics and Gynecology/Division of Gynecologic Oncology, Institute of Statistics and Decision Sciences, Center for Applied Genomics and Technology, Duke University Medical Center, Durham, North Carolina, USA.
|
1019
|
Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D. Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 2005; 21:3896-904. [PMID: 16105897 PMCID: PMC1987374 DOI: 10.1093/bioinformatics/bti631] [Citation(s) in RCA: 250] [Impact Index Per Article: 12.5] [Indexed: 11/15/2022]
Abstract
MOTIVATION Various studies have shown that cancer tissue samples can be successfully detected and classified by their gene expression patterns using machine learning approaches. One of the challenges in applying these techniques for classifying gene expression data is to extract accurate, readily interpretable rules providing biological insight as to how classification is performed. Current methods generate classifiers that are accurate but difficult to interpret. This is the trade-off between credibility and comprehensibility of the classifiers. Here, we introduce a new classifier in order to address these problems. It is referred to as k-TSP (k-Top Scoring Pairs) and is based on the concept of 'relative expression reversals'. This method generates simple and accurate decision rules that only involve a small number of gene-to-gene expression comparisons, thereby facilitating follow-up studies. RESULTS In this study, we have compared our approach to other machine learning techniques for class prediction in 19 binary and multi-class gene expression datasets involving human cancers. The k-TSP classifier performs as efficiently as Prediction Analysis of Microarray and support vector machine, and outperforms other learning methods (decision trees, k-nearest neighbour and naïve Bayes). Our approach is easy to interpret as the classifier involves only a small number of informative genes. For these reasons, we consider the k-TSP method to be a useful tool for cancer classification from microarray gene expression data. AVAILABILITY The software and datasets are available at http://www.ccbm.jhu.edu CONTACT actan@jhu.edu.
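The idea of "relative expression reversals" can be sketched for a two-class problem as follows. This is a simplified illustration, not the authors' software: pairs are scored by the difference in the frequency of `x[i] < x[j]` between classes, the top k disjoint pairs are kept, and a new sample is classified by majority vote. Tie-breaking and the choice of k are not described in the abstract, so the choices below (stable sort order, a fixed `k`) are assumptions, as are all function names and the toy data.

```python
from itertools import combinations

def pair_score(X, y, i, j):
    """Return |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)| and its direction."""
    def freq(label):
        rows = [x for x, lab in zip(X, y) if lab == label]
        return sum(x[i] < x[j] for x in rows) / len(rows)
    p0, p1 = freq(0), freq(1)
    return abs(p0 - p1), p0 > p1

def fit_ktsp(X, y, k=1):
    """Pick the k highest-scoring gene pairs that share no gene."""
    scored = []
    for i, j in combinations(range(len(X[0])), 2):
        score, class0_when_less = pair_score(X, y, i, j)
        scored.append((score, i, j, class0_when_less))
    scored.sort(key=lambda t: -t[0])  # stable sort: ties keep enumeration order
    chosen, used = [], set()
    for _, i, j, class0_when_less in scored:
        if i in used or j in used:
            continue
        chosen.append((i, j, class0_when_less))
        used.update((i, j))
        if len(chosen) == k:
            break
    return chosen

def predict_ktsp(pairs, x):
    """Each pair votes by comparing two expression values; the majority wins."""
    votes_for_0 = sum((x[i] < x[j]) == class0_when_less
                      for i, j, class0_when_less in pairs)
    return 0 if votes_for_0 * 2 > len(pairs) else 1

# Hypothetical toy data: gene 0 is below gene 1 in class 0 and above it in class 1.
X = [[1, 5, 3], [2, 6, 0.5], [1.5, 4, 7], [5, 1, 3], [6, 2, 0.5], [4, 1.5, 7]]
y = [0, 0, 0, 1, 1, 1]
pairs = fit_ktsp(X, y, k=1)
```

Because each rule only compares two values within the same sample, the decision does not depend on how the arrays were scaled, which is part of what makes such rules easy to interpret.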
Affiliation(s)
- Aik Choon Tan
- Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute, Baltimore, MD 21218, USA.
|
1020
|
Reid JF, Lusa L, De Cecco L, Coradini D, Veneroni S, Daidone MG, Gariboldi M, Pierotti MA. Limits of predictive models using microarray data for breast cancer clinical treatment outcome. J Natl Cancer Inst 2005; 97:927-30. [PMID: 15956654 DOI: 10.1093/jnci/dji153] [Citation(s) in RCA: 96] [Impact Index Per Article: 4.8] [Indexed: 11/13/2022]
Abstract
Data from microarray studies have been used to develop predictive models for treatment outcome in breast cancer, such as a recently proposed predictive model for antiestrogen response after tamoxifen treatment that was based on the expression ratio of two genes. We attempted to validate this model on an independent cohort of 58 patients with resectable estrogen receptor-positive breast cancer. We measured expression of the genes HOXB13 and IL17BR with real time-quantitative polymerase chain reaction and assessed the association between their expression and outcome by use of univariate logistic regression, area under the receiver-operating-characteristic curve (AUC), a two-sample t test, and a Mann-Whitney test. We also applied standard supervised methods to the original microarray dataset and to another independent dataset from similar patients to estimate the classification accuracy obtainable by using more than two genes in a microarray-based predictive model. We could not validate the performance of the two-gene predictor on our cohort of samples (relation between outcome and the following genes estimated by logistic regression: for HOXB13, odds ratio [OR] = 1.04, 95% confidence interval [CI] = 0.92 to 1.16, P = .54; for IL17BR, OR = 0.69, 95% CI = 0.40 to 1.20, P = .18; and for HOXB13/IL17BR, OR = 1.30, 95% CI = 0.88 to 1.93, P = .18). Similar results were obtained with the AUC, a two-sample two-sided t test, and a Mann-Whitney test. In addition, estimates of classification accuracies applied to two independent microarray datasets highlighted the poor performance of treatment-response predictive models that can be achieved with the sample sizes of patients and informative genes to date.
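The reported odds ratios, confidence intervals, and P values can be checked against each other with a short sketch. It assumes the intervals are symmetric Wald intervals on the log-odds scale, which the abstract does not state; the function name is hypothetical, and because the published figures are rounded the recovered P values only agree approximately.

```python
import math

def wald_from_or(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Recover the log-odds coefficient, its standard error, and a two-sided P value."""
    beta = math.log(odds_ratio)
    # A 95% Wald interval spans beta +/- z_crit * se on the log scale.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal tail area
    return beta, se, p

# Values as reported in the abstract.
_, _, p_hoxb13 = wald_from_or(1.04, 0.92, 1.16)
_, _, p_il17br = wald_from_or(0.69, 0.40, 1.20)
_, _, p_ratio = wald_from_or(1.30, 0.88, 1.93)
```

All three intervals include an odds ratio of 1, which is the same information the non-significant P values convey.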
Affiliation(s)
- James F Reid
- Department of Experimental Oncology, Istituto Nazionale per lo Studio e la Cura dei Tumori, Milan, Italy.
|
1021
|
Exploratory Analysis of Gene Expression Data Using Biplot. Korean Journal of Applied Statistics 2005. [DOI: 10.5351/kjas.2005.18.2.355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
1022
|
Hirai MY, Klein M, Fujikawa Y, Yano M, Goodenowe DB, Yamazaki Y, Kanaya S, Nakamura Y, Kitayama M, Suzuki H, Sakurai N, Shibata D, Tokuhisa J, Reichelt M, Gershenzon J, Papenbrock J, Saito K. Elucidation of Gene-to-Gene and Metabolite-to-Gene Networks in Arabidopsis by Integration of Metabolomics and Transcriptomics. J Biol Chem 2005; 280:25590-5. [PMID: 15866872 DOI: 10.1074/jbc.m502332200] [Citation(s) in RCA: 295] [Impact Index Per Article: 14.8] [Indexed: 11/06/2022]
Abstract
Since the completion of genome sequences of model organisms, functional identification of unknown genes has become a principal challenge in biology. Post-genomics sciences such as transcriptomics, proteomics, and metabolomics are expected to discover gene functions. This report outlines the elucidation of gene-to-gene and metabolite-to-gene networks via integration of metabolomics with transcriptomics and presents a strategy for the identification of novel gene functions. Metabolomics and transcriptomics data of Arabidopsis grown under sulfur deficiency were combined and analyzed by batch-learning self-organizing mapping. A group of metabolites/genes regulated by the same mechanism clustered together. The metabolism of glucosinolates was shown to be coordinately regulated. Three uncharacterized putative sulfotransferase genes clustering together with known glucosinolate biosynthesis genes were candidates for involvement in biosynthesis. In vitro enzymatic assays of the recombinant gene products confirmed their functions as desulfoglucosinolate sulfotransferases. Several genes involved in sulfur assimilation clustered with O-acetylserine, which is considered a positive regulator of these genes. The genes involved in anthocyanin biosynthesis clustered with the gene encoding a transcriptional factor that up-regulates specifically anthocyanin biosynthesis genes. These results suggested that regulatory metabolites and transcriptional factor genes can be identified by this approach, based on the assumption that they cluster with the downstream genes they regulate. This strategy is applicable not only to plant but also to other organisms for functional elucidation of unknown genes.
Affiliation(s)
- Masami Yokota Hirai
- Department of Molecular Biology and Biotechnology, Graduate School of Pharmaceutical Sciences, Chiba University, Chiba 263-8522, Japan
|
1023
|
Boulesteix AL, Strimmer K. Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theor Biol Med Model 2005; 2:23. [PMID: 15978125 PMCID: PMC1182396 DOI: 10.1186/1742-4682-2-23] [Citation(s) in RCA: 81] [Impact Index Per Article: 4.1] [Received: 04/15/2005] [Accepted: 06/24/2005] [Indexed: 11/10/2022]
Abstract
Background The study of the network between transcription factors and their targets is important for understanding the complex regulatory mechanisms in a cell. Unfortunately, with standard microarray experiments it is not possible to measure the transcription factor activities (TFAs) directly, as their own transcription levels are subject to post-translational modifications. Results Here we propose a statistical approach based on partial least squares (PLS) regression to infer the true TFAs from a combination of mRNA expression and DNA-protein binding measurements. This method is also statistically sound for small samples and allows the detection of functional interactions among the transcription factors via the notion of "meta"-transcription factors. In addition, it enables false positives to be identified in ChIP data and activation and suppression activities to be distinguished. Conclusion The proposed method performs very well both for simulated data and for real expression and ChIP data from yeast and E. coli experiments. It overcomes the limitations of previously used approaches to estimating TFAs. The estimated profiles may also serve as input for further studies, such as tests of periodicity or differential regulation. An R package "plsgenomics" implementing the proposed methods is available for download from the CRAN archive.
Affiliation(s)
- Anne-Laure Boulesteix
- Department of Statistics, University of Munich, Ludwigstr. 33, D-80539 Munich, Germany
- Korbinian Strimmer
- Department of Statistics, University of Munich, Ludwigstr. 33, D-80539 Munich, Germany
|
1024
|
Detours V, Wattel S, Venet D, Hutsebaut N, Bogdanova T, Tronko MD, Dumont JE, Franc B, Thomas G, Maenhaut C. Absence of a specific radiation signature in post-Chernobyl thyroid cancers. Br J Cancer 2005; 92:1545-52. [PMID: 15812549 PMCID: PMC2362019 DOI: 10.1038/sj.bjc.6602521] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.6] [Indexed: 11/23/2022]
Abstract
Thyroid cancers have been the main medical consequence of the Chernobyl accident. On the basis of their pathological features and of the fact that a large proportion of them demonstrate RET-PTC translocations, these cancers are considered similar to classical sporadic papillary carcinomas, although molecular alterations differ between the two tumour types. We analysed gene expression in post-Chernobyl cancers and sporadic papillary carcinomas, and compared both to autonomous adenomas used as controls. Unsupervised clustering of these data did not distinguish between the cancers, but separated both cancers from adenomas. No gene signature separating sporadic from post-Chernobyl PTC (chPTC) could be found using supervised and unsupervised classification methods, although such a signature is demonstrated for cancers and adenomas. Furthermore, we demonstrate that pooled RNA from sporadic and chPTC are as strongly correlated as two independent sporadic PTC pools, one from Europe and one from the US, involving patients not exposed to Chernobyl radiation. This result is based on both cDNA and Affymetrix microarrays; thus, platform-specific artifacts are controlled for. Our findings suggest the absence of a radiation fingerprint in the chPTC and support the concept that post-Chernobyl cancer data, for which the cancer-causing event and its date are known, are a unique source of information to study naturally occurring papillary carcinomas.
Affiliation(s)
- V Detours
- Institute of Interdisciplinary Research, School of Medicine, Free University of Brussels, Campus Erasme, route de Lennik 808, B-1070 Brussels, Belgium
- S Wattel
- Institute of Interdisciplinary Research, School of Medicine, Free University of Brussels, Campus Erasme, route de Lennik 808, B-1070 Brussels, Belgium
- D Venet
- Institute of Interdisciplinary Research, School of Medicine, Free University of Brussels, Campus Erasme, route de Lennik 808, B-1070 Brussels, Belgium
- N Hutsebaut
- Institute of Interdisciplinary Research, School of Medicine, Free University of Brussels, Campus Erasme, route de Lennik 808, B-1070 Brussels, Belgium
- T Bogdanova
- Institute of Endocrinology and Metabolism, 04114 Kiev, Ukraine
- M D Tronko
- Institute of Endocrinology and Metabolism, 04114 Kiev, Ukraine
- J E Dumont
- Institute of Interdisciplinary Research, School of Medicine, Free University of Brussels, Campus Erasme, route de Lennik 808, B-1070 Brussels, Belgium
- B Franc
- Service d'Anatomie et de Cytologie Pathologiques, Hôpital A Paré (AP-HP), Université de Versailles, St Quentin en Yvelines, France
- G Thomas
- South West Wales Cancer Institute/Swansea Clinical School, Singleton Hospital, Sketty Lane, Swansea SA2 8QA, UK
- C Maenhaut
- Institute of Interdisciplinary Research, School of Medicine, Free University of Brussels, Campus Erasme, route de Lennik 808, B-1070 Brussels, Belgium
|
1027
|
Bing N, Hoeschele I, Ye K, Eilertsen KJ. Finite mixture model analysis of microarray expression data on samples of uncertain biological type with application to reproductive efficiency. Vet Immunol Immunopathol 2005; 105:187-96. [PMID: 15808300 DOI: 10.1016/j.vetimm.2005.02.008] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/17/2022]
Abstract
Common goals of microarray experiments are the detection of genes that are differentially expressed between several biological types and the construction of classifiers that predict biological type of samples. Here we consider a situation where there is no training data. There is considerable interest in comparing expression profiles associated with successful pregnancies (SP) and unsuccessful pregnancies (UP) in model and farm animals. Successful pregnancy rate is known to be much higher in embryos generated by in vitro fertilization (IVF) than in nuclear transfer (NT) embryos, and higher under induced ovulation for large follicles (LF) than for small follicles (SF). The tasks of identifying genes differentially expressed between SP and UP, and predicting SP for future samples are not well accomplished by comparing IVF and NT, or LF and SF. A suitable method is finite mixture model analysis (FMMA), which models each observed class (IVF and NT, or LF and SF) as a mixture of two distributions, one for SP and one for UP, with different known or unknown proportions (here known to be 0.50 SP for IVF and 0.02 SP for NT). The means of the two distributions differ for the differentially expressed genes, which we identify via a likelihood ratio test. We confirm by simulation that FMMA strongly outperforms hierarchical clustering and linear discriminant analysis using the known class labels (NT, IVF). We apply FMMA to a real data set on IVF and NT embryos, and compute their posterior probabilities of SP, which confirm our prior knowledge of the SP proportions for IVF and NT.
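The posterior probability of SP mentioned at the end of this abstract follows from Bayes' rule once the two component distributions and the mixing proportion are fixed. The sketch below assumes, for illustration only, a single gene with Gaussian SP and UP distributions sharing one standard deviation; the means, the standard deviation, and the function names are hypothetical, while the proportions 0.50 (IVF) and 0.02 (NT) are the ones quoted in the abstract.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def posterior_sp(x, prior_sp, mu_sp, mu_up, sigma):
    """P(SP | x) for a two-component mixture with known SP proportion prior_sp."""
    num = prior_sp * normal_pdf(x, mu_sp, sigma)
    den = num + (1 - prior_sp) * normal_pdf(x, mu_up, sigma)
    return num / den

# Same hypothetical expression value, evaluated under the two observed classes.
post_ivf = posterior_sp(1.0, prior_sp=0.50, mu_sp=1.0, mu_up=0.0, sigma=1.0)
post_nt = posterior_sp(1.0, prior_sp=0.02, mu_sp=1.0, mu_up=0.0, sigma=1.0)
```

With identical expression evidence, the NT sample receives a much lower posterior than the IVF sample because its mixing proportion for SP is so small.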
Affiliation(s)
- Nan Bing
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061-0477, USA
|
1029
|
Ye J, Li Q. A two-stage linear discriminant analysis via QR-decomposition. IEEE Trans Pattern Anal Mach Intell 2005; 27:929-41. [PMID: 15943424 DOI: 10.1109/tpami.2005.110] [Citation(s) in RCA: 76] [Impact Index Per Article: 3.8] [Indexed: 05/02/2023]
Abstract
Linear Discriminant Analysis (LDA) is a well-known method for feature extraction and dimension reduction. It has been used widely in many applications involving high-dimensional data, such as image and text classification. An intrinsic limitation of classical LDA is the so-called singularity problem; that is, it fails when all scatter matrices are singular. Many LDA extensions were proposed in the past to overcome the singularity problem. Among these extensions, PCA+LDA, a two-stage method, received relatively more attention. In PCA+LDA, the LDA stage is preceded by an intermediate dimension reduction stage using Principal Component Analysis (PCA). Most previous LDA extensions are computationally expensive, and not scalable, due to the use of Singular Value Decomposition or Generalized Singular Value Decomposition. In this paper, we propose a two-stage LDA method, namely LDA/QR, which aims to overcome the singularity problem of classical LDA, while achieving efficiency and scalability simultaneously. The key difference between LDA/QR and PCA+LDA lies in the first stage, where LDA/QR applies QR decomposition to a small matrix involving the class centroids, while PCA+LDA applies PCA to the total scatter matrix involving all training data points. We further justify the proposed algorithm by showing the relationship among LDA/QR and previous LDA methods. Extensive experiments on face images and text documents are presented to show the effectiveness of the proposed algorithm.
Affiliation(s)
- Jieping Ye
- Department of Computer Science and Engineering, University of Minnesota-Twin Cities, 4-192 EE/CSCI Bldg., 200 Union St. SE, Minneapolis, MN 55455, USA.
|
1030
|
Classification analysis of the transcriptosome of nonlesional cultured dermal fibroblasts from systemic sclerosis patients with early disease. Arthritis Rheum 2005; 52:865-76. [PMID: 15751056 DOI: 10.1002/art.20871] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.0] [Indexed: 11/11/2022]
Abstract
OBJECTIVE To compare the transcriptosome of early-passage nonlesional dermal fibroblasts from systemic sclerosis (SSc) patients with diffuse disease and matched normal controls in order to gain further understanding of the gene activation patterns that occur in early disease. METHODS Total RNA was isolated from early-passage fibroblasts obtained from nonlesional skin biopsy specimens from 21 patients with diffuse SSc (disease duration <5 years in all but 1) and 18 healthy controls who were matched to the cases by age (+/-5 years), sex, and race. Array experiments were performed on a 16,659-oligonucleotide microarray utilizing a reference experimental design. Supervised methods were used to select differentially expressed genes. Quantitative polymerase chain reaction (PCR) was used to independently validate the array results. RESULTS Of the 8,324 genes that passed filtering criteria, classification analysis revealed that <5% were differentially expressed between SSc and normal fibroblasts. Individually, differentially expressed genes included COL7A1, COL18A1 (endostatin), DAF, COMP, and VEGFB. Using the panel of genes discovered through classification analysis, a set of model predictors that achieved reasonably high predictive accuracy was developed. Analysis of 1,297 gene ontology (GO) classes revealed 35 classes that were significantly dysregulated in SSc fibroblasts. These GO classes included anchoring collagen (30934), extracellular matrix structural constituent (5201), and complement activation (6958, 6956). Validation by quantitative PCR demonstrated that 7 of 7 genes selected were concordant with the array results. CONCLUSION Fibroblasts cultured from nonlesional skin of patients with SSc already have detectable abnormalities in a variety of genes and cellular processes, including those involved in extracellular matrix formation, fibrillogenesis, complement activation, and angiogenesis.
|
1032
|
Yoo C, Lee IB, Vanrolleghem PA. Interpreting patterns and analysis of acute leukemia gene expression data by multivariate fuzzy statistical analysis. Comput Chem Eng 2005. [DOI: 10.1016/j.compchemeng.2005.02.031] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 10/25/2022]
|
1033
|
Wessels LFA, Reinders MJT, Hart AAM, Veenman CJ, Dai H, He YD, van't Veer LJ. A protocol for building and evaluating predictors of disease state based on microarray data. Bioinformatics 2005; 21:3755-62. [PMID: 15817694 DOI: 10.1093/bioinformatics/bti429] [Citation(s) in RCA: 106] [Impact Index Per Article: 5.3] [Indexed: 11/12/2022]
Abstract
MOTIVATION Microarray gene expression data are increasingly employed to identify sets of marker genes that accurately predict disease development and outcome in cancer. Many computational approaches have been proposed to construct such predictors. However, there is, as yet, no objective way to evaluate whether a new approach truly improves on the current state of the art. In addition no 'standard' computational approach has emerged which enables robust outcome prediction. RESULTS An important contribution of this work is the description of a principled training and validation protocol, which allows objective evaluation of the complete methodology for constructing a predictor. We review the possible choices of computational approaches, with specific emphasis on predictor choice and reporter selection strategies. Employing this training-validation protocol, we evaluated different reporter selection strategies and predictors on six gene expression datasets of varying degrees of difficulty. We demonstrate that simple reporter selection strategies (forward filtering and shrunken centroids) work surprisingly well and outperform partial least squares in four of the six datasets. Similarly, simple predictors, such as the nearest mean classifier, outperform more complex classifiers. Our training-validation protocol provides a robust methodology to evaluate the performance of new computational approaches and to objectively compare outcome predictions on different datasets.
Affiliation(s)
- Lodewyk F A Wessels
- Department of Mediamatics, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology Mekelweg 4, 2628 CD Delft, The Netherlands.
|
1034
|
Ghadimi BM, Grade M, Difilippantonio MJ, Varma S, Simon R, Montagna C, Füzesi L, Langer C, Becker H, Liersch T, Ried T. Effectiveness of gene expression profiling for response prediction of rectal adenocarcinomas to preoperative chemoradiotherapy. J Clin Oncol 2005; 23:1826-38. [PMID: 15774776 PMCID: PMC4721601 DOI: 10.1200/jco.2005.00.406] [Citation(s) in RCA: 260] [Impact Index Per Article: 13.0] [Indexed: 12/14/2022]
Abstract
PURPOSE There is a wide spectrum of tumor responsiveness of rectal adenocarcinomas to preoperative chemoradiotherapy ranging from complete response to complete resistance. This study aimed to investigate whether parallel gene expression profiling of the primary tumor can contribute to stratification of patients into groups of responders or nonresponders. PATIENTS AND METHODS Pretherapeutic biopsies from 30 locally advanced rectal carcinomas were analyzed for gene expression signatures using microarrays. All patients were participants of a phase III clinical trial (CAO/ARO/AIO-94, German Rectal Cancer Trial) and were randomized to receive a preoperative combined-modality therapy including fluorouracil and radiation. Class comparison was used to identify a set of genes that were differentially expressed between responders and nonresponders as measured by T level downsizing and histopathologic tumor regression grading. RESULTS In an initial set of 23 patients, responders and nonresponders showed significantly different expression levels for 54 genes (P < .001). The ability to predict response to therapy using gene expression profiles was rigorously evaluated using leave-one-out cross-validation. Tumor behavior was correctly predicted in 83% of patients (P = .02). Sensitivity (correct prediction of response) was 78%, and specificity (correct prediction of nonresponse) was 86%, with a positive and negative predictive value of 78% and 86%, respectively. CONCLUSION Our results suggest that pretherapeutic gene expression profiling may assist in response prediction of rectal adenocarcinomas to preoperative chemoradiotherapy. The implementation of gene expression profiles for treatment stratification and clinical management of cancer patients requires validation in large, independent studies, which are now warranted.
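The performance figures in this abstract are all ratios over one 2x2 confusion table. The abstract does not give the underlying counts; the counts used below (7 of 9 responders and 12 of 14 nonresponders correctly predicted among the 23 patients) are an assumption, chosen because they reproduce every reported percentage after rounding. The function name is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard ratios from a 2x2 table, with 'response' as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),  # responders correctly predicted
        "specificity": tn / (tn + fp),  # nonresponders correctly predicted
        "ppv": tp / (tp + fp),          # predicted responders who responded
        "npv": tn / (tn + fn),          # predicted nonresponders who did not
    }

# Assumed counts consistent with the reported 83%, 78%, 86%, 78%, and 86%.
m = diagnostic_metrics(tp=7, fn=2, tn=12, fp=2)
percentages = {name: round(100 * value) for name, value in m.items()}
```

With only 23 patients, a single reclassified case moves accuracy by more than four percentage points, which is one reason the authors call for validation in larger studies.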
Affiliation(s)
- B Michael Ghadimi
- Genetics Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg 50, Rm 1408, 50 South Dr, Bethesda, MD 20892-8010, USA
|
1035
|
Au WH, Chan KCC, Wong AKC, Wang Y. Attribute clustering for grouping, selection, and classification of gene expression data. IEEE/ACM Trans Comput Biol Bioinform 2005; 2:83-101. [PMID: 17044174 DOI: 10.1109/tcbb.2005.17] [Citation(s) in RCA: 38] [Impact Index Per Article: 1.9] [Indexed: 05/12/2023]
Abstract
This paper presents an attribute clustering method which is able to group genes based on their interdependence so as to mine meaningful patterns from the gene expression data. It can be used for gene grouping, selection, and classification. The partitioning of a relational table into attribute subgroups allows a small number of attributes within or across the groups to be selected for analysis. By clustering attributes, the search dimension of a data mining algorithm is reduced. The reduction of search dimension is especially important to data mining in gene expression data because such data typically consist of a huge number of genes (attributes) and a small number of gene expression profiles (tuples). Most data mining algorithms are typically developed and optimized to scale to the number of tuples instead of the number of attributes. The situation becomes even worse when the number of attributes overwhelms the number of tuples, in which case the likelihood of reporting patterns that are actually irrelevant due to chance becomes rather high. It is for the aforementioned reasons that gene grouping and selection are important preprocessing steps for many data mining algorithms to be effective when applied to gene expression data. This paper defines the problem of attribute clustering and introduces a methodology for solving it. Our proposed method groups interdependent attributes into clusters by optimizing a criterion function derived from an information measure that reflects the interdependence between attributes. By applying our algorithm to gene expression data, meaningful clusters of genes are discovered. The grouping of genes based on attribute interdependence within a group helps to capture different aspects of gene association patterns in each group. Significant genes selected from each group then contain useful information for gene expression classification and identification.
To evaluate the performance of the proposed approach, we applied it to two well-known gene expression data sets and compared our results with those obtained by other methods. Our experiments show that the proposed method is able to find the meaningful clusters of genes. By selecting a subset of genes which have high multiple-interdependence with others within clusters, significant classification information can be obtained. Thus, a small pool of selected genes can be used to build classifiers with very high classification rate. From the pool, gene expressions of different categories can be identified.
Affiliation(s)
- Wai-Ho Au
- Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
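The abstract above speaks of an information measure that reflects interdependence between attributes, without stating its form. As a rough illustration only, the sketch below uses mutual information normalized by joint entropy on discretized expression levels; that choice, and the function names `entropy` and `interdependence`, are assumptions made here and are not taken from the abstract.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def interdependence(x, y):
    """Assumed measure: I(X;Y) / H(X,Y) for two discretized attributes.

    Returns a value in [0, 1]; 0.0 is returned when the joint entropy is zero.
    """
    h_xy = entropy(list(zip(x, y)))
    if h_xy == 0:
        return 0.0
    mutual_information = entropy(x) + entropy(y) - h_xy
    return mutual_information / h_xy


# Hypothetical discretized expression levels (0 = low, 1 = high) over four samples.
gene_a = [0, 0, 1, 1]
gene_b = [0, 0, 1, 1]   # moves together with gene_a
gene_c = [0, 1, 0, 1]   # unrelated to gene_a
print(interdependence(gene_a, gene_b), interdependence(gene_a, gene_c))
```

Genes whose pairwise score is high would be candidates for the same cluster; how the scores are combined into a clustering criterion is left open here.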
|
1036
|
Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal 2005. [DOI: 10.1016/j.csda.2004.03.017] [Citation(s) in RCA: 283] [Impact Index Per Article: 14.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
1037
|
Mallick BK, Ghosh D, Ghosh M. Bayesian classification of tumours by using gene expression data. J R Stat Soc Series B Stat Methodol 2005. [DOI: 10.1111/j.1467-9868.2005.00498.x] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
1038
|
Abstract
An attractive application of expression technologies is to predict drug efficacy or safety using expression data of biomarkers. To evaluate the performance of various classification methods for building predictive models, we applied these methods to six expression datasets. These datasets were from studies using microarray technologies and had either two or more classes. From each of the original datasets, two subsets were generated to simulate two scenarios in biomarker applications. First, a 50-gene subset was used to simulate a candidate gene approach when it might not be practical to measure a large number of genes/biomarkers. Next, a 2000-gene subset was used to simulate a whole genome approach. We evaluated the relative performance of several classification methods by using leave-one-out cross-validation and bootstrap cross-validation. Although all methods perform well in both subsets for a relatively easy dataset with two classes, differences in performance do exist among methods for other datasets. Overall, partial least squares discriminant analysis (PLS-DA) and support vector machines (SVM) outperform all other methods. We suggest a practical approach to take advantage of multiple methods in biomarker applications.
Affiliation(s)
- Michael Z Man
- Nonclinical Statistics, Pfizer Global Research and Development - Ann Arbor Laboratories, Ann Arbor, MI 48105, USA.
|
1039
|
Comprehensive vertical sample-based KNN/LSVM classification for gene expression analysis. J Biomed Inform 2005; 37:240-8. [PMID: 15465477 DOI: 10.1016/j.jbi.2004.07.003] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2004] [Indexed: 11/22/2022]
Abstract
Classification analysis of microarray gene expression data has been widely used to uncover biological features and to distinguish closely related cell types that often appear in the diagnosis of cancer. However, the number of dimensions of gene expression data is often very high, e.g., in the hundreds or thousands. Accurate and efficient classification of such high-dimensional data remains a contemporary challenge. In this paper, we propose a comprehensive vertical sample-based KNN/LSVM classification approach with weights optimized by genetic algorithms for high-dimensional data. Experiments on common gene expression datasets demonstrated that our approach can achieve high accuracy and efficiency at the same time. The improvement of speed is mainly related to the vertical data representation, P-tree, and its optimized logical algebra. The high accuracy is due to the combination of a KNN majority voting approach and a local support vector machine approach that makes optimal decisions at the local level. As a result, our approach could be a powerful tool for high-dimensional gene expression data analysis. (Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#:K96130308.)
|
1040
|
Bailey WJ, Ulrich R. Molecular profiling approaches for identifying novel biomarkers. Expert Opin Drug Saf 2005. [DOI: 10.1517/14740338.3.2.137] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
|
1041
|
Liu Y, Shen X, Doss H. Multicategory ψ-Learning and Support Vector Machine: Computational Tools. J Comput Graph Stat 2005. [DOI: 10.1198/106186005x37238] [Citation(s) in RCA: 78] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
|
1042
|
Lotze MT, Wang E, Marincola FM, Hanna N, Bugelski PJ, Burns CA, Coukos G, Damle N, Godfrey TE, Howell WM, Panelli MC, Perricone MA, Petricoin EF, Sauter G, Scheibenbogen C, Shivers SC, Taylor DL, Weinstein JN, Whiteside TL. Workshop on Cancer Biometrics: Identifying Biomarkers and Surrogates of Cancer in Patients. J Immunother 2005; 28:79-119. [PMID: 15725954 DOI: 10.1097/01.cji.0000154251.20125.2e] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
Abstract
The current excitement about molecular targeted therapies has driven much of the recent dialog in cancer diagnosis and treatment. Particularly in the biologic therapy of cancer, identifiable antigenic T-cell targets restricted by MHC molecules and the related novel stress molecules such as MICA/B and Letal allow a degree of precision previously unknown in cancer therapy. We have previously held workshops on immunologic monitoring and angiogenesis monitoring. This workshop was designed to discuss the state of the art in identification of biomarkers and surrogates of tumor in patients with cancer, with particular emphasis on assays within the blood and tumor. We distinguish this from immunologic monitoring in the sense that it is primarily a measure of the tumor burden as opposed to the immune response to it. Recommendations for intensive investigation and targeted funding to enable such strategies were developed in seven areas: genomic analysis; detection of molecular markers in peripheral blood and lymph node by tumor capture and RT-PCR; serum, plasma, and tumor proteomics; immune polymorphisms; high content screening using flow and imaging cytometry; immunohistochemistry and tissue microarrays; and assessment of immune infiltrate and necrosis in tumors. Concrete recommendations for current application and enabling further development in cancer biometrics are summarized. This will allow a more informed, rapid, and accurate assessment of novel cancer therapies.
Affiliation(s)
- Michael T Lotze
- Translational Research, University of Pittsburgh Molecular Medicine Institute, Pittsburgh, Pennsylvania, USA
|
1043
|
Tsafrir D, Tsafrir I, Ein-Dor L, Zuk O, Notterman DA, Domany E. Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices. Bioinformatics 2005; 21:2301-8. [PMID: 15722375 DOI: 10.1093/bioinformatics/bti329] [Citation(s) in RCA: 111] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
SUMMARY We introduce a novel unsupervised approach for the organization and visualization of multidimensional data. At the heart of the method is a presentation of the full pairwise distance matrix of the data points, viewed in pseudocolor. The ordering of points is iteratively permuted in search of a linear ordering, which can be used to study embedded shapes. Several examples indicate how the shapes of certain structures in the data (elongated, circular and compact) manifest themselves visually in our permuted distance matrix. It is important to identify the elongated objects since they are often associated with a set of hidden variables, underlying continuous variation in the data. The problem of determining an optimal linear ordering is shown to be NP-Complete, and therefore an iterative search algorithm with O(n^3) step-complexity is suggested. By using sorting points into neighborhoods (SPIN) to analyze colon cancer expression data, we were able to address the serious problem of sample heterogeneity, which hinders identification of metastasis-related genes in our data. Our methodology brings to light the continuous variation of heterogeneity, starting with homogeneous tumor samples and gradually increasing the amount of another tissue. Ordering the samples according to their degree of contamination by unrelated tissue allows the separation of genes associated with irrelevant contamination from those related to cancer progression. AVAILABILITY Software package will be available for academic users upon request.
Affiliation(s)
- D Tsafrir
- Department of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel.
|
1044
|
Yeung KY, Bumgarner RE, Raftery AE. Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics 2005; 21:2394-402. [PMID: 15713736 DOI: 10.1093/bioinformatics/bti319] [Citation(s) in RCA: 131] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Selecting a small number of relevant genes for accurate classification of samples is essential for the development of diagnostic tests. We present the Bayesian model averaging (BMA) method for gene selection and classification of microarray data. Typical gene selection and classification procedures ignore model uncertainty and use a single set of relevant genes (model) to predict the class. BMA accounts for the uncertainty about the best set to choose by averaging over multiple models (sets of potentially overlapping relevant genes). RESULTS We have shown that BMA selects smaller numbers of relevant genes (compared with other methods) and achieves a high prediction accuracy on three microarray datasets. Our BMA algorithm is applicable to microarray datasets with any number of classes, and outputs posterior probabilities for the selected genes and models. Our selected models typically consist of only a few genes. The combination of high accuracy, small numbers of genes and posterior probabilities for the predictions should make BMA a powerful tool for developing diagnostics from expression data. AVAILABILITY The source codes and datasets used are available from our Supplementary website.
Affiliation(s)
- Ka Yee Yeung
- Department of Microbiology, University of Washington, Seattle, WA 98195, USA.
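The averaging step described in the abstract above can be illustrated with the standard Bayesian model averaging identity, in which the predicted class probability is the posterior-weighted sum of each model's prediction. The sketch below shows only that step; the function name `bma_predict` and the numbers are hypothetical, and how the authors search for and score gene sets is not covered.

```python
def bma_predict(model_posteriors, class_probs):
    """Average per-model class probabilities, weighted by model posterior probability.

    model_posteriors: one non-negative weight per model (normalized here).
    class_probs: one list of class probabilities per model, same class order.
    """
    total = sum(model_posteriors)
    weights = [p / total for p in model_posteriors]
    n_classes = len(class_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, class_probs))
        for c in range(n_classes)
    ]


# Two hypothetical gene-set models that disagree about one sample.
averaged = bma_predict([0.6, 0.4], [[0.9, 0.1], [0.2, 0.8]])
print(averaged)
```

Because the weights are normalized inside the function, unnormalized model scores can be passed in as long as they are proportional to the posterior model probabilities.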
|
1045
|
Feng Z, Prentice R, Srivastava S. Research issues and strategies for genomic and proteomic biomarker discovery and validation: a statistical perspective. Pharmacogenomics 2005; 5:709-19. [PMID: 15335291 DOI: 10.1517/14622416.5.6.709] [Citation(s) in RCA: 95] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
The development and validation of clinically useful biomarkers from high-dimensional genomic and proteomic information pose great research challenges. Present bottlenecks include the fact that few of the biomarkers showing promise in initial discovery were found to warrant subsequent validation, and that biomarker validation is expensive and time consuming. Biomarker evaluation should proceed in an orderly fashion to enhance rigor and efficiency. A molecular profiling approach, although promising, has a high chance of yielding biased results and overfitted models. Specimens from cohorts or intervention trials are essential to eliminate biases. The high cost for biomarker validation motivates some novel study design features, including sequential filtering and DNA pooling. For data analysis, logistic regression (in particular, boosting logistic regression) has features of robustness against model misspecification, and has resistance to model overfitting. Model assessment and cross-validation are critical components of data analysis. Having an independent test set is a vital feature of study design.
Affiliation(s)
- Ziding Feng
- Division of Public Health Science, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA.
|
1046
|
Abstract
BACKGROUND General studies of microarray gene-expression profiling have been undertaken to predict cancer outcome. Knowledge of this gene-expression profile or molecular signature should improve treatment of patients by allowing treatment to be tailored to the severity of the disease. We reanalysed data from the seven largest published studies that have attempted to predict prognosis of cancer patients on the basis of DNA microarray analysis. METHODS The standard strategy is to identify a molecular signature (ie, the subset of genes most differentially expressed in patients with different outcomes) in a training set of patients and to estimate the proportion of misclassifications with this signature on an independent validation set of patients. We expanded this strategy (based on unique training and validation sets) by using multiple random sets, to study the stability of the molecular signature and the proportion of misclassifications. FINDINGS The list of genes identified as predictors of prognosis was highly unstable; molecular signatures strongly depended on the selection of patients in the training sets. For all but one study, the proportion misclassified decreased as the number of patients in the training set increased. Because of inadequate validation, our chosen studies published overoptimistic results compared with those from our own analyses. Five of the seven studies did not classify patients better than chance. INTERPRETATION The prognostic value of published microarray results in cancer studies should be considered with caution. We advocate the use of validation by repeated random sampling.
Affiliation(s)
- Stefan Michiels
- Biostatistics and Epidemiology Unit, Institut Gustave Roussy, Villejuif, France
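The validation strategy described in the abstract above (many random training/validation splits, tracking both the misclassification proportion and the stability of the selected genes) can be sketched as follows. This is a simplified illustration, not the authors' procedure: ranking genes by difference in class means, the nearest-centroid rule, the 0/1 labels, and all function names are assumptions introduced here.

```python
import random
from collections import Counter
from statistics import mean


def top_genes(X, y, k):
    """Rank genes by absolute difference in class means (labels 0/1); keep the top k."""
    scores = []
    for g in range(len(X[0])):
        m0 = mean(row[g] for row, lab in zip(X, y) if lab == 0)
        m1 = mean(row[g] for row, lab in zip(X, y) if lab == 1)
        scores.append((abs(m1 - m0), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]


def centroid_predict(X_train, y_train, genes, x):
    """Nearest-centroid rule restricted to the selected genes."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        dist = sum((x[g] - mean(r[g] for r in rows)) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def repeated_random_validation(X, y, n_train, k=1, n_splits=50, seed=0):
    """Return (mean misclassification proportion, selection frequency per gene)."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    errors, picks = [], Counter()
    for _ in range(n_splits):
        rng.shuffle(idx)
        train, valid = idx[:n_train], idx[n_train:]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        if len(set(y_tr)) < 2:
            continue  # skip splits whose training set lacks one class
        genes = top_genes(X_tr, y_tr, k)  # signature chosen on training data only
        picks.update(genes)
        wrong = sum(centroid_predict(X_tr, y_tr, genes, X[i]) != y[i] for i in valid)
        errors.append(wrong / len(valid))
    if not errors:
        raise ValueError("no usable split: training sets never contained both classes")
    return mean(errors), {g: c / len(errors) for g, c in picks.items()}
```

A gene whose selection frequency stays near 1 across splits belongs to a stable signature; frequencies spread thinly over many genes correspond to the instability the abstract reports.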
|
1047
|
Fu WJ, Carroll RJ, Wang S. Estimating misclassification error with small samples via bootstrap cross-validation. Bioinformatics 2005; 21:1979-86. [PMID: 15691862 DOI: 10.1093/bioinformatics/bti294] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Estimation of misclassification error has received increasing attention in clinical diagnosis and bioinformatics studies, especially in small sample studies with microarray data. Current error estimation methods are not satisfactory because they either have large variability (such as leave-one-out cross-validation) or large bias (such as resubstitution and leave-one-out bootstrap). While small sample size remains one of the key features of costly clinical investigations or of microarray studies that have limited resources in funding, time and tissue materials, accurate and easy-to-implement error estimation methods for small samples are desirable and will be beneficial. RESULTS A bootstrap cross-validation method is studied. It achieves accurate error estimation through a simple procedure with bootstrap resampling and only costs computer CPU time. Simulation studies and applications to microarray data demonstrate that it performs consistently better than its competitors. This method possesses several attractive properties: (1) it is implemented through a simple procedure; (2) it performs well for small samples, with sample sizes as small as 16; (3) it is not restricted to any particular classification rules and thus applies to many parametric or non-parametric methods.
Affiliation(s)
- Wenjiang J Fu
- Department of Statistics, Texas A & M University, College Station, 77843, USA.
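The abstract above does not spell out the procedure, so the following is one plausible reading rather than the authors' implementation: draw bootstrap samples with replacement, run leave-one-out cross-validation inside each, and average the resulting error rates. The nearest-centroid classifier and the names `nearest_centroid`, `loocv_error` and `bootstrap_cv_error` are placeholders chosen here; any rule with the same call signature could be passed in.

```python
import random
from statistics import mean


def nearest_centroid(train_X, train_y, x):
    """Placeholder classifier: predict the label of the closest class centroid."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_error(X, y, classify):
    """Leave-one-out misclassification proportion on one sample set."""
    wrong = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        wrong += classify(train_X, train_y, X[i]) != y[i]
    return wrong / len(X)


def bootstrap_cv_error(X, y, classify=nearest_centroid, n_boot=50, seed=0):
    """Average the leave-one-out error over n_boot bootstrap resamples of (X, y)."""
    rng = random.Random(seed)
    n = len(X)
    errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        boot_X = [X[i] for i in idx]
        boot_y = [y[i] for i in idx]
        errors.append(loocv_error(boot_X, boot_y, classify))
    return mean(errors)
```

Note that a bootstrap sample can contain copies of the held-out observation in its training portion, which matters for rules such as 1-nearest-neighbour; a centroid-based rule is used here partly to keep that effect small.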
|
1048
|
|
1049
|
Lottaz C, Spang R. Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics 2005; 21:1971-8. [PMID: 15677704 DOI: 10.1093/bioinformatics/bti292] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Today, the characterization of clinical phenotypes by gene-expression patterns is widely used in clinical research. If the investigated phenotype is complex from the molecular point of view, new challenges arise and these have not been addressed systematically. For instance, the same clinical phenotype can be caused by various molecular disorders, such that one observes different characteristic expression patterns in different patients. RESULTS In this paper we describe a novel algorithm called Structured Analysis of Microarrays (StAM), which accounts for molecular heterogeneity of complex clinical phenotypes. Our algorithm goes beyond established methodology in several aspects: in addition to the expression data, it exploits functional annotations from the Gene Ontology database to build biologically focussed classifiers. These are used to uncover potential molecular disease subentities and associate them to biological processes without compromising overall prediction accuracy. AVAILABILITY Bioconductor compliant R package SUPPLEMENTARY INFORMATION Complete analyses are available at http://compdiag.molgen.mpg.de/supplements/lottaz05.
Affiliation(s)
- Claudio Lottaz
- Max Planck Institute for Molecular Genetics and Berlin Center for Genome Based Bioinformatics, Germany.
|
1050
|
Campbell G. Some statistical and regulatory issues in the evaluation of genetic and genomic tests. J Biopharm Stat 2005; 14:539-52. [PMID: 15468751 DOI: 10.1081/bip-200025645] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/03/2022]
Abstract
The genomics revolution is reverberating throughout the worlds of pharmaceutical drugs, genetic testing and statistical science. This revolution, which uses single nucleotide polymorphisms (SNPs) and gene expression technology, including cDNA and oligonucleotide microarrays, for a range of tests from home-brews to high-complexity lab kits, can allow the selection or exclusion of patients for therapy (responders or poor metabolizers). The wide variety of US regulatory mechanisms for these tests is discussed. Clinical studies to evaluate the performance of such tests need to follow statistical principles for sound diagnostic test design. Statistical methodology to evaluate such studies can be wide ranging, including receiver operating characteristic (ROC) methodology, logistic regression, discriminant analysis, multiple comparison procedures, resampling, Bayesian hierarchical modeling, recursive partitioning, as well as exploratory techniques such as data mining. Recent examples of approved genetic tests are discussed.
Affiliation(s)
- Gregory Campbell
- Division of Biostatistics, Center for Devices and Radiological Health, US Food and Drug Administration, Rockville, Maryland 20850, USA.
|