1101
|
Page GP, Edwards JW, Barnes S, Weindruch R, Allison DB. A design and statistical perspective on microarray gene expression studies in nutrition: the need for playful creativity and scientific hard-mindedness. Nutrition 2003; 19:997-1000. [PMID: 14624952 DOI: 10.1016/j.nut.2003.08.001] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.1] [Indexed: 12/01/2022]
Abstract
OBJECTIVES: Our purpose is to highlight some of the past and potential future uses of microarrays in nutrition research, while also commenting on some aspects of the design, conduct, and analysis of microarray data that will lead to improved data quality. METHODS: In this review article, we outline some of the aspects of microarray experimentation that must be considered before and during these experiments. These topics include identification of the experiment's objective (hypothesis), the experimental design, sample size, statistical analysis, data verification, data handling, and experimental interpretation. RESULTS: To illustrate the principles we outline in this article, we use the methods to lay out the design of a microarray experiment to study one aspect of the observation that a diet high in soy is associated with lower rates of breast cancer. CONCLUSIONS: Microarrays are a very powerful tool for studying virtually every nutrition-related disease and trait and can provide valuable insights that are not obtainable with other techniques. However, unless nutrition researchers conduct their studies with scientific hard-mindedness, the studies will be of lower power at least, if not completely misleading.
Affiliation(s)
- Grier P Page
- Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham, Birmingham, Alabama 35294-0022, USA.
|
1102
|
Michailidis G, Shedden K. The Application of Rule-Based Methods to Class Prediction Problems in Genomics. J Comput Biol 2003; 10:689-98. [PMID: 14633393 DOI: 10.1089/106652703322539033] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022]
Abstract
We propose a method for constructing classifiers using logical combinations of elementary rules. The method is a form of rule-based classification, which has been widely discussed in the literature. In this work we focus specifically on issues that arise in the context of classifying cell samples based on RNA or protein expression measurements. The basic idea is to specify elementary rules that exhibit a locally strong pattern in favor of a single class. Strict admissibility criteria are imposed to produce a manageable universe of elementary rules. Then the elementary rules are combined using a set covering algorithm to form a composite rule that achieves a perfect fit to the training data. The user has explicit control over a parameter that determines the composite rule's level of redundancy and parsimony. This built-in control, along with the simplicity of interpreting the rules, makes the method particularly useful for classification problems in genomics. We demonstrate the new method using several microarray datasets and examine its generalization performance. We also draw comparisons to other machine-learning strategies such as CART, ID3, and C4.5.
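The combination step described in this abstract can be illustrated with a small sketch. It uses a greedy set-cover heuristic to pick elementary rules until every training sample is covered. The greedy strategy, the rule names, and the `min_cover` redundancy parameter are assumptions made for illustration; the abstract does not say which set covering algorithm or parameterization the authors use.

```python
def greedy_cover(rules, samples, min_cover=1):
    """Pick rule names until every sample is covered at least min_cover times.

    rules: dict mapping a rule name to the set of training samples it covers.
    min_cover: hypothetical redundancy knob (1 gives the most parsimonious cover).
    """
    need = {s: min_cover for s in samples}  # coverage still owed to each sample
    remaining = dict(rules)
    chosen = []
    while any(v > 0 for v in need.values()):
        def gain(name):
            # Number of still-needed samples this rule would help cover.
            return sum(1 for s in remaining[name] if need.get(s, 0) > 0)

        best = max(remaining, key=gain, default=None)
        if best is None or gain(best) == 0:
            raise ValueError("rules cannot cover every sample min_cover times")
        chosen.append(best)
        for s in remaining.pop(best):
            if need.get(s, 0) > 0:
                need[s] -= 1
    return chosen


# Hypothetical elementary rules and the training samples each one covers.
rules = {
    "gene1 > 2.0": {0, 1, 2},
    "gene7 < 0.5": {1, 2, 3},
    "gene4 > 1.1": {3, 4},
    "gene9 < 0.3": {0, 4},
}
parsimonious = greedy_cover(rules, range(5))
redundant = greedy_cover(rules, range(5), min_cover=2)
```

With `min_cover=1` the sketch selects two rules; asking for double coverage forces it to use all four, which shows the trade-off between parsimony and redundancy in miniature.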
Affiliation(s)
- George Michailidis
- Department of Statistics, University of Michigan, Ann Arbor, MI 48109-1027, USA
|
1103
|
Wright G, Tan B, Rosenwald A, Hurt EH, Wiestner A, Staudt LM. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci U S A 2003; 100:9991-6. [PMID: 12900505 PMCID: PMC187912 DOI: 10.1073/pnas.1732008100] [Citation(s) in RCA: 756] [Impact Index Per Article: 34.4] [Indexed: 11/18/2022]
Abstract
To classify cancer specimens by their gene expression profiles, we created a statistical method based on Bayes' rule that estimates the probability of membership in one of two cancer subgroups. We used this method to classify diffuse large B cell lymphoma (DLBCL) biopsy samples into two gene expression subgroups based on data obtained from spotted cDNA microarrays. The germinal center B cell-like (GCB) DLBCL subgroup expressed genes characteristic of normal germinal center B cells whereas the activated B cell-like (ABC) DLBCL subgroup expressed a subset of the genes that are characteristic of plasma cells, particularly those encoding endoplasmic reticulum and Golgi proteins involved in secretion. We next used this predictor to discover these subgroups within a second set of DLBCL biopsies that had been profiled by using oligonucleotide microarrays [Shipp, M. A., et al. (2002) Nat. Med. 8, 68-74]. The GCB and ABC DLBCL subgroups identified in this data set had significantly different 5-yr survival rates after multiagent chemotherapy (62% vs. 26%; P ≤ 0.0051), in accord with analyses of other DLBCL cohorts. These results demonstrate the ability of this gene expression-based predictor to classify DLBCLs into biologically and clinically distinct subgroups irrespective of the method used to measure gene expression.
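A minimal sketch of how Bayes' rule can turn a per-sample score into a subgroup membership probability. It assumes each sample has already been reduced to a single predictor score, that scores are normally distributed within each subgroup, and that the two subgroups have equal prior probability. These modelling choices, the function names, and the toy scores are assumptions for illustration, not details taken from the abstract.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mean_sd(values):
    """Sample mean and (n - 1) standard deviation."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(var)


def posterior_a(score, scores_a, scores_b, prior_a=0.5):
    """P(subgroup A | score) by Bayes' rule with normal class densities."""
    mu_a, sd_a = mean_sd(scores_a)
    mu_b, sd_b = mean_sd(scores_b)
    like_a = prior_a * normal_pdf(score, mu_a, sd_a)
    like_b = (1.0 - prior_a) * normal_pdf(score, mu_b, sd_b)
    return like_a / (like_a + like_b)


# Toy training scores for two hypothetical subgroups.
scores_a = [1.0, 2.0, 3.0]
scores_b = [-1.0, -2.0, -3.0]
p_mid = posterior_a(0.0, scores_a, scores_b)   # halfway between the subgroups
p_high = posterior_a(2.0, scores_a, scores_b)  # at the centre of subgroup A
```

A score midway between the two subgroup means gives a probability of one half, and a score at the centre of subgroup A gives a probability close to one. In practice a cutoff on this probability would decide whether a sample is assigned to a subgroup or left unclassified.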
Affiliation(s)
- George Wright
- Biometric Research Branch, Division of Cancer Treatment and Diagnosis, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
|
1104
|
|
1105
|
Morrison DA, Ellis JT. The design and analysis of microarray experiments: applications in parasitology. DNA Cell Biol 2003; 22:357-94. [PMID: 12906732 DOI: 10.1089/104454903767650658] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
Microarray experiments can generate enormous amounts of data, but large datasets are usually inherently complex, and the relevant information they contain can be difficult to extract. For the practicing biologist, we provide an overview of what we believe to be the most important issues that need to be addressed when dealing with microarray data. In a microarray experiment we are simply trying to identify which genes are the most "interesting" in terms of our experimental question, and these will usually be those that are either overexpressed or underexpressed (upregulated or downregulated) under the experimental conditions. Analysis of the data to find these genes involves first preprocessing of the raw data for quality control, including filtering of the data (e.g., detection of outlying values) followed by standardization of the data (i.e., making the data uniformly comparable throughout the dataset). This is followed by the formal quantitative analysis of the data, which will involve either statistical hypothesis testing or multivariate pattern recognition. Statistical hypothesis testing is the usual approach to "class comparison," where several experimental groups are being directly compared. The best approach to this problem is to use analysis of variance, although issues related to multiple hypothesis testing and probability estimation still need to be evaluated. Pattern recognition can involve "class prediction," for which a range of supervised multivariate techniques are available, or "class discovery," for which an even broader range of unsupervised multivariate techniques have been developed. Each technique has its own limitations, which need to be kept in mind when making a choice from among them. To put these ideas in context, we provide a detailed examination of two specific examples of the analysis of microarray data, both from parasitology, covering many of the most important points raised.
Affiliation(s)
- David A Morrison
- Department of Parasitology (SWEPAR), National Veterinary Institute and Swedish University of Agricultural Sciences, Uppsala, Sweden
|
1106
|
Spang R. Diagnostic signatures from microarrays: a bioinformatics concept for personalized medicine. ACTA ACUST UNITED AC 2003. [DOI: 10.1016/s1478-5382(03)02329-1] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.7] [Indexed: 11/28/2022]
|
1107
|
Baker SG. The central role of receiver operating characteristic (ROC) curves in evaluating tests for the early detection of cancer. J Natl Cancer Inst 2003; 95:511-5. [PMID: 12671018 DOI: 10.1093/jnci/95.7.511] [Citation(s) in RCA: 116] [Impact Index Per Article: 5.3] [Indexed: 11/12/2022]
Affiliation(s)
- Stuart G Baker
- Biometry Research Group, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892-7354, USA.
|
1108
|
Pepe MS, Longton G, Anderson GL, Schummer M. Selecting differentially expressed genes from microarray experiments. Biometrics 2003; 59:133-42. [PMID: 12762450 DOI: 10.1111/1541-0420.00016] [Citation(s) in RCA: 151] [Impact Index Per Article: 6.9] [Indexed: 11/28/2022]
Abstract
High throughput technologies, such as gene expression arrays and protein mass spectrometry, allow one to simultaneously evaluate thousands of potential biomarkers that could distinguish different tissue types. Of particular interest here is distinguishing between cancerous and normal organ tissues. We consider statistical methods to rank genes (or proteins) in regards to differential expression between tissues. Various statistical measures are considered, and we argue that two measures related to the Receiver Operating Characteristic Curve are particularly suitable for this purpose. We also propose that sampling variability in the gene rankings be quantified, and suggest using the "selection probability function," the probability distribution of rankings for each gene. This is estimated via the bootstrap. A real dataset, derived from gene expression arrays of 23 normal and 30 ovarian cancer tissues, is analyzed. Simulation studies are also used to assess the relative performance of different statistical gene ranking measures and our quantification of sampling variability. Our approach leads naturally to a procedure for sample-size calculations, appropriate for exploratory studies that seek to identify differentially expressed genes.
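The ranking-plus-bootstrap idea in this abstract can be sketched as follows. The sketch ranks genes by the empirical area under the ROC curve (AUC) and estimates, by resampling tissues within each group, how often each gene lands in the top `k`. Using AUC as the ROC-related measure, summarizing the ranking distribution as a top-`k` probability, and the simulated data (12 normal and 15 cancer tissues, 40 genes, only gene 0 truly differential) are all assumptions for illustration; the abstract names neither the two measures nor these settings.

```python
import random


def auc(cases, controls):
    """Empirical AUC: share of case/control pairs in which the case value is
    higher, with ties counted as one half."""
    wins = 0.0
    for c in cases:
        for d in controls:
            wins += 1.0 if c > d else 0.5 if c == d else 0.0
    return wins / (len(cases) * len(controls))


def top_k(cancer, normal, k):
    """Indices of the k genes with the largest AUC."""
    n_genes = len(cancer[0])
    scores = [auc([row[g] for row in cancer], [row[g] for row in normal])
              for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]


def selection_probability(cancer, normal, k, n_boot, rng):
    """Bootstrap estimate of P(gene is ranked in the top k) for every gene."""
    counts = [0] * len(cancer[0])
    for _ in range(n_boot):
        # Resample tissues with replacement, separately within each group.
        boot_cancer = [rng.choice(cancer) for _ in cancer]
        boot_normal = [rng.choice(normal) for _ in normal]
        for g in top_k(boot_cancer, boot_normal, k):
            counts[g] += 1
    return [c / n_boot for c in counts]


rng = random.Random(1)
n_genes = 40
normal = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(12)]
cancer = [[rng.gauss(3.0 if g == 0 else 0.0, 1.0) for g in range(n_genes)]
          for _ in range(15)]
probs = selection_probability(cancer, normal, k=3, n_boot=200, rng=rng)
```

Genes whose top-`k` probability stays high across bootstrap samples are stable candidates, while genes that enter the top `k` only occasionally reflect sampling variability in the ranking.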
Affiliation(s)
- Margaret Sullivan Pepe
- Department of Biostatistics, University of Washington, Seattle, Washington 98195-7232, USA.
|
1109
|
Craig BA, Black MA, Doerge RW. Gene expression data: The technology and statistical analysis. Journal of Agricultural, Biological, and Environmental Statistics 2003. [DOI: 10.1198/1085711031256] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.4] [Indexed: 11/21/2022]
|
1110
|
Abstract
Rapid advances in biotechnology have resulted in an increasing interest in the use of oligonucleotide and spotted cDNA gene expression microarrays for medical research. These arrays are being widely used to understand the underlying genetic structure of various diseases, with the ultimate goal to provide better diagnosis, prevention and cure. This technology allows for measurement of expression levels from several thousands of genes simultaneously, thus resulting in an enormous amount of data. The role of the statistician is critical to the successful design of gene expression studies, and the analysis and interpretation of the resulting voluminous data. This paper discusses hypotheses common to gene expression studies, and describes some of the statistical methods suitable for addressing these hypotheses. S-plus and SAS codes to perform the statistical methods are provided. Gene expression data from an unpublished oncologic study is used to illustrate these methods.
Affiliation(s)
- Jaya M Satagopan
- Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY 10021, USA.
|
1111
|
Sebastiani P, Gussoni E, Kohane IS, Ramoni MF. Statistical Challenges in Functional Genomics. Stat Sci 2003. [DOI: 10.1214/ss/1056397486] [Citation(s) in RCA: 71] [Impact Index Per Article: 3.2] [Indexed: 11/19/2022]
|
1112
|
Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comput Biol 2003; 9:505-11. [PMID: 12162889 DOI: 10.1089/106652702760138592] [Citation(s) in RCA: 234] [Impact Index Per Article: 10.6] [Indexed: 11/12/2022]
Abstract
We propose a general framework for prediction of predefined tumor classes using gene expression profiles from microarray experiments. The framework consists of 1) evaluating the appropriateness of class prediction for the given data set, 2) selecting the prediction method, 3) performing cross-validated class prediction, and 4) assessing the significance of prediction results by permutation testing. We describe an application of the prediction paradigm to gene expression profiles from human breast cancers, with specimens classified as positive or negative for BRCA1 mutations and also for BRCA2 mutations. In both cases, the accuracy of class prediction was statistically significant when compared to the accuracy of prediction expected by chance. The framework proposed here for the application of class prediction is designed to reduce the occurrence of spurious findings, a legitimate concern for high-dimensional microarray data. The prediction paradigm will serve as a good framework for comparing different prediction methods and may accelerate the development of molecular classifiers that are clinically useful.
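Steps 3 and 4 of the framework in this abstract can be sketched on simulated data. The sketch computes a leave-one-out cross-validated error, then repeats the whole cross-validation on label-shuffled copies of the data to see how often chance alone does as well. The nearest-centroid classifier, the leave-one-out scheme, the simulated data, and the number of permutations are assumptions for illustration. No gene selection is performed here; if it were, it would have to be repeated inside every cross-validation loop and every permutation.

```python
import random


def loo_error(X, y):
    """Leave-one-out error rate of a nearest-centroid classifier."""
    n, p = len(X), len(X[0])
    errors = 0
    for i in range(n):
        dist = {}
        for label in (0, 1):
            # Centroids are computed without the held-out sample i.
            members = [j for j in range(n) if j != i and y[j] == label]
            dist[label] = sum(
                (X[i][g] - sum(X[j][g] for j in members) / len(members)) ** 2
                for g in range(p))
        predicted = 0 if dist[0] <= dist[1] else 1
        errors += predicted != y[i]
    return errors / n


def permutation_p_value(X, y, n_perm, rng):
    """Share of label permutations whose cross-validated error is at least as
    low as the observed one (with the usual +1 correction)."""
    observed = loo_error(X, y)
    as_good = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)
        if loo_error(X, shuffled) <= observed:
            as_good += 1
    return observed, (1 + as_good) / (1 + n_perm)


rng = random.Random(7)
y = [0] * 8 + [1] * 8
# 30 simulated genes; the first 5 are shifted upward in class 1.
X = [[rng.gauss(3.0 if (label == 1 and g < 5) else 0.0, 1.0) for g in range(30)]
     for label in y]
observed_error, p_value = permutation_p_value(X, y, n_perm=200, rng=rng)
```

The p-value answers the question posed in the abstract: whether the observed prediction accuracy is better than the accuracy expected by chance on the same data.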
Affiliation(s)
- Michael D Radmacher
- Biometric Research Branch, National Cancer Institute, 6130 Executive Boulevard, Bethesda, MD 20892-7434, USA.
|
1113
|
Parmigiani G, Garrett ES, Irizarry RA, Zeger SL. The Analysis of Gene Expression Data: An Overview of Methods and Software. Statistics for Biology and Health 2003. [DOI: 10.1007/0-387-21679-0_1] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.9] [Indexed: 01/26/2023]
|
1114
|
|
1115
|
|
1116
|
Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003; 95:14-8. [PMID: 12509396 DOI: 10.1093/jnci/95.1.14] [Citation(s) in RCA: 798] [Impact Index Per Article: 36.3] [Indexed: 01/15/2023]
Affiliation(s)
- Richard Simon
- Biometric Research Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
|
1117
|
Abstract
DNA microarrays make possible the rapid and comprehensive assessment of the transcriptional activity of a cell, and as such have proven valuable in assessing the molecular contributors to biological processes and in the classification of human cancers. The major challenge in using this technology is the analysis of its massive data output, which requires computational means for interpretation and a heightened need for quality data. The optimal analysis requires an accounting and control of the many sources of variance within the system, an understanding of the limitations of the statistical approaches, and the ability to make sense of the results through intelligent database interrogation.
|
1118
|
Méndez MA, Hödar C, Vulpe C, González M, Cambiazo V. Discriminant analysis to evaluate clustering of gene expression data. FEBS Lett 2002; 522:24-8. [PMID: 12095613 DOI: 10.1016/s0014-5793(02)02873-9] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.3] [Indexed: 11/30/2022]
Abstract
In this work we present a procedure that combines classical statistical methods to assess the confidence of gene clusters identified by hierarchical clustering of expression data. This approach was applied to a publicly released Drosophila metamorphosis data set [White et al., Science 286 (1999) 2179-2184]. We have been able to produce reliable classifications of gene groups and genes within the groups by applying unsupervised (cluster analysis), dimension reduction (principal component analysis) and supervised methods (linear discriminant analysis) in a sequential form. This procedure provides a means to select relevant information from microarray data, reducing the number of genes and clusters that require further biological analysis.
Affiliation(s)
- Marco A Méndez
- Laboratorio de Bioinformática y Expresión Génica, INTA, Universidad de Chile, Macul 5540, Santiago, Chile
|
1119
|
Abstract
DNA microarrays are assays that simultaneously provide information about expression levels of thousands of genes and are consequently finding wide use in biomedical research. In order to control the many sources of variation and the many opportunities for misanalysis, DNA microarray studies require careful planning. Different studies have different objectives, and important aspects of design and analysis strategy differ for different types of studies. We review several types of objectives of studies using DNA microarrays and address issues such as selection of samples, levels of replication needed, allocation of samples to dyes and arrays, sample size considerations, and analysis strategies.
Affiliation(s)
- Richard Simon
- Biometric Research Branch, National Cancer Institute, Bethesda, Maryland 20892-7434, USA.
|
1120
|
Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A 2002; 99:6562-6. [PMID: 11983868 PMCID: PMC124442 DOI: 10.1073/pnas.102102699] [Citation(s) in RCA: 736] [Impact Index Per Article: 32.0] [Received: 02/20/2002] [Indexed: 11/18/2022]
Abstract
In the context of cancer diagnosis and treatment, we consider the problem of constructing an accurate prediction rule on the basis of a relatively small number of tumor tissue samples of known type containing the expression data on very many (possibly thousands) genes. Recently, results have been presented in the literature suggesting that it is possible to construct a prediction rule from only a few genes such that it has a negligible prediction error rate. However, in these results the test error or the leave-one-out cross-validated error is calculated without allowance for the selection bias. There is no allowance because the rule is either tested on tissue samples that were used in the first instance to select the genes being used in the rule or because the cross-validation of the rule is not external to the selection process; that is, gene selection is not performed in training the rule at each stage of the cross-validation process. We describe how in practice the selection bias can be assessed and corrected for by either performing a cross-validation or applying the bootstrap external to the selection process. We recommend using 10-fold rather than leave-one-out cross-validation, and concerning the bootstrap, we suggest using the so-called .632+ bootstrap error estimate designed to handle overfitted prediction rules. Using two published data sets, we demonstrate that when correction is made for the selection bias, the cross-validated error is no longer zero for a subset of only a few genes.
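The selection bias described in this abstract can be reproduced on pure noise. In the sketch below, expression values carry no class information at all, so an honest error estimate should be near 50%. Selecting genes once on all samples and then cross-validating ("internal") gives an optimistically low error, whereas repeating the gene selection inside each training fold ("external") does not. The nearest-centroid classifier, the mean-difference gene score, and the simulated dimensions are assumptions for illustration, not the authors' setup.

```python
import random


def select_genes(X, y, rows, k):
    """Indices of the k genes with the largest absolute difference in class
    means, computed only from the samples listed in rows."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]

    def score(g):
        return abs(sum(X[i][g] for i in pos) / len(pos)
                   - sum(X[i][g] for i in neg) / len(neg))

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def predict(X, y, train, genes, i):
    """Nearest-centroid prediction for sample i using the chosen genes."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [j for j in train if y[j] == label]
        dist = sum(
            (X[i][g] - sum(X[j][g] for j in members) / len(members)) ** 2
            for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error(X, y, folds, k, external):
    """Cross-validated error; gene selection is redone inside each training
    fold only when external is True."""
    all_rows = list(range(len(X)))
    fixed = None if external else select_genes(X, y, all_rows, k)
    errors = 0
    for fold in folds:
        train = [i for i in all_rows if i not in fold]
        genes = select_genes(X, y, train, k) if external else fixed
        errors += sum(predict(X, y, train, genes, i) != y[i] for i in fold)
    return errors / len(X)


rng = random.Random(0)
n, p, k = 20, 1000, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]  # noise only
y = [0] * 10 + [1] * 10
order = list(range(n))
rng.shuffle(order)
folds = [order[i::10] for i in range(10)]  # 10-fold CV, two samples per fold

internal_err = cv_error(X, y, folds, k, external=False)
external_err = cv_error(X, y, folds, k, external=True)
```

Because the data are pure noise, any apparent accuracy in `internal_err` comes entirely from letting the held-out samples influence which genes were chosen.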
Affiliation(s)
- Christophe Ambroise
- Laboratoire Heudiasyc, Unité Mixte de Recherche/Centre National de la Recherche Scientifique 6599, 60200 Compiègne, France
|
1121
|
Abstract
Pharmacogenomics is the application of genomic technologies to drug discovery and development, as well as to the elucidation of the mechanisms of drug action on cells and organisms. DNA microarrays measure genome-wide gene expression patterns and are an important tool for pharmacogenomic applications, such as the identification of molecular targets for drugs, toxicological studies, and molecular diagnostics. Genome-wide investigations generate vast amounts of data, and there is a need for computational methods to manage and analyze this information. Recently, several supervised methods, in which other information is utilized together with gene expression data, have been used to characterize genes and samples. The choice of analysis methods will influence the results and their interpretation; therefore, it is important to be familiar with each method, its scope, and its limitations. Here, methods with special reference to applications for pharmacogenomics are reviewed.
Affiliation(s)
- Markus Ringnér
- Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Building 50, Room 5142, 50 South Drive MSC 8000, Bethesda, MD 20892, USA.
|