951
|
Malossini A, Blanzieri E, Ng RT. Detecting potential labeling errors in microarrays by data perturbation. Bioinformatics 2006; 22:2114-21. [PMID: 16820424 DOI: 10.1093/bioinformatics/btl346] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.9] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Classification is widely used in medical applications. However, the quality of the classifier depends critically on the accurate labeling of the training data. But for many medical applications, labeling a sample or grading a biopsy can be subjective. Existing studies confirm this phenomenon and show that even a very small number of mislabeled samples could deeply degrade the performance of the obtained classifier, particularly when the sample size is small. The problem we address in this paper is to develop a method for automatically detecting samples that are possibly mislabeled. RESULTS We propose two algorithms, a classification-stability algorithm and a leave-one-out-error-sensitivity algorithm for detecting possibly mislabeled samples. For both algorithms, the key structure is the computation of the leave-one-out perturbation matrix. The classification-stability algorithm is based on measuring the stability of the label of a sample with respect to label changes of other samples and the version of this algorithm based on the support vector machine appears to be quite accurate for three real datasets. The suspect list produced by the version is of high quality. Furthermore, when human intervention is not available, the correction heuristic appears to be beneficial.
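As a rough illustration of the classification-stability idea described in the abstract above, the following minimal sketch flips the label of one sample at a time, re-predicts every other sample by leave-one-out, and records the disagreements in a perturbation matrix. The nearest-centroid classifier, the function names, the toy data, and the scoring rule are all assumptions made for illustration (the abstract reports its best results with a support vector machine); this is not the authors' implementation.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (an assumption): predict the label of the closest class centroid."""
    sums, counts = {}, {}
    for xi, yi in zip(train_X, train_y):
        acc = sums.setdefault(yi, [0.0] * len(x))
        for d, v in enumerate(xi):
            acc[d] += v
        counts[yi] = counts.get(yi, 0) + 1
    best_label, best_dist = None, float("inf")
    for label in sorted(sums):
        centroid = [s / counts[label] for s in sums[label]]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_perturbation_matrix(X, y):
    """M[i][j] = 1 if sample i is misclassified (leave-one-out) after flipping the label of sample j."""
    n = len(X)
    M = [[0] * n for _ in range(n)]
    for j in range(n):
        flipped = list(y)
        flipped[j] = 1 - flipped[j]  # binary labels 0/1 assumed
        for i in range(n):
            if i == j:
                continue
            train_X = X[:i] + X[i + 1:]
            train_y = flipped[:i] + flipped[i + 1:]
            M[i][j] = int(nearest_centroid_predict(train_X, train_y, X[i]) != y[i])
    return M


def suspicion_scores(M):
    """Fraction of single-label perturbations under which each sample disagrees with its given label."""
    n = len(M)
    return [sum(row) / (n - 1) for row in M]


# Toy data (hypothetical): the last sample lies in the class-0 cluster but is labeled 1.
X = [[0.0], [0.2], [0.4], [0.1], [5.0], [5.2], [4.8], [0.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
scores = suspicion_scores(loo_perturbation_matrix(X, y))
suspect = scores.index(max(scores))
```

On this toy data the sample at index 7 receives the highest score, so it would head the suspect list; on real expression data the choice of classifier and threshold would matter considerably.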
Affiliation(s)
- Andrea Malossini
- Department of Information and Communication Technology, University of Trento, 38050 Povo, Italy.
|
952
|
Lai C, Reinders MJ, Wessels L. Random subspace method for multivariate feature selection. Pattern Recognit Lett 2006. [DOI: 10.1016/j.patrec.2005.12.018] [Citation(s) in RCA: 86] [Impact Index Per Article: 4.5] [Indexed: 10/25/2022]
|
953
|
Abul O, Alhajj R, Polat F. A powerful approach for effective finding of significantly differentially expressed genes. IEEE/ACM Trans Comput Biol Bioinform 2006; 3:220-31. [PMID: 17048460 DOI: 10.1109/tcbb.2006.29] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/12/2023]
Abstract
The problem of identifying significantly differentially expressed genes for replicated microarray experiments is accepted as significant and has been tackled by several researchers. Patterns from Gene Expression (PaGE) and q-values are two of the well-known approaches developed to handle this problem. This paper proposes a powerful approach to handle this problem. We first propose a method for estimating the prior probabilities used in the first version of the PaGE algorithm. This way, the problem definition of PaGE stays intact and we just estimate the needed prior probabilities. Our estimation method is similar to Storey's estimator without being its direct extension. Then, we modify the problem formulation to find significantly differentially expressed genes and present an efficient method for finding them. This formulation increases the power by directly incorporating Storey's estimator. We report the preliminary results on the BRCA data set to demonstrate the applicability and effectiveness of our approach.
Affiliation(s)
- Osman Abul
- Department of Computer Engineering, Middle East Technical University, Ankara, Turkey.
|
954
|
Breitling R. Biological microarray interpretation: The rules of engagement. Biochim Biophys Acta 2006; 1759:319-27. [PMID: 16904203 DOI: 10.1016/j.bbaexp.2006.06.003] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.0] [Received: 05/08/2006] [Revised: 06/30/2006] [Accepted: 06/30/2006] [Indexed: 11/25/2022]
Abstract
Gene expression microarrays are now established as a standard tool in biological and biochemical laboratories. Interpreting the masses of data generated by this technology poses a number of unusual new challenges. Over the past few years a consensus has begun to emerge concerning the most important pitfalls and the proper ways to avoid them. This review provides an overview of these ideas, beginning with relevant aspects of experimental design and normalization, but focusing in particular on the various tools and concepts that help to interpret microarray results. These new approaches make it much easier to extract biologically relevant and reliable hypotheses in an objective and reasonably unbiased fashion.
Affiliation(s)
- Rainer Breitling
- Groningen Bioinformatics Centre, University of Groningen, Kerklaan 30, 9751 NN Haren, The Netherlands.
|
955
|
Baker SG, Kramer BS, McIntosh M, Patterson BH, Shyr Y, Skates S. Evaluating markers for the early detection of cancer: overview of study designs and methods. Clin Trials 2006; 3:43-56. [PMID: 16539089 DOI: 10.1191/1740774506cn130oa] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.2] [Indexed: 11/05/2022]
Abstract
BACKGROUND The field of cancer biomarker development has been evolving rapidly. New developments both in the biologic and statistical realms are providing increasing opportunities for evaluation of markers for both early detection and diagnosis of cancer. PURPOSE To review the major conceptual and methodological issues in cancer biomarker evaluation, with an emphasis on recent developments in statistical methods together with practical recommendations. METHODS We organized this review by type of study: preliminary performance, retrospective performance, prospective performance and cancer screening evaluation. RESULTS For each type of study, we discuss methodologic issues, provide examples and discuss strengths and limitations. CONCLUSION Preliminary performance studies are useful for quickly winnowing down the number of candidate markers; however their results may not apply to the ultimate target population, asymptomatic subjects. If stored specimens from cohort studies with clinical cancer endpoints are available, retrospective studies provide a quick and valid way to evaluate performance of the markers or changes in the markers prior to the onset of clinical symptoms. Prospective studies have a restricted role because they require large sample sizes, and, if the endpoint is cancer on biopsy, there may be bias due to overdiagnosis. Cancer screening studies require very large sample sizes and long follow-up, but are necessary for evaluating the marker as a trigger of early intervention.
|
956
|
Wang D, Lv Y, Guo Z, Li X, Li Y, Zhu J, Yang D, Xu J, Wang C, Rao S, Yang B. Effects of replacing the unreliable cDNA microarray measurements on the disease classification based on gene expression profiles and functional modules. Bioinformatics 2006; 22:2883-9. [PMID: 16809389 DOI: 10.1093/bioinformatics/btl339] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.7] [Indexed: 01/06/2023] Open
Abstract
MOTIVATION Microarray datasets frequently contain a large number of missing values (MVs), which need to be estimated and replaced for subsequent data mining. The focus of the paper is to study the effects of different MV treatments for cDNA microarray data on disease classification analysis. RESULTS By analyzing five datasets, we demonstrate that among three kinds of classifiers evaluated in this study, support vector machine (SVM) classifiers are robust to varied MV imputation methods [e.g. replacing MVs by zero, K nearest-neighbor (KNN) imputation algorithm, local least square imputation and Bayesian principal component analysis], while the classification and regression tree classifiers are sensitive in terms of classification accuracy. The KNN classifiers built on differentially expressed genes (DEGs) are robust to the varied MV treatments, but the performance of the KNN classifiers based on all measured genes can deteriorate significantly when imputing MVs for genes with larger missing rate (MR) (e.g. MR > 5%). Generally, while replacing MVs by zero performs relatively poorly, the other imputation algorithms differ little in their effect on the classification performance of the SVM or KNN classifiers. We further demonstrate the power and feasibility of our recently proposed functional expression profile (FEP) approach as a means to handle microarray data with MVs. The FEPs, which are derived from the functional modules that are enriched with sets of DEGs and thus can be consistently identified under varied MV treatments, achieve precise disease classification with better biological interpretation. We conclude that the choice of MV treatments should be determined in the context of the later approaches used for disease classification. The suggested exclusion criterion of ignoring the genes with larger MR (e.g. >5%), while justifiable for some classifiers such as KNN classifiers, might not be considered as a general rule for all classifiers.
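To make two of the MV treatments named in the abstract above concrete, here is a minimal sketch of zero replacement and a simple K nearest-neighbor imputation. Rows stand for genes and columns for samples; the function names, the distance (root mean squared difference over the columns both genes have observed), the zero fallback, and the toy matrix are assumptions for illustration and do not reproduce the specific algorithms evaluated in the paper.

```python
import math


def zero_impute(rows):
    """Replace every missing value (None) with zero."""
    return [[0.0 if v is None else v for v in row] for row in rows]


def knn_impute(rows, k=2):
    """Fill each missing value with the mean of that column in the k most similar rows.

    Similarity is measured on the columns observed in both rows (an assumption);
    rows that are missing the target column cannot serve as neighbors.
    """
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for j, other in enumerate(rows):
                if j == i or other[col] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[col]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            # Fall back to zero when no usable neighbor exists.
            filled[i][col] = sum(nearest) / len(nearest) if nearest else 0.0
    return filled


# Toy expression matrix (hypothetical): the first gene is missing its third measurement.
genes = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [9.0, 9.0, 9.0],
]
by_zero = zero_impute(genes)
by_knn = knn_impute(genes, k=2)
```

With `k=2` the missing entry is filled from the two genes with similar profiles rather than from the outlying fourth gene, whereas zero replacement ignores the neighbors entirely.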
Affiliation(s)
- Dong Wang
- Department of Bioinformatics and Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
|
957
|
Ooi CH, Chetty M, Teng SW. Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data. BMC Bioinformatics 2006; 7:320. [PMID: 16796748 PMCID: PMC1569877 DOI: 10.1186/1471-2105-7-320] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Received: 12/02/2005] [Accepted: 06/23/2006] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy. RESULTS We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets. CONCLUSION For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
Affiliation(s)
- Chia Huey Ooi
- Gippsland School of Information Technology, Monash University, Churchill, VIC 3842, Australia
- Madhu Chetty
- Gippsland School of Information Technology, Monash University, Churchill, VIC 3842, Australia
- Shyh Wei Teng
- Gippsland School of Information Technology, Monash University, Churchill, VIC 3842, Australia
|
958
|
Vinciotti V, Tucker A, Kellam P, Liu X. Robust Selection of Predictive Genes via a Simple Classifier. Appl Bioinformatics 2006; 5:1-11. [PMID: 16539532 DOI: 10.2165/00822942-200605010-00001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/02/2022]
Abstract
Identifying genes that direct the mechanism of a disease from expression data is extremely useful in understanding how that mechanism works. This in turn may lead to better diagnoses and potentially could lead to a cure for that disease. This task becomes extremely challenging when the data are characterised by only a small number of samples and a high number of dimensions, as is often the case with gene expression data. Motivated by this challenge, we present a general framework that focuses on simplicity and data perturbation. These are the keys for robust identification of the most predictive features in such data. Within this framework, we propose a simple selective naive Bayes classifier discovered using a global search technique, and combine it with data perturbation to increase its robustness for small sample sizes. An extensive validation of the method was carried out using two applied datasets from the field of microarrays and a simulated dataset, all confounded by small sample sizes and high dimensionality. The method has been shown to be capable of selecting genes known to be associated with prostate cancer and viral infections.
Affiliation(s)
- Veronica Vinciotti
- School of Information Systems, Computing and Mathematics, Brunel University, Uxbridge, UK.
|
959
|
Xiong H, Chen XW. Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics 2006; 7:299. [PMID: 16774678 PMCID: PMC1513256 DOI: 10.1186/1471-2105-7-299] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Received: 12/05/2005] [Accepted: 06/14/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The most fundamental task using gene expression data in clinical oncology is to classify tissue samples according to their gene expression levels. Compared with traditional pattern classifications, gene expression-based data classification is typically characterized by high dimensionality and small sample size, which make the task quite challenging. RESULTS In this paper, we present a modified K-nearest-neighbor (KNN) scheme, which is based on learning an adaptive distance metric in the data space, for cancer classification using microarray data. The distance metric, derived from the procedure of a data-dependent kernel optimization, can substantially increase the class separability of the data and, consequently, lead to a significant improvement in the performance of the KNN classifier. Intensive experiments show that the performance of the proposed kernel-based KNN scheme is competitive with that of some sophisticated classifiers such as support vector machines (SVMs) and the uncorrelated linear discriminant analysis (ULDA) in classifying the gene expression data. CONCLUSION A novel distance metric is developed and incorporated into the KNN scheme for cancer classification. This metric can substantially increase the class separability of the data in the feature space and, hence, lead to a significant improvement in the performance of the KNN classifier.
Affiliation(s)
- Huilin Xiong
- Bioinformatics and Computational Life Sciences Laboratory, Department of Electrical Engineering and Computer Science, University of Kansas, 2335 Irving Hill Road, Lawrence, Kansas 66045, USA
- Xue-wen Chen
- Bioinformatics and Computational Life Sciences Laboratory, Department of Electrical Engineering and Computer Science, University of Kansas, 2335 Irving Hill Road, Lawrence, Kansas 66045, USA
- Kansas Masonic Cancer Research Institute, Kansas City, Kansas, USA
|
960
|
Abstract
Studies that include high-throughput data, such as gene expression data, raise unique issues with respect to study design and analysis. At the same time, they should be viewed through the lens (albeit a modified one) of standard scientific approach that involves such issues as specifying objectives (even if the study is mainly hypothesis generating or exploratory), a careful consideration of design, including sample size and replication, deciding whether to include technical replication in addition to biological replication, and ensuring that the methods of analysis are appropriate for the objective.
Affiliation(s)
- Jennifer Shoemaker
- Duke University, Department of Biostatistics and Bioinformatics, 2424 Erwin Road, Hock Plaza, Suite 802, Durham, NC 27705 USA.
|
961
|
|
962
|
Abstract
Modern research in cancer has been revolutionized by the introduction of new high-throughput methodologies such as DNA microarrays. Keeping pace with these technologies, bioinformatics offers new solutions for data analysis and, more importantly, makes it possible to formulate a new class of hypotheses inspired by systems biology and oriented towards blocks of functionally related genes. Although software implementations of these new methodologies are recent, some options are already available. Bioinformatic solutions for other high-throughput techniques, such as array-CGH or large-scale genotyping, are also reviewed.
Affiliation(s)
- Joaquín Dopazo
- Department of Bioinformatics, Centro de Investigación Príncipe Felipe, Valencia, Spain.
|
963
|
Huang DS, Zheng CH. Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics 2006; 22:1855-62. [PMID: 16709589 DOI: 10.1093/bioinformatics/btl190] [Citation(s) in RCA: 238] [Impact Index Per Article: 12.5] [Indexed: 01/14/2023] Open
Abstract
MOTIVATION Microarrays are capable of determining the expression levels of thousands of genes simultaneously. One important application of gene expression data is classification of samples into categories. In combination with classification methods, this technology can be useful to support clinical management decisions for individual patients, e.g. in oncology. Standard statistical methodologies in classification or prediction do not work well when the number of variables p (genes) far exceeds the number of samples n. So, modification of existing statistical methodologies or development of new methodologies is needed for the analysis of microarray data. RESULTS This paper proposes a new method for tumor classification using gene expression data. In this method, we first employ independent component analysis to model the gene expression data, then apply an optimal scoring algorithm to classify them. First, this approach can make full use of the high-order statistical information contained in the gene expression data. Second, this approach also employs regularized regression models to handle the situation of large numbers of correlated predictor variables. Finally, the predictive models are developed for classifying tumors based on the entire gene expression profile. To show the validity of the proposed method, we apply it to classify four DNA microarray datasets involving various human normal and tumor tissue samples. The experimental results show that the method is efficient and feasible. AVAILABILITY Matlab scripts are available on request.
Affiliation(s)
- De-Shuang Huang
- Intelligent Computing Lab, Institute of Intelligent Machines, Chinese Academy of Sciences PO Box 1130, Hefei, Anhui 230031, China.
|
964
|
Ma S, Song X, Huang J. Regularized binormal ROC method in disease classification using microarray data. BMC Bioinformatics 2006; 7:253. [PMID: 16684357 PMCID: PMC1513612 DOI: 10.1186/1471-2105-7-253] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Received: 12/03/2005] [Accepted: 05/09/2006] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease diagnosis and prognosis. Thus it is of interest to develop efficient statistical methods that can simultaneously identify important biomarkers from such high-throughput genomic data and construct appropriate classification rules. It is also of interest to develop methods for evaluation of classification performance and ranking of identified biomarkers. RESULTS The ROC (receiver operating characteristic) technique has been widely used in disease classification with low dimensional biomarkers. Compared with the empirical ROC approach, the binormal ROC is computationally more affordable and robust in small sample size cases. We propose using the binormal AUC (area under the ROC curve) as the objective function for two-sample classification, and the scaled threshold gradient directed regularization method for regularized estimation and biomarker selection. Tuning parameter selection is based on V-fold cross validation. We develop Monte Carlo based methods for evaluating the stability of individual biomarkers and overall prediction performance. Extensive simulation studies show that the proposed approach can generate parsimonious models with excellent classification and prediction performance, under most simulated scenarios including model mis-specification. Application of the method to two cancer studies shows that the identified genes are reasonably stable with satisfactory prediction performance and biologically sound implications. The overall classification performance is satisfactory, with small classification errors and large AUCs. CONCLUSION In comparison to existing methods, the proposed approach is computationally more affordable without losing the optimality possessed by the standard ROC method.
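For readers unfamiliar with the two AUC estimates contrasted in the abstract above, the following minimal sketch computes both for a single score. Under the binormal model, control scores are taken as N(mu0, s0^2) and case scores as N(mu1, s1^2), giving AUC = Phi((mu1 - mu0) / sqrt(s0^2 + s1^2)); the empirical AUC is the Mann-Whitney proportion of case-control pairs ranked correctly. The function names and toy scores are assumptions, and the sketch does not include the regularized estimation or biomarker selection the paper proposes.

```python
import math
from statistics import mean, variance


def binormal_auc(controls, cases):
    """AUC under the binormal model: Phi((mu1 - mu0) / sqrt(s0^2 + s1^2))."""
    mu0, mu1 = mean(controls), mean(cases)
    var0, var1 = variance(controls), variance(cases)  # sample variances
    z = (mu1 - mu0) / math.sqrt(var0 + var1)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def empirical_auc(controls, cases):
    """Mann-Whitney estimate: P(case > control) + 0.5 * P(case == control)."""
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


# Toy marker scores (hypothetical).
control_scores = [0.0, 1.0, 2.0]
case_scores = [3.0, 4.0, 5.0]
auc_binormal = binormal_auc(control_scores, case_scores)
auc_empirical = empirical_auc(control_scores, case_scores)
```

The binormal estimate depends only on the two means and variances, which is why it is smooth in the underlying scores, while the empirical estimate is a step function of the pairwise rankings.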
Affiliation(s)
- Shuangge Ma
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- Xiao Song
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- Jian Huang
- Department of Statistics & Actuarial Science and Program in Public Health Genetics, University of Iowa, Iowa City, IA 52242, USA
|
965
|
Gulmann C, Sheehan KM, Kay EW, Liotta LA, Petricoin EF. Array-based proteomics: mapping of protein circuitries for diagnostics, prognostics, and therapy guidance in cancer. J Pathol 2006; 208:595-606. [PMID: 16518808 DOI: 10.1002/path.1958] [Citation(s) in RCA: 90] [Impact Index Per Article: 4.7] [Indexed: 01/13/2023]
Abstract
The human proteome, due to the enormity of post-translational permutations that result in large numbers of isoforms, is much more complex than the genome and alterations in cancer can occur in ways that are not predictable by translational analysis alone. Proteomic analysis therefore represents a more direct way of investigating disease at the individual patient level. Furthermore, since most novel therapeutic targets are proteins, proteomic analysis potentially has a central role in patient care. At the same time, it is becoming clear that mapping entire networks rather than individual markers may be necessary for robust diagnostics as well as tailoring of therapy. Consequently, there is a need for high-throughput multiplexed proteomic techniques, with the capability of scanning multiple cases and analysing large numbers of endpoints. New types of protein arrays combined with advanced bioinformatics are currently being used to identify molecular signatures of individual tumours based on protein pathways and signalling cascades. It is envisaged that analysing the cellular 'circuitry' of ongoing molecular networks will become a powerful clinical tool in patient management.
Affiliation(s)
- C Gulmann
- NCI-FDA Clinical Proteomics Program, Laboratory of Pathology, National Cancer Institute, Bethesda, MD 20892, USA.
|
966
|
Knijnenburg TA, Reinders MJ, Wessels LF. Artifacts of Markov blanket filtering based on discretized features in small sample size applications. Pattern Recognit Lett 2006. [DOI: 10.1016/j.patrec.2005.10.019] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/25/2022]
|
967
|
Yang K, Cai Z, Li J, Lin G. A stable gene selection in microarray data analysis. BMC Bioinformatics 2006; 7:228. [PMID: 16643657 PMCID: PMC1524991 DOI: 10.1186/1471-2105-7-228] [Citation(s) in RCA: 74] [Impact Index Per Article: 3.9] [Received: 10/29/2005] [Accepted: 04/27/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Microarray data analysis is notorious for involving a huge number of genes compared to a relatively small number of samples. Gene selection is to detect the most significantly differentially expressed genes under different conditions, and it has been a central research focus. In general, a better gene selection method can improve the performance of classification significantly. One of the difficulties in gene selection is that the numbers of samples under different conditions vary a lot. RESULTS Two novel gene selection methods are proposed in this paper, which are not affected by the unbalanced sample class sizes and do not assume any explicit statistical model on the gene expression values. They were evaluated on eight publicly available microarray datasets, using leave-one-out cross-validation and 5-fold cross-validation. The performance is measured by the classification accuracies using the top ranked genes based on the training datasets. CONCLUSION The experimental results showed that the proposed gene selection methods are efficient, effective, and robust in identifying differentially expressed genes. Adopting the existing SVM-based and KNN-based classifiers, the selected genes by our proposed methods in general give more accurate classification results, typically when the sample class sizes in the training dataset are unbalanced.
Affiliation(s)
- Kun Yang
- Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin 150001, China
- Zhipeng Cai
- Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
- Jianzhong Li
- Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin 150001, China
- Guohui Lin
- Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
|
968
|
Özdağ H, Teschendorff AE, Ahmed AA, Hyland SJ, Blenkiron C, Bobrow L, Veerakumarasivam A, Burtt G, Subkhankulova T, Arends MJ, Collins VP, Bowtell D, Kouzarides T, Brenton JD, Caldas C. Differential expression of selected histone modifier genes in human solid cancers. BMC Genomics 2006; 7:90. [PMID: 16638127 PMCID: PMC1475574 DOI: 10.1186/1471-2164-7-90] [Citation(s) in RCA: 188] [Impact Index Per Article: 9.9] [Received: 11/10/2005] [Accepted: 04/25/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Post-translational modification of histones resulting in chromatin remodelling plays a key role in the regulation of gene expression. Here we report characteristic patterns of expression of 12 members of 3 classes of chromatin modifier genes in 6 different cancer types: histone acetyltransferases (HATs)- EP300, CREBBP, and PCAF; histone deacetylases (HDACs)- HDAC1, HDAC2, HDAC4, HDAC5, HDAC7A, and SIRT1; and histone methyltransferases (HMTs)- SUV39H1 and SUV39H2. Expression of each gene in 225 samples (135 primary tumours, 47 cancer cell lines, and 43 normal tissues) was analysed by QRT-PCR, normalized with 8 housekeeping genes, and given as a ratio by comparison with a universal reference RNA. RESULTS This involved a total of 13,000 PCR assays allowing for rigorous analysis by fitting a linear regression model to the data. Mutation analysis of HDAC1, HDAC2, SUV39H1, and SUV39H2 revealed only two out of 181 cancer samples (both cell lines) with significant coding-sequence alterations. Supervised analysis and Independent Component Analysis showed that expression of many of these genes was able to discriminate tumour samples from their normal counterparts. Clustering based on the normalized expression ratios of the 12 genes also showed that most samples were grouped according to tissue type. Using a linear discriminant classifier and internal cross-validation revealed that with as few as 5 of the 12 genes, SIRT1, CREBBP, HDAC7A, HDAC5 and PCAF, most samples were correctly assigned. CONCLUSION The expression patterns of HATs, HDACs, and HMTs suggest these genes are important in neoplastic transformation and have characteristic patterns of expression depending on tissue of origin, with implications for potential clinical application.
Affiliation(s)
- Hilal Özdağ
- Cancer Genomics Program, Department of Oncology, Hutchison/MRC Research Centre, University of Cambridge, Cambridge CB2 2XZ, UK
- Ankara University, Institute of Biotechnology, Beşevler 06500 Ankara, Turkey
| | - Andrew E Teschendorff
- Cancer Genomics Program, Department of Oncology, Hutchison/MRC Research Centre, University of Cambridge, Cambridge CB2 2XZ, UK
| | - Ahmed Ashour Ahmed
- Cancer Genomics Program, Department of Oncology, Hutchison/MRC Research Centre, University of Cambridge, Cambridge CB2 2XZ, UK
| | - Sarah J Hyland
- Cancer Genomics Program, Department of Oncology, Hutchison/MRC Research Centre, University of Cambridge, Cambridge CB2 2XZ, UK
| | - Cherie Blenkiron
- Cancer Genomics Program, Department of Oncology, Hutchison/MRC Research Centre, University of Cambridge, Cambridge CB2 2XZ, UK
- Cambridge NTRAC Centre, Cambridge, UK
| | - Linda Bobrow
- Molecular Histopathology, Pathology Department, Addenbrooke's Hospital, University of Cambridge Box 235, Level 3, Hills Road, Cambridge CB2 2QQ, UK
| | - Abhi Veerakumarasivam
- Cancer Genomics Program, Department of Oncology, Hutchison/MRC Research Centre, University of Cambridge, Cambridge CB2 2XZ, UK
| | - Glynn Burtt
- Cancer Genomics Program, Department of Oncology, Hutchison/MRC Research Centre, University of Cambridge, Cambridge CB2 2XZ, UK
| | - Tanya Subkhankulova
- Cancer Genomics Program, Department of Oncology, Hutchison/MRC Research Centre, University of Cambridge, Cambridge CB2 2XZ, UK
| | - Mark J Arends
- Molecular Histopathology, Pathology Department, Addenbrooke's Hospital, University of Cambridge Box 235, Level 3, Hills Road, Cambridge CB2 2QQ, UK
| | - V Peter Collins
- Molecular Histopathology, Pathology Department, Addenbrooke's Hospital, University of Cambridge Box 235, Level 3, Hills Road, Cambridge CB2 2QQ, UK
| | - David Bowtell
- Ian Potter Centre for Cancer Genomics and Predictive Medicine, Peter MacCallum Cancer Centre, St. Andrew's Place, East Melbourne, Victoria 3002, Australia
| | - Tony Kouzarides
- Wellcome/Cancer Research UK Gurdon Institute and Department of Pathology, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK
| | - James D Brenton
- Cancer Genomics Program, Department of Oncology, Hutchison/MRC Research Centre, University of Cambridge, Cambridge CB2 2XZ, UK
- Cambridge NTRAC Centre, Cambridge, UK
| | - Carlos Caldas
- Cancer Genomics Program, Department of Oncology, Hutchison/MRC Research Centre, University of Cambridge, Cambridge CB2 2XZ, UK
- Cambridge NTRAC Centre, Cambridge, UK
|
969
|
Kustra R, Shioda R, Zhu M. A factor analysis model for functional genomics. BMC Bioinformatics 2006; 7:216. [PMID: 16630343 PMCID: PMC1468435 DOI: 10.1186/1471-2105-7-216] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Received: 10/13/2005] [Accepted: 04/21/2006] [Indexed: 11/10/2022] Open
Abstract
Background Expression array data are used to predict biological functions of uncharacterized genes by comparing their expression profiles to those of characterized genes. While biologically plausible, this is both statistically and computationally challenging. Typical approaches are computationally expensive and ignore correlations among expression profiles and functional categories. Results We propose a factor analysis model (FAM) for functional genomics and give a two-step algorithm, using genome-wide expression data for yeast and a subset of Gene-Ontology Biological Process functional annotations. We show that the predictive performance of our method is comparable to the current best approach while our total computation time was faster by a factor of 4000. We discuss the unique challenges in performance evaluation of algorithms used for genome-wide functional genomics. Finally, we discuss extensions to our method that can incorporate the inherent correlation structure of the functional categories to further improve predictive performance. Conclusion Our factor analysis model is a computationally efficient technique for functional genomics and provides a clear and unified statistical framework with potential for incorporating important gene ontology information to improve predictions.
Affiliation(s)
- Rafal Kustra
- Public Health Sciences, Health Sciences Bldg, University of Toronto, Toronto, ON, Canada
| | - Romy Shioda
- Department of Combinatorics and Optimization, University of Waterloo, Waterloo, ON, Canada
| | - Mu Zhu
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada
|
970
|
Abstract
In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare the four versions of the validation procedure, which all use the IGP but differ in the way in which the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro" available through The Comprehensive R Archive Network (http://cran.r-project.org).
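As a rough illustration of the IGP idea, the sketch below computes, for one cluster, the fraction of its members whose nearest neighbour carries the same cluster label. The abstract does not spell out the definition, so this reading is an assumption, and so are Euclidean distance, the function names, and the toy points; it is not the clusterRepro implementation.

```python
import math

def nearest_neighbor(i, points):
    """Index of the point closest to points[i] (Euclidean), excluding i itself."""
    best_j, best_d = None, math.inf
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)
        if d < best_d:
            best_j, best_d = j, d
    return best_j

def in_group_proportion(points, labels, cluster):
    """Fraction of samples labelled `cluster` whose nearest neighbour is also in `cluster`."""
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    if not members:
        raise ValueError("cluster has no members")
    hits = sum(labels[nearest_neighbor(i, points)] == cluster for i in members)
    return hits / len(members)

# Hypothetical toy data: the last point is labelled "B" but sits among the "A" points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (0.2, 0.2)]
labels = ["A", "A", "A", "B", "B", "B"]
igp_a = in_group_proportion(points, labels, "A")
igp_b = in_group_proportion(points, labels, "B")
```

In this toy case cluster "A" is fully self-consistent, while one of the three "B" samples has an "A" neighbour, which lowers the score for "B".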
Affiliation(s)
- Amy V Kapp
- Department of Statistics, Stanford University, Stanford, CA 94305-4065, USA.
|
971
|
|
972
|
Abstract
We develop a new statistic for testing the equality of two multivariate mean vectors. A scaled chi-squared distribution is proposed as an approximating null distribution. Because the test statistic is based on componentwise statistics, it has the advantage over Hotelling's T² test of being applicable to the case where the dimension of an observation exceeds the number of observations. An appealing feature of the new test is its ability to handle missing data by relying on only componentwise sample moments. Monte Carlo studies indicate good power compared to Hotelling's T² and a recently proposed test by Srivastava (2004, Technical Report, University of Toronto). The test is applied to drug discovery data.
Affiliation(s)
- Yujun Wu
- Department of Biostatistics, School of Public Health, University of Medicine and Dentistry of New Jersey, Piscataway, New Jersey 08854, USA.
|
973
|
Zhu H, Hu S, Jona G, Zhu X, Kreiswirth N, Willey BM, Mazzulli T, Liu G, Song Q, Chen P, Cameron M, Tyler A, Wang J, Wen J, Chen W, Compton S, Snyder M. Severe acute respiratory syndrome diagnostics using a coronavirus protein microarray. Proc Natl Acad Sci U S A 2006; 103:4011-6. [PMID: 16537477 PMCID: PMC1449637 DOI: 10.1073/pnas.0510921103] [Citation(s) in RCA: 100] [Impact Index Per Article: 5.3] [Indexed: 02/04/2023] Open
Abstract
To monitor severe acute respiratory syndrome (SARS) infection, a coronavirus protein microarray that harbors proteins from SARS coronavirus (SARS-CoV) and five additional coronaviruses was constructed. These microarrays were used to screen approximately 400 Canadian sera from the SARS outbreak, including samples from confirmed SARS-CoV cases, respiratory illness patients, and healthcare professionals. A computer algorithm that uses multiple classifiers to predict samples from SARS patients was developed and used to predict 206 sera from Chinese fever patients. The test assigned patients into two distinct groups: those with antibodies to SARS-CoV and those without. The microarray also identified patients with sera reactive against other coronavirus proteins. Our results correlated well with an indirect immunofluorescence test and demonstrated that viral infection can be monitored for many months after infection. We show that protein microarrays can serve as a rapid, sensitive, and simple tool for large-scale identification of viral-specific antibodies in sera.
Affiliation(s)
- Heng Zhu
- Departments of *Molecular, Cellular, and Developmental Biology and
- Biochip Platform Division, Beijing Genomics Institute, Chinese Academy of Sciences, Beijing 101300, China
| | - Shaohui Hu
- Biochip Platform Division, Beijing Genomics Institute, Chinese Academy of Sciences, Beijing 101300, China
| | - Ghil Jona
- Departments of *Molecular, Cellular, and Developmental Biology and
| | - Xiaowei Zhu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520
| | - Nate Kreiswirth
- Department of Microbiology, Mount Sinai Hospital, Toronto, ON, Canada M5G 1X5; and
| | - Barbara M. Willey
- Department of Microbiology, Mount Sinai Hospital, Toronto, ON, Canada M5G 1X5; and
| | - Tony Mazzulli
- Department of Microbiology, Mount Sinai Hospital, Toronto, ON, Canada M5G 1X5; and
| | - Guozhen Liu
- Biochip Platform Division, Beijing Genomics Institute, Chinese Academy of Sciences, Beijing 101300, China
- **College of Life Sciences, Agricultural University of Hebei, Hebei, Baoding 071001, China
| | - Qifeng Song
- Biochip Platform Division, Beijing Genomics Institute, Chinese Academy of Sciences, Beijing 101300, China
| | - Peng Chen
- Biochip Platform Division, Beijing Genomics Institute, Chinese Academy of Sciences, Beijing 101300, China
| | - Mark Cameron
- Department of Microbiology, Mount Sinai Hospital, Toronto, ON, Canada M5G 1X5; and
| | - Andrea Tyler
- Department of Microbiology, Mount Sinai Hospital, Toronto, ON, Canada M5G 1X5; and
| | - Jian Wang
- Biochip Platform Division, Beijing Genomics Institute, Chinese Academy of Sciences, Beijing 101300, China
| | - Jie Wen
- Biochip Platform Division, Beijing Genomics Institute, Chinese Academy of Sciences, Beijing 101300, China
| | - Weijun Chen
- Biochip Platform Division, Beijing Genomics Institute, Chinese Academy of Sciences, Beijing 101300, China
| | | | - Michael Snyder
- Departments of *Molecular, Cellular, and Developmental Biology and
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520
- To whom correspondence should be addressed. E-mail:
|
974
|
Chu F, Wang L. Applications of support vector machines to cancer classification with microarray data. Int J Neural Syst 2006; 15:475-84. [PMID: 16385636 DOI: 10.1142/s0129065705000396] [Citation(s) in RCA: 138] [Impact Index Per Article: 7.3] [Indexed: 11/18/2022]
Abstract
Microarray gene expression data usually have a large number of dimensions, e.g., over ten thousand genes, and a small number of samples, e.g., a few tens of patients. In this paper, we use the support vector machine (SVM) for cancer classification with microarray data. Dimensionality reduction methods, such as principal components analysis (PCA), class-separability measure, Fisher ratio, and t-test, are used for gene selection. A voting scheme is then employed to do multi-group classification by k(k - 1) binary SVMs. We are able to obtain the same classification accuracy with far fewer features than other published results.
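To make the t-test style of gene selection mentioned above concrete, here is a minimal sketch that ranks genes by the absolute Welch t statistic between two classes and keeps the top few. The choice of Welch's variant, the function names, and the tiny expression matrix are assumptions for illustration only; the abstract does not specify which t-test or ranking rule the authors used.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic for two lists of expression values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_genes(expr, labels, top):
    """Return indices of the `top` genes with the largest |t| between two classes.

    expr[s][g] is the expression of gene g in sample s; labels[s] is 0 or 1.
    """
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, lab in zip(expr, labels) if lab == 0]
        b = [row[g] for row, lab in zip(expr, labels) if lab == 1]
        scores.append((abs(welch_t(a, b)), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:top]]

# Hypothetical data: gene 0 separates the classes, gene 1 does not, gene 2 only weakly.
expr = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 2.9],
    [0.9, 4.9, 2.1],
    [3.0, 5.0, 2.8],
    [3.1, 5.2, 2.2],
    [2.9, 4.8, 3.0],
]
labels = [0, 0, 0, 1, 1, 1]
selected = rank_genes(expr, labels, top=1)
```

The selected gene indices would then define the reduced feature set passed to the classifier. A gene with zero variance in both classes would need special handling, which this sketch omits.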
Affiliation(s)
- Feng Chu
- School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Avenue, Singapore 639798
|
975
|
Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, Robles V. Machine learning in bioinformatics. Brief Bioinform 2006; 7:86-112. [PMID: 16761367 DOI: 10.1093/bib/bbk007] [Citation(s) in RCA: 368] [Impact Index Per Article: 19.4] [Indexed: 11/13/2022] Open
Abstract
This article reviews machine learning methods for bioinformatics. It presents modelling methods, such as supervised classification, clustering and probabilistic graphical models for knowledge discovery, as well as deterministic and stochastic heuristics for optimization. Applications in genomics, proteomics, systems biology, evolution and text mining are also shown.
Affiliation(s)
- Pedro Larrañaga
- Intelligent Systems Group, Department of Computer Science and Artificial Intelligence, University of the Basque Country, Paseo Manuel de Lardizabal, 1, 20018 San Sebastian, Spain.
|
976
|
Berrar D, Bradbury I, Dubitzky W. Avoiding model selection bias in small-sample genomic datasets. Bioinformatics 2006; 22:1245-50. [PMID: 16500931 DOI: 10.1093/bioinformatics/btl066] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.1] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION Genomic datasets generated by high-throughput technologies are typically characterized by a moderate number of samples and a large number of measurements per sample. As a consequence, classification models are commonly compared based on resampling techniques. This investigation discusses the conceptual difficulties involved in comparative classification studies. Conclusions derived from such studies are often optimistically biased, because the apparent differences in performance are usually not controlled in a statistically stringent framework taking into account the adopted sampling strategy. We investigate this problem by means of a comparison of various classifiers in the context of multiclass microarray data. RESULTS Commonly used accuracy-based performance values, with or without confidence intervals, are inadequate for comparing classifiers for small-sample data. We present a statistical methodology that avoids bias in cross-validated model selection in the context of small-sample scenarios. This methodology is valid for both k-fold cross-validation and repeated random sampling.
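The abstract does not name the statistical test it proposes. As one illustration of why the resampling strategy matters, the sketch below contrasts a naive paired t statistic with the variance-corrected resampled t statistic of Nadeau and Bengio, a known correction for repeated random train/test splits. Treating it as relevant here is an assumption, and the accuracy differences and split sizes are invented for the example.

```python
import math
from statistics import mean, variance

def corrected_resampled_t(diffs, n_train, n_test):
    """Variance-corrected resampled t statistic (Nadeau and Bengio).

    diffs[j] is the difference in accuracy between two classifiers on the
    j-th random train/test split; the statistic has len(diffs) - 1 degrees
    of freedom.
    """
    j = len(diffs)
    scale = 1.0 / j + n_test / n_train
    return mean(diffs) / math.sqrt(scale * variance(diffs))

# Hypothetical accuracy differences over 10 random splits of 60 samples (40 train, 20 test).
diffs = [0.02, 0.05, -0.01, 0.03, 0.04, 0.00, 0.02, 0.01, 0.03, 0.01]
t_naive = mean(diffs) / math.sqrt(variance(diffs) / len(diffs))
t_corrected = corrected_resampled_t(diffs, n_train=40, n_test=20)
```

Because the splits overlap, the naive statistic understates the variance; the correction shrinks the statistic, so apparent differences between classifiers are less likely to be declared significant.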
Affiliation(s)
- Daniel Berrar
- School of Biomedical Sciences, University of Ulster at Coleraine, Northern Ireland.
|
977
|
Berrar D, Bradbury I, Dubitzky W. Instance-based concept learning from multiclass DNA microarray data. BMC Bioinformatics 2006; 7:73. [PMID: 16483361 PMCID: PMC1402330 DOI: 10.1186/1471-2105-7-73] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 04/08/2005] [Accepted: 02/16/2006] [Indexed: 12/01/2022] Open
Abstract
Background Various statistical and machine learning methods have been successfully applied to the classification of DNA microarray data. Simple instance-based classifiers such as nearest neighbor (NN) approaches perform remarkably well in comparison to more complex models, and are currently experiencing a renaissance in the analysis of data sets from biology and biotechnology. While binary classification of microarray data has been extensively investigated, studies involving multiclass data are rare. The question remains open whether there exists a significant difference in performance between NN approaches and more complex multiclass methods. Comparative studies in this field commonly assess different models based on their classification accuracy only; however, this approach lacks the rigor needed to draw reliable conclusions and is inadequate for testing the null hypothesis of equal performance. Comparing novel classification models to existing approaches requires focusing on the significance of differences in performance. Results We investigated the performance of instance-based classifiers, including a NN classifier able to assign a degree of class membership to each sample. This model alleviates a major problem of conventional instance-based learners, namely the lack of confidence values for predictions. The model translates the distances to the nearest neighbors into 'confidence scores'; the higher the confidence score, the closer the considered instance is to a pre-defined class. We applied the models to three real gene expression data sets and compared them with state-of-the-art methods for classifying microarray data of multiple classes, assessing performance using a statistical significance test that took into account the data resampling strategy. Simple NN classifiers performed as well as, or significantly better than, their more intricate competitors.
Conclusion Given its highly intuitive underlying principles – simplicity, ease-of-use, and robustness – the k-NN classifier complemented by a suitable distance-weighting regime constitutes an excellent alternative to more complex models for multiclass microarray data sets. Instance-based classifiers using weighted distances are not limited to microarray data sets, but are likely to perform competitively in classifications of high-dimensional biological data sets such as those generated by high-throughput mass spectrometry.
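The idea of turning neighbour distances into per-class confidence scores can be sketched as follows. The inverse-distance weighting, the normalisation to a sum of 1, the function name, and the toy samples are all assumptions chosen for illustration; the abstract does not give the authors' actual weighting regime.

```python
import math
from collections import defaultdict

def knn_confidence(train, query, k=3, eps=1e-9):
    """Return {class: score} with scores summing to 1.

    train is a list of (feature_vector, label) pairs. Each of the k nearest
    neighbours votes with weight 1 / (distance + eps); votes are normalised.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbours:
        votes[label] += 1.0 / (math.dist(features, query) + eps)
    total = sum(votes.values())
    return {label: weight / total for label, weight in votes.items()}

# Hypothetical two-class training set in a two-gene feature space.
train = [
    ((0.0, 0.0), "A"),
    ((0.0, 1.0), "A"),
    ((4.0, 4.0), "B"),
    ((5.0, 4.0), "B"),
]
scores = knn_confidence(train, (0.0, 0.5), k=3)
predicted = max(scores, key=scores.get)
```

A query that lies close to one class receives a score near 1 for that class, while a query midway between classes receives more evenly split scores, which is the kind of graded output a plain majority vote does not provide.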
Affiliation(s)
- Daniel Berrar
- School of Biomedical Sciences, University of Ulster at Coleraine, Cromore Road, Northern Ireland, UK
| | - Ian Bradbury
- School of Biomedical Sciences, University of Ulster at Coleraine, Cromore Road, Northern Ireland, UK
| | - Werner Dubitzky
- School of Biomedical Sciences, University of Ulster at Coleraine, Cromore Road, Northern Ireland, UK
|
978
|
Want EJ, Cravatt BF, Siuzdak G. The expanding role of mass spectrometry in metabolite profiling and characterization. Chembiochem 2006; 6:1941-51. [PMID: 16206229 DOI: 10.1002/cbic.200500151] [Citation(s) in RCA: 174] [Impact Index Per Article: 9.2] [Indexed: 12/25/2022]
Abstract
Mass spectrometry has a strong history in drug-metabolite analysis and has recently emerged as the foremost technology in endogenous metabolite research. The advantages of mass spectrometry include a wide dynamic range, the ability to observe a diverse number of molecular species, and reproducible quantitative analysis. These attributes are important in addressing the issue of metabolite profiling, as the dynamic range easily exceeds nine orders of magnitude in biofluids, and the diversity of species ranges from simple amino acids to lipids to complex carbohydrates. The goals of the application of mass spectrometry range from basic biochemistry to clinical biomarker discovery with challenges in generating a comprehensive profile, data analysis, and structurally characterizing physiologically important metabolites. The precedent for this work has already been set in neonatal screening, as blood samples from millions of neonates are tested routinely by mass spectrometry as a diagnostic tool for inborn errors of metabolism. In this review, we will discuss the background from which contemporary metabolite research emerged, the techniques involved in this exciting area, and the current and future applications of this field.
Affiliation(s)
- Elizabeth J Want
- Department of Molecular Biology and The Center for Mass Spectrometry, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, California 92037, USA
|
979
|
Yang H, Crawford N, Lukes L, Finney R, Lancaster M, Hunter KW. Metastasis predictive signature profiles pre-exist in normal tissues. Clin Exp Metastasis 2006; 22:593-603. [PMID: 16475030 PMCID: PMC2048974 DOI: 10.1007/s10585-005-6244-6] [Citation(s) in RCA: 87] [Impact Index Per Article: 4.6] [Received: 11/02/2005] [Accepted: 12/29/2005] [Indexed: 11/25/2022]
Abstract
Previous studies from our laboratory have demonstrated that metastatic propensity is significantly influenced by the genetic background upon which tumors arise. We have also established that human gene expression profiles predictive of metastasis are not only present in mouse tumors with both high and low metastatic capacity, but also correlate with genetic background. These results suggest that human metastasis-predictive gene expression signatures may be significantly driven by genetic background, rather than acquired somatic mutations. To test this hypothesis, gene expression profiling was performed on inbred mouse strains with significantly different metastatic efficiencies. Analysis of previously described human metastasis signature gene expression patterns in normal tissues permitted accurate categorization of high or low metastatic mouse genotypes. Furthermore, prospective identification of animals at high risk of metastasis was achieved by using mass spectrometry to characterize salivary peptide polymorphisms in a genetically heterogeneous population. These results strongly support the role of constitutional genetic variation in modulation of metastatic efficiency and suggest that predictive signature profiles could be developed from normal tissues in humans. The ability to identify those individuals at high risk of disseminated disease at the time of clinical manifestation of a primary cancer could have a significant impact on cancer management.
Affiliation(s)
| | | | | | | | | | - Kent W. Hunter.
- *Corresponding Author: Kent W. Hunter, Laboratory of Population Genetics, CCR/NCI/NIH, Bldg 41 Rm 702, 41 Library Drive, Bethesda, MD 20892, tel: 301-435-8957, fax: 301-435-8963,
|
980
|
Boulesteix AL, Tutz G. Identification of interaction patterns and classification with applications to microarray data. Comput Stat Data Anal 2006. [DOI: 10.1016/j.csda.2004.10.004] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/29/2022]
|
981
|
Shen L, Tan EC. Reducing multiclass cancer classification to binary by output coding and SVM. Comput Biol Chem 2006; 30:63-71. [PMID: 16321568 DOI: 10.1016/j.compbiolchem.2005.10.008] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 07/20/2005] [Revised: 10/11/2005] [Accepted: 10/11/2005] [Indexed: 11/16/2022]
Abstract
Multiclass cancer classification based on microarray data is presented. The binary classifiers used combine support vector machines with a generalized output-coding scheme. Different coding strategies, decoding functions and feature selection methods are incorporated and validated on two cancer datasets: GCM and ALL. Using a random coding strategy and recursive feature elimination, the testing accuracy achieved is as high as 83% on GCM data with 14 classes. Compared with other classification methods, our method shows superior classification performance.
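A minimal sketch of output coding with a random code matrix and Hamming decoding is given below. Each column of the matrix defines one binary problem; training the binary SVMs is omitted, and the classifier outputs are simulated. The matrix constraints, the function names, the seed, and the sizes are assumptions for illustration and are not taken from the paper.

```python
import random

def random_code_matrix(n_classes, n_bits, seed=0):
    """Draw a random {0,1} code matrix with distinct rows and no constant column."""
    rng = random.Random(seed)
    while True:
        matrix = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_classes)]
        rows_distinct = len({tuple(row) for row in matrix}) == n_classes
        columns_split = all(
            0 < sum(row[b] for row in matrix) < n_classes for b in range(n_bits)
        )
        if rows_distinct and columns_split:
            return matrix

def hamming_decode(bits, matrix):
    """Return the class whose codeword is closest to `bits` in Hamming distance."""
    distances = [sum(x != y for x, y in zip(bits, row)) for row in matrix]
    return distances.index(min(distances))

codes = random_code_matrix(n_classes=4, n_bits=8)
# Simulate the case where the 8 binary classifiers reproduce class 2's codeword exactly.
decoded = hamming_decode(codes[2], codes)
```

With more bits than strictly needed to distinguish the classes, a few wrong binary outputs can still decode to the right class, which is the usual motivation for this kind of coding.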
Affiliation(s)
- Li Shen
- School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore.
|
982
|
Somiari SB, Shriver CD, Heckman C, Olsen C, Hu H, Jordan R, Arciero C, Russell S, Garguilo G, Hooke J, Somiari RI. Plasma concentration and activity of matrix metalloproteinase 2 and 9 in patients with breast disease, breast cancer and at risk of developing breast cancer. Cancer Lett 2006; 233:98-107. [PMID: 16473671 DOI: 10.1016/j.canlet.2005.03.003] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.3] [Received: 04/14/2004] [Revised: 09/15/2004] [Accepted: 03/04/2005] [Indexed: 10/25/2022]
Abstract
Matrix metalloproteinases (MMPs) are involved in extracellular matrix modification and associated with invasive and metastatic behavior of human malignant tumors. Specifically, MMP2 and MMP9 are implicated in both early and late processes of tumor development. It is reported that MMPs occur as inactive precursors, active enzymes or enzyme inhibitor complexes in biological samples. However, there is limited knowledge on the role of each form in disease and/or the significance of changes in the plasma concentration and/or activity in breast cancer patients. The aim of this study was to determine if patients with breast cancer, benign disease and at risk for developing breast cancer display characteristic levels of active and/or total MMP2 and MMP9 in plasma. Concentration and activity of MMP2 and MMP9 were determined quantitatively in the plasma of 124 female volunteers diagnosed with breast cancer (n=31), benign disease (n=38), or determined by the Gail Model to be at high risk (n=31) or low risk (controls, n=24) of developing breast cancer. Data obtained was statistically analyzed to search for differences/patterns characteristic of each category. Concentration of total MMP2 was significantly lower in control individuals than benign, high risk (P<0.001 respectively) and breast cancer patients (P=0.002). Activity of total MMP2 was significantly lower in controls compared to cancer, benign and high risk patients (P<0.001 respectively). Attempts to build a predictive/descriptive model using canonical discriminant analysis (utilizing all eight features; concentrations and activity levels of active/total MMP2 and MMP9) enabled the distinction of the controls from the high risk, benign and cancer groups. Our results suggest that preoperative plasma concentration and activity of MMP2 and MMP9 may permit sub-classification of female patients with breast disorders.
Affiliation(s)
- Stella B Somiari
- Clinical Breast Care Project, Windber Research Institute, 600 Somerset Avenue, Windber, PA 15963, USA.
|
983
|
Koo JY, Sohn I, Kim S, Lee JW. Structured polychotomous machine diagnosis of multiple cancer types using gene expression. Bioinformatics 2006; 22:950-8. [PMID: 16452113 DOI: 10.1093/bioinformatics/btl029] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The problem of class prediction has received a tremendous amount of attention in the literature recently. In the context of DNA microarrays, where the task is to classify and predict the diagnostic category of a sample on the basis of its gene expression profile, a problem of particular importance is the diagnosis of cancer type based on microarray data. One method of classification which has been very successful in cancer diagnosis is the support vector machine (SVM). The latter has been shown (through simulations) to be superior in comparison with other methods, such as classical discriminant analysis. However, SVM suffers from the drawback that the solution is implicit and is therefore difficult to interpret. In order to remedy this difficulty, an analysis of variance decomposition using structured kernels is proposed and is referred to as the structured polychotomous machine. This technique utilizes Newton-Raphson to find estimates of coefficients followed by the Rao and Wald tests, respectively, for addition and deletion of import vectors. RESULTS The proposed method is applied to microarray data and simulation data. The major breakthrough of our method is efficiency in that only a minimal number of genes that accurately predict the classes are selected. It has been verified that the selected genes serve as legitimate markers for cancer classification from a biological point of view. AVAILABILITY All source codes used are available on request from the authors.
Affiliation(s)
- Ja-Yong Koo
- Department of Statistics, Korea University, Seoul 136-701, Korea.
|
984
|
Chakraborty S, Ghosh M, Maiti T, Tewari A. Bayesian neural networks for bivariate binary data: an application to prostate cancer study. Stat Med 2006; 24:3645-62. [PMID: 16138362 DOI: 10.1002/sim.2214] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/08/2022]
Abstract
Prostate cancer is one of the most common cancers in American men. The cancer could either be locally confined, or it could spread outside the organ. When locally confined, there are several options for treating and curing this disease. Otherwise, surgery is the only option, and in extreme cases of outside spread, it could very easily recur within a short time even after surgery and subsequent radiation therapy. Hence, it is important to know, based on pre-surgery biopsy results, how likely the cancer is to be organ-confined. The paper considers a hierarchical Bayesian neural network approach for posterior prediction probabilities of certain features indicative of non-organ confined prostate cancer. In particular, we find such probabilities for margin positivity (MP) and seminal vesicle (SV) positivity jointly. The available training set consists of bivariate binary outcomes indicating the presence or absence of the two. In addition, we have certain covariates such as prostate specific antigen (PSA), Gleason score and the indicator for the cancer to be unilateral or bilateral (i.e. spread on one or both sides) in one data set and gene expression microarrays in another data set. We take a hierarchical Bayesian neural network approach to find the posterior prediction probabilities for a test and validation set, and compare these with the actual outcomes for the first data set. In case of the microarray data we use leave-one-out cross-validation to assess the accuracy of our method. We also demonstrate the superiority of our method to the other competing methods through a simulation study. The Bayesian procedure is implemented by an application of the Markov chain Monte Carlo numerical integration technique.
For the problem at hand, our Bayesian bivariate neural network procedure is shown to be superior to the classical neural network, Radford Neal's Bayesian neural network as well as bivariate logistic models to predict jointly the MP and SV in a patient in both the data sets as well as in the simulation study.
Affiliation(s)
- Sounak Chakraborty
- Department of Statistics, University of Florida, 103 Griffin/Floyd Hall, Gainesville, FL 32611-8545, USA.
|
985
|
Abstract
The genome era provides two sources of knowledge to investigators whose goal is to discover new cancer therapies: first, information on the 20,000 to 40,000 genes that comprise the human genome, the proteins they encode, and the variation in these genes and proteins in human populations that place individuals at risk or that occur in disease; second, genome-wide analysis of cancer cells and tissues leads to the identification of new drug targets and the design of new therapeutic interventions. Using genome resources requires the storage and analysis of large amounts of diverse information on genetic variation, gene and protein functions, and interactions in regulatory processes and biochemical pathways. Cancer bioinformatics deals with organizing and analyzing the data so that important trends and patterns can be identified. Specific gene and protein targets on which cancer cells depend can be identified. Therapeutic agents directed against these targets can then be developed and evaluated. Finally, molecular and genetic variation within a population may become the basis of individualized treatment.
Affiliation(s)
- David W Mount
- Arizona Cancer Center, University of Arizona, 1515 North Campbell Avenue, P.O. Box 245024, Tucson, AZ 85724-5024, USA.
|
986
|
Tjaden B. An approach for clustering gene expression data with error information. BMC Bioinformatics 2006; 7:17. [PMID: 16409635 PMCID: PMC1360687 DOI: 10.1186/1471-2105-7-17] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 04/12/2005] [Accepted: 01/12/2006] [Indexed: 11/22/2022] Open
Abstract
Background Clustering of gene expression patterns is a well-studied technique for elucidating trends across large numbers of transcripts and for identifying likely co-regulated genes. Even the best clustering methods, however, are unlikely to provide meaningful results if too much of the data is unreliable. With the maturation of microarray technology, a wealth of research on statistical analysis of gene expression data has encouraged researchers to consider error and uncertainty in their microarray experiments, so that experiments are being performed increasingly with repeat spots per gene per chip and with repeat experiments. One of the challenges is to incorporate the measurement error information into downstream analyses of gene expression data, such as traditional clustering techniques. Results In this study, a clustering approach is presented which incorporates both gene expression values and error information about the expression measurements. Using repeat expression measurements, the error of each gene expression measurement in each experiment condition is estimated, and this measurement error information is incorporated directly into the clustering algorithm. The algorithm, CORE (Clustering Of Repeat Expression data), is presented and its performance is validated using statistical measures. By using error information about gene expression measurements, the clustering approach is less sensitive to noise in the underlying data and it is able to achieve more accurate clusterings. Results are described for both synthetic expression data as well as real gene expression data from Escherichia coli and Saccharomyces cerevisiae. Conclusion The additional information provided by replicate gene expression measurements is a valuable asset in effective clustering. 
Gene expression profiles with high errors, as determined from repeat measurements, may be unreliable and may associate with different clusters, whereas gene expression profiles with low errors can be clustered with higher specificity. Results indicate that including error information from repeat gene expression measurements can lead to significant improvements in clustering accuracy.
Affiliation(s)
- Brian Tjaden
- Computer Science Department, Wellesley College, Wellesley, MA 02481, USA.
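The error-estimation step described in the abstract above (estimating the error of each expression measurement from repeat measurements, then letting that error inform the clustering) can be illustrated with a minimal sketch. This is not the CORE algorithm itself, whose details the abstract does not give. The function names, the use of the standard error of the mean as the error estimate, and the inverse-error weighting of the distance are all assumptions made for illustration.

```python
import math
import statistics


def summarize_repeats(repeats):
    """Return per-condition means and standard errors for one gene.

    `repeats` is a list over experimental conditions; each element is a
    list of at least two repeat measurements (hypothetical input layout).
    """
    means = [statistics.mean(r) for r in repeats]
    # Standard error of the mean: sample standard deviation / sqrt(n).
    errors = [statistics.stdev(r) / math.sqrt(len(r)) for r in repeats]
    return means, errors


def error_weighted_distance(means, errors, centroid, floor=1e-6):
    """Distance from a gene profile to a cluster centroid, with each
    condition down-weighted by its estimated error (an assumed weighting).

    `floor` guards against division by zero when repeats agree exactly.
    """
    total = 0.0
    for m, e, c in zip(means, errors, centroid):
        total += (m - c) ** 2 / max(e, floor) ** 2
    return math.sqrt(total)


# Example: one gene measured twice in each of two conditions.
means, errors = summarize_repeats([[1.0, 3.0], [2.0, 2.0]])
distance = error_weighted_distance(means, errors, centroid=[3.0, 2.0])
```

With this weighting, a deviation in a noisy condition counts for less than the same deviation in a precisely measured one, which is one simple way to make a distance-based clustering less sensitive to unreliable measurements.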
|
987
|
Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006; 7:3. [PMID: 16398926 PMCID: PMC1363357 DOI: 10.1186/1471-2105-7-3] [Citation(s) in RCA: 1180] [Impact Index Per Article: 62.1] [Received: 07/08/2005] [Accepted: 01/06/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection. RESULTS We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy. CONCLUSION Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
Affiliation(s)
- Ramón Díaz-Uriarte
- Bioinformatics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernández Almagro 3, Madrid, 28029, Spain
- Sara Alvarez de Andrés
- Cytogenetics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernández Almagro 3, Madrid, 28029, Spain
|
988
|
Hu P, Greenwood CM, Beyene J. Integrative Analysis of Gene Expression Data Including an Assessment of Pathway Enrichment for Predicting Prostate Cancer. Cancer Inform 2006. [DOI: 10.1177/117693510600200018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Background Microarray technology has been previously used to identify genes that are differentially expressed between tumour and normal samples in a single study, as well as in syntheses involving multiple studies. When integrating results from several Affymetrix microarray datasets, previous studies summarized probeset-level data, which may potentially lead to a loss of information available at the probe-level. In this paper, we present an approach for integrating results across studies while taking probe-level data into account. Additionally, we follow a new direction in the analysis of microarray expression data, namely to focus on the variation of expression phenotypes in predefined gene sets, such as pathways. This targeted approach can be helpful for revealing information that is not easily visible from the changes in the individual genes. Results We used a recently developed method to integrate Affymetrix expression data across studies. The idea is based on a probe-level based test statistic developed for testing for differentially expressed genes in individual studies. We incorporated this test statistic into a classic random-effects model for integrating data across studies. Subsequently, we used a gene set enrichment test to evaluate the significance of enriched biological pathways in the differentially expressed genes identified from the integrative analysis. We compared statistical and biological significance of the prognostic gene expression signatures and pathways identified in the probe-level model (PLM) with those in the probeset-level model (PSLM). Our integrative analysis of Affymetrix microarray data from 110 prostate cancer samples obtained from three studies reveals thousands of genes significantly correlated with tumour cell differentiation. The bioinformatics analysis, mapping these genes to the publicly available KEGG database, reveals evidence that tumour cell differentiation is significantly associated with many biological pathways. 
In particular, we observed that by integrating information from the insulin signalling pathway into our prediction model, we achieved better prediction of prostate cancer. Conclusions Our data integration methodology provides an efficient way to identify biologically sound and statistically significant pathways from gene expression data. The significant gene expression phenotypes identified in our study have the potential to characterize complex genetic alterations in prostate cancer.
Affiliation(s)
- Pingzhao Hu
- The Hospital for Sick Children Research Institute, 15-706 TMDT, 101 College Street, Toronto, ON, M5G 1L7, Canada
- Celia M.T. Greenwood
- The Hospital for Sick Children Research Institute, 15-706 TMDT, 101 College Street, Toronto, ON, M5G 1L7, Canada
- Department of Public Health Sciences, University of Toronto, Health Sciences Building, 155 College St, Toronto, ON, M5T 3M7, Canada
- Joseph Beyene
- Department of Public Health Sciences, University of Toronto, Health Sciences Building, 155 College St, Toronto, ON, M5T 3M7, Canada
- Child Health Evaluative Sciences, The Hospital for Sick Children Research Institute, 555 University Ave, Toronto, ON, M5G 1X8, Canada
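The abstract above describes feeding a per-study test statistic into "a classic random-effects model" to integrate results across studies. As a minimal sketch of what such pooling can look like for a single gene, the following uses the DerSimonian-Laird moment estimator of between-study variance. The choice of that estimator, the function name, and the input layout (one effect estimate and one within-study variance per study) are assumptions for illustration; the abstract does not state which estimator the authors used.

```python
import math


def random_effects_pool(effects, variances):
    """Pool per-study effect estimates for one gene.

    effects   : effect estimate from each study (hypothetical input)
    variances : within-study variance of each estimate
    Returns (pooled_effect, standard_error, tau2), where tau2 is the
    DerSimonian-Laird estimate of between-study variance.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q measures heterogeneity around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add tau2 to every within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)
    return pooled, se, tau2


# Example: two studies that disagree, each with unit variance.
pooled, se, tau2 = random_effects_pool([0.0, 2.0], [1.0, 1.0])
z = pooled / se  # z-score that could be used to rank genes
```

When the studies agree, `tau2` shrinks to zero and the result matches the fixed-effect (inverse-variance) average; when they disagree, the extra variance widens the standard error of the pooled effect.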
|
989
|
Abstract
Random Forests is a powerful multipurpose tool for predicting and understanding data. If gene expression data come from known groups or classes (e.g., tumor patients and controls), Random Forests can rank the genes in terms of their usefulness in separating the groups. When the groups are unknown, Random Forests uses an intrinsic measure of the similarity of the genes to extract useful multivariate structure, including clusters. This chapter summarizes the Random Forests methodology and illustrates its use on freely available data sets.
Affiliation(s)
- Adele Cutler
- Department of Mathematics and Statistics, Utah State University, Logan, UT, USA
|
990
|
Abstract
Twenty years ago, drug discovery was a somewhat plodding and scholastic endeavor; those days are gone. The intellectual challenges are greater than ever but the pace has changed. Although there are greater opportunities for therapeutic targets than ever before, the costs and risks are great and the increasingly competitive environment makes the pace of pharmaceutical drug hunting range from exciting to overwhelming. These changes are catalyzed by major changes to drug discovery processes through application of rapid parallel synthesis of large chemical libraries and high-throughput screening. These techniques result in huge volumes of data for use in decision making. Besides the size and complex nature of biological and chemical data sets and the many sources of data “noise”, the needs of business produce many, often conflicting, decision criteria and constraints such as time, cost, and patent caveats. The drive is still to find potent and selective molecules but, in recent years, key aspects of drug discovery are being shifted to earlier in the process. Discovery scientists are now concerned with building molecules that have good stability but also reasonable properties of absorption into the bloodstream, distribution and binding to tissues, metabolism and excretion, low toxicity, and reasonable cost of production. These requirements result in a high-dimensional decision problem with conflicting criteria and limited resources. An overview of the broad range of issues and activities involved in pharmaceutical screening is given along with references for further reading.
|
991
|
Mao Y, Zhou X, Yin Z, Pi D, Sun Y, Wong STC. Gene Selection Using Gaussian Kernel Support Vector Machine Based Recursive Feature Elimination with Adaptive Kernel Width Strategy. Rough Sets and Knowledge Technology 2006. [DOI: 10.1007/11795131_116] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 12/31/2022]
|
992
|
Stolf BS, Santos MMS, Simao DF, Diaz JP, Cristo EB, Hirata R, Curado MP, Neves EJ, Kowalski LP, Carvalho AF. Class distinction between follicular adenomas and follicular carcinomas of the thyroid gland on the basis of their signature expression. Cancer 2006; 106:1891-900. [PMID: 16565969 DOI: 10.1002/cncr.21826] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022]
Abstract
BACKGROUND Nodules of the thyroid gland are observed frequently in patients who undergo ultrasound studies. The majority of these nodules are benign, corresponding to goiters or adenomas, and only a small fraction corresponds to carcinomas. Among thyroid tumors, the diagnosis of follicular adenocarcinomas by preoperative fine-needle aspiration biopsy is a major challenge, because it requires inspection of the entire capsule to differentiate it from adenoma. Consequently, large numbers of patients undergo unnecessary thyroidectomy. METHODS Using data from gene expression analysis, the authors applied Fisher linear discriminant analysis and searched for expression signatures of individual samples of adenomas and follicular carcinomas that could be used as molecular classifiers for the precise classification of malignant and nonmalignant lesions. RESULTS Fourteen trios of genes were described that fulfilled the criteria for the correct classification of 100% of samples. The robustness of these trios was verified by using leave-1-out cross-validation and bootstrap analyses. The results demonstrated that, by combining trios, better classifiers could be generated that correctly classified >92% of samples. CONCLUSIONS The strategy of classifiers based on individual signatures was a useful strategy for distinguishing between samples with very similar expression profiles.
|
993
|
Shieh GS, Jiang Y, Shih YS. Comparison of Support Vector Machines to Other Classifiers Using Gene Expression Data. COMMUN STAT-SIMUL C 2006. [DOI: 10.1080/03610910500416215] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/25/2022]
|
994
|
Maggioli J, Hoover A, Weng L. Toxicogenomic analysis methods for predictive toxicology. J Pharmacol Toxicol Methods 2006; 53:31-7. [PMID: 16236530 DOI: 10.1016/j.vascn.2005.05.006] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.1] [Received: 05/20/2005] [Accepted: 05/23/2005] [Indexed: 12/26/2022]
Abstract
Toxicogenomics, the application of genomic data to elucidate or predict an organism's response to a toxicant, can inform the drug development process in important ways. It is apparent that standardized approaches to many types of toxicogenomic questions are still being formulated. Specifically, a significant body of proof of principle studies has emerged that demonstrates a range of statistical methodologies applied to predictive toxicology. These studies rely on class prediction methods--mathematical models generated using the gene expression profiles of known toxins from representative toxicological classes--to predict the toxicological effect of a compound based on the similarities between its gene expression profile and the profiles of a given toxicological class. Class prediction methods hold promise for increasing the rate at which compounds can be evaluated for toxicity early in the drug discovery process, while at the same time reducing the length of toxicological studies and their associated costs. Class prediction methods are informed by class comparison and class discovery steps, which inform, respectively, the selection of genes whose response can be used to distinguish among the toxicological classes and the number of classes distinguishable using the response of these genes. Together these steps use a variety of complementary statistical techniques to achieve a successful class prediction model. This report attempts to review some of the themes that appear to be emerging in the application of these techniques to predictive toxicology methods over toxicogenomics' short history.
Affiliation(s)
- Jeff Maggioli
- Rosetta Biosoftware, 401 Terry Avenue, North Seattle, WA 98109, USA.
|
995
|
Leek JT, Monsen E, Dabney AR, Storey JD. EDGE: extraction and analysis of differential gene expression. Bioinformatics 2005; 22:507-8. [PMID: 16357033 DOI: 10.1093/bioinformatics/btk005] [Citation(s) in RCA: 199] [Impact Index Per Article: 10.0] [Indexed: 11/15/2022]
Abstract
EDGE (Extraction of Differential Gene Expression) is an open source, point-and-click software program for the significance analysis of DNA microarray experiments. EDGE can perform both standard and time course differential expression analysis. The functions are based on newly developed statistical theory and methods. This document introduces the EDGE software package.
Affiliation(s)
- Jeffrey T Leek
- Department of Biostatistics, University of Washington, Seattle 98195, USA.
|
996
|
Chen JJ, Tsai CA, Young JF, Kodell RL. Classification ensembles for unbalanced class sizes in predictive toxicology. SAR and QSAR in Environmental Research 2005; 16:517-29. [PMID: 16428129 DOI: 10.1080/10659360500468468] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.8] [Indexed: 05/06/2023]
Abstract
This paper investigates the effects of the ratio of positive-to-negative samples on the sensitivity, specificity, and concordance. When the class sizes in the training samples are not equal, the classification rule derived will favor the majority class and result in a low sensitivity on the minority class prediction. We propose an ensemble classification approach to adjust for differential class sizes in a binary classifier system. An ensemble classifier consists of a set of base classifiers; its prediction rule is based on a summary measure of individual classifications by the base classifiers. Two re-sampling methods, augmentation and abatement, are proposed to generate different bootstrap samples of equal class size to build the base classifiers. The augmentation method balances the two class sizes by bootstrapping additional samples from the minority class, whereas the abatement method balances the two class sizes by sampling only a subset of samples from the majority class. The proposed procedure is applied to a data set to predict estrogen receptor binding activity and to a data set to predict animal liver carcinogenicity using SAR (structure-activity relationship) models as base classifiers. The abatement method appears to perform well in balancing sensitivity and specificity.
Affiliation(s)
- J J Chen
- Division of Biometry and Risk Assessment, National Center for Toxicological Research, Food and Drug Administration, Jefferson, Arkansas 72079, USA.
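The two re-sampling schemes named in the abstract above, augmentation and abatement, can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' implementation: the function names and the list-based sample representation are hypothetical, and the majority vote stands in for the unspecified "summary measure" that combines the base classifiers.

```python
import random


def augment(pos, neg, rng):
    """Balance the classes by bootstrapping extra samples (drawn with
    replacement) from the minority class until it matches the majority."""
    target = max(len(pos), len(neg))

    def grow(samples):
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        return list(samples) + extra

    return grow(pos), grow(neg)


def abate(pos, neg, rng):
    """Balance the classes by keeping only a random subset of the
    majority class, the same size as the minority class."""
    target = min(len(pos), len(neg))
    return rng.sample(pos, target), rng.sample(neg, target)


def majority_vote(votes):
    """Combine base-classifier votes (1 = positive, 0 = negative).
    Majority voting is an assumed summary measure; ties go to negative."""
    return 1 if 2 * sum(votes) > len(votes) else 0


# Example: 3 positive and 10 negative training samples.
rng = random.Random(0)
pos = ["p1", "p2", "p3"]
neg = ["n%d" % i for i in range(10)]
aug_pos, aug_neg = augment(pos, neg, rng)  # 10 positives, 10 negatives
aba_pos, aba_neg = abate(pos, neg, rng)    # 3 positives, 3 negatives
```

Calling either function repeatedly with the same `rng` yields a different balanced training set each time; one base classifier would be trained per set and the ensemble prediction obtained from `majority_vote`.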
|
997
|
Lee EK, Cook D, Klinke S, Lumley T. Projection Pursuit for Exploratory Supervised Classification. J Comput Graph Stat 2005. [DOI: 10.1198/106186005x77702] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.4] [Indexed: 11/21/2022]
|
998
|
Brabender J, Marjoram P, Lord RVN, Metzger R, Salonga D, Vallböhmer D, Schäfer H, Danenberg KD, Danenberg PV, Selaru FM, Baldus SE, Hölscher AH, Meltzer SJ, Schneider PM. The molecular signature of normal squamous esophageal epithelium identifies the presence of a field effect and can discriminate between patients with Barrett's esophagus and patients with Barrett's-associated adenocarcinoma. Cancer Epidemiol Biomarkers Prev 2005; 14:2113-7. [PMID: 16172218 DOI: 10.1158/1055-9965.epi-05-0014] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.5] [Indexed: 12/11/2022]
Abstract
BACKGROUND AND AIM Genetic alterations in the normal tissues surrounding various cancers have been described, but a comprehensive analysis of this carcinogenic field effect in Barrett's-associated adenocarcinoma of the esophagus disease has not been reported. The aim of this study was to analyze the gene expression profile of a panel of highly selected genes in the normal squamous esophagus epithelium of patients with Barrett's esophagus, patients with Barrett's-associated adenocarcinoma, and a healthy control group to define the existence of a carcinogenic field effect, and to investigate the clinical importance of such a field effect in the management of Barrett's disease. METHODS Forty-nine histologically normal squamous esophageal epithelia collected from 19 patients with Barrett's esophagus, 20 patients with Barrett's-associated esophageal adenocarcinoma, and a healthy control group of 10 patients were studied. A quantitative real-time reverse transcription-PCR method (TaqMan) was used to measure the expression of a panel of genes with known associations with gastrointestinal carcinogenesis. RESULTS A widespread carcinogenic field effect was detected for more than 50% of the genes analyzed including Bax, BFT, CDX2, COX2, DAPK, DNMT1, GSTP1, RARalpha, RARgamma, RXRalpha, RXRbeta, SPARC, TSPAN, and VEGF. Based on the expression signature of the normal appearing squamous esophagus, a linear discriminant analysis was able to distinguish between the three groups of patients with an error rate of 0%. CONCLUSION This study provides the first comprehensive investigation of a carcinogenic field effect in Barrett's esophagus disease. Based on the gene expression signature of the normal esophagus, patients could be correctly characterized according to their pathologic classification by applying a linear discriminant analysis.
Our results provide evidence that a molecular classification might have clinical importance for the diagnosis and treatment of patients with Barrett's esophagus disease.
Affiliation(s)
- Jan Brabender
- Department of Visceral and Vascular Surgery, University of Cologne, Germany.
|
999
|
Curtin JA, Fridlyand J, Kageshita T, Patel HN, Busam KJ, Kutzner H, Cho KH, Aiba S, Bröcker EB, LeBoit PE, Pinkel D, Bastian BC. Distinct sets of genetic alterations in melanoma. N Engl J Med 2005; 353:2135-47. [PMID: 16291983 DOI: 10.1056/nejmoa050092] [Citation(s) in RCA: 1972] [Impact Index Per Article: 98.6] [Indexed: 11/19/2022]
Abstract
BACKGROUND Exposure to ultraviolet light is a major causative factor in melanoma, although the relationship between risk and exposure is complex. We hypothesized that the clinical heterogeneity is explained by genetically distinct types of melanoma with different susceptibility to ultraviolet light. METHODS We compared genome-wide alterations in the number of copies of DNA and mutational status of BRAF and N-RAS in 126 melanomas from four groups in which the degree of exposure to ultraviolet light differs: 30 melanomas from skin with chronic sun-induced damage and 40 melanomas from skin without such damage; 36 melanomas from palms, soles, and subungual (acral) sites; and 20 mucosal melanomas. RESULTS We found significant differences in the frequencies of regional changes in the number of copies of DNA and mutation frequencies in BRAF among the four groups of melanomas. Samples could be correctly classified into the four groups with 70 percent accuracy on the basis of the changes in the number of copies of genomic DNA. In two-way comparisons, melanomas arising on skin with signs of chronic sun-induced damage and skin without such signs could be correctly classified with 84 percent accuracy. Acral melanoma could be distinguished from mucosal melanoma with 89 percent accuracy. Eighty-one percent of melanomas on skin without chronic sun-induced damage had mutations in BRAF or N-RAS; the majority of melanomas in the other groups had mutations in neither gene. Melanomas with wild-type BRAF or N-RAS frequently had increases in the number of copies of the genes for cyclin-dependent kinase 4 (CDK4) and cyclin D1 (CCND1), downstream components of the RAS-BRAF pathway. CONCLUSIONS The genetic alterations identified in melanomas at different sites and with different levels of sun exposure indicate that there are distinct genetic pathways in the development of melanoma and implicate CDK4 and CCND1 as independent oncogenes in melanomas without mutations in BRAF or N-RAS.
Affiliation(s)
- John A Curtin
- Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA 94143-0808, USA
|
1000
|
Abstract
MOTIVATION The classification of few tissue samples on a very large number of genes represents a non-standard problem in statistics but a usual one in microarray expression data analysis. In fact, the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. We consider high-density oligonucleotide microarray data, where the expression level is associated with an 'absolute call', which represents a qualitative indication of whether or not a transcript is detected within a sample. The 'absolute call' is generally not taken into consideration in analyses. RESULTS In contrast to frequently used cluster analysis methods to analyze gene expression data, we consider the problem of classifying tissues and selecting variables. We adopted methodologies formulated by Ghahramani and Hinton and by Rocci and Vichi for simultaneous dimensional reduction of genes and classification of tissues, trying to identify genes (denominated 'markers') that are able to distinguish between two known different classes of tissue samples. In this respect, we propose a generalization of the approach proposed by McLachlan et al.: for each gene considered individually, we estimate the distribution of the log LR statistic for testing the one- versus two-component hypothesis in the mixture model, using a parametric bootstrap approach. We compare conditional (on 'absolute call') and unconditional analyses performed on the dataset described in Golub et al. We show that the proposed techniques improve the results of classification of tissue samples with respect to known results on the same benchmark dataset. AVAILABILITY The software of Ghahramani and Hinton is written in Matlab and available in 'Mixture of Factor Analyzers' on http://www.gatsby.ucl.ac.uk/~zoubin/software.html while the software of Rocci and Vichi is available upon request from the authors.
Affiliation(s)
- Francesca Martella
- Dipartimento di Statistica, Probabilità e Statistiche Applicate, Università degli Studi di Roma "La Sapienza", P.le A. Moro, 5-00185 Rome, Italy.
|