701
|
|
702
|
Zhang C, Wu C, Blanzieri E, Zhou Y, Wang Y, Du W, Liang Y. Methods for labeling error detection in microarrays based on the effect of data perturbation on the regression model. Bioinformatics 2009; 25:2708-14. [DOI: 10.1093/bioinformatics/btp478] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022]
|
703
|
The derivation of diagnostic markers of chronic myeloid leukemia progression from microarray data. Blood 2009; 114:3292-8. [PMID: 19654405 DOI: 10.1182/blood-2009-03-212969] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.0] [Indexed: 01/03/2023]
Abstract
Currently, limited molecular markers exist that can determine where in the spectrum of chronic myeloid leukemia (CML) progression an individual patient falls at diagnosis. Gene expression profiles can predict disease and prognosis, but most widely used microarray analytical methods yield lengthy gene candidate lists that are difficult to apply clinically. Consequently, we applied a probabilistic method called Bayesian model averaging (BMA) to a large CML microarray dataset. BMA, a supervised method, considers multiple genes simultaneously and identifies small gene sets. BMA identified 6 genes (NOB1, DDX47, IGSF2, LTB4R, SCARB1, and SLC25A3) that discriminated chronic phase (CP) from blast crisis (BC) CML. In CML, phase labels divide disease progression into discrete states. BMA, however, produces posterior probabilities between 0 and 1 and predicts patients in "intermediate" stages. In validation studies of 88 patients, the 6-gene signature discriminated early CP from late CP, accelerated phase, and BC. This distinction between early and late CP is not possible with current classifications, which are based on known duration of disease. BMA is a powerful tool for developing diagnostic tests from microarray data. Because therapeutic outcomes are so closely tied to disease phase, these probabilities can be used to determine a risk-based treatment strategy at diagnosis.
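The posterior probabilities mentioned in this abstract follow the standard Bayesian model averaging identity, written out here as a reminder. The symbols ($M_k$ for the candidate gene-set models, $D$ for the training data, $x$ for one patient's expression profile) are notation chosen for this note and are not taken from the paper.

```latex
\Pr(y = \mathrm{BC} \mid x, D)
  \;=\; \sum_{k=1}^{K} \Pr(y = \mathrm{BC} \mid x, M_k, D)\,\Pr(M_k \mid D)
```

Because the result is a weighted average of per-model probabilities, it lies between 0 and 1, which is why a patient can receive an intermediate value rather than a hard CP or BC label.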
|
704
|
Nudurupati SV, Abebe A. A nonparametric allocation scheme for classification based on transvariation probabilities. J Stat Comput Simul 2009. [DOI: 10.1080/00949650802074472] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/21/2022]
|
705
|
A flexible approximate likelihood ratio test for detecting differential expression in microarray data. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2009.03.022] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
|
706
|
Tang LJ, Du W, Fu HY, Jiang JH, Wu HL, Shen GL, Yu RQ. New Variable Selection Method Using Interval Segmentation Purity with Application to Blockwise Kernel Transform Support Vector Machine Classification of High-Dimensional Microarray Data. J Chem Inf Model 2009; 49:2002-9. [DOI: 10.1021/ci900032q] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
Affiliation(s)
- Li-Juan Tang
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Wen Du
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Hai-Yan Fu
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Jian-Hui Jiang
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Hai-Long Wu
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Guo-Li Shen
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Ru-Qin Yu
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
|
707
|
De Mol C, Mosci S, Traskine M, Verri A. A regularized method for selecting nested groups of relevant genes from microarray data. J Comput Biol 2009; 16:677-90. [PMID: 19432538 DOI: 10.1089/cmb.2008.0171] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022]
Abstract
Gene expression analysis aims at identifying the genes able to accurately predict biological parameters such as disease subtype or progression. While accurate prediction can be achieved by means of many different techniques, gene identification, due to gene correlation and the limited number of available samples, is a much more elusive problem. Small changes in the expression values often produce different gene lists, and solutions which are both sparse and stable are difficult to obtain. We propose a two-stage regularization method able to learn linear models characterized by a high prediction performance. By varying a suitable parameter, these linear models make it possible to trade sparsity for the inclusion of correlated genes and to produce gene lists which are almost perfectly nested. Experimental results on synthetic and microarray data confirm the interesting properties of the proposed method and its potential as a starting point for further biological investigations.
Affiliation(s)
- Christine De Mol
- Department of Mathematics, Université Libre de Bruxelles, Brussels, Belgium
|
708
|
Ferrer-Luna R, Mata M, Núñez L, Calvar J, Dasí F, Arias E, Piquer J, Cerdá-Nicolás M, Taratuto AL, Sevlever G, Celda B, Martinetto H. Loss of heterozygosity at 1p-19q induces a global change in oligodendroglial tumor gene expression. J Neurooncol 2009; 95:343-354. [PMID: 19597701 DOI: 10.1007/s11060-009-9944-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Received: 04/07/2009] [Accepted: 06/15/2009] [Indexed: 11/28/2022]
Abstract
Oligodendroglial tumors presenting loss of heterozygosity (LOH) at 1p and 19q have been shown to be sensitive to chemotherapy, thus making 1p-19q status testing a key aspect in oligodendroglioma diagnosis and prognosis. Twenty-nine tumor samples (19 oligodendrogliomas, 10 oligoastrocytomas) were analyzed in order to obtain a molecular profile identifying those bearing 1p-19q LOH. Other genomic anomalies usually present in gliomas, such as EGFR amplification, CDKN2A/ARF deletion, 10q LOH and TP53 mutation, were also studied. Tumors with 1p-19q LOH overexpressed genes related to neurogenesis. Genes linked to immune response, proliferation and inflammation were overexpressed in the group with intact 1p-19q; this group could in turn be further divided into two subgroups: one overexpressing genes involved in immune response and inflammation that did not show major genetic aberrations other than the TP53 mutation and EGFR trisomy in a few cases, and another overexpressing genes related to immune response and proliferation that had a predominance of samples carrying several anomalies and presenting worse outcomes. This molecular signature was validated by analyzing a set of ten tumor samples (three oligodendrogliomas, seven oligoastrocytomas); all ten samples were correctly assigned. LOH at 1p-19q results in haploinsufficiency and copy number reduction of several genes, including NOTCH2; this phenomenon produces a global change in gene expression inducing a pro-neural status that results in restrictions to cell migration and proliferation. Tumors without LOH at 1p-19q exhibit the opposite characteristics, explaining their more aggressive behavior.
Affiliation(s)
- Rubén Ferrer-Luna
- Department of Physical Chemistry, Universitat de Valencia, Dr. Moliner sn., 46100, Burjassot, Valencia, Spain
- Manuel Mata
- Research Foundation, Hospital General Universitario de Valencia, Valencia, Spain
- Lina Núñez
- Department of Neuropathology, FLENI, Montañeses 2325 (C1428AQK), Buenos Aires, Argentina
- Jorge Calvar
- Department of Neuroimaging, FLENI, Buenos Aires, Argentina
- Francisco Dasí
- Research Foundation, Hospital Clínico Universitario, Valencia, Spain
- Eugenia Arias
- Department of Neuropathology, FLENI, Montañeses 2325 (C1428AQK), Buenos Aires, Argentina
- José Piquer
- Neurosurgery Service, Hospital de la Ribera-Alzira, Valencia, Spain
- Ana Lía Taratuto
- Department of Neuropathology, FLENI, Montañeses 2325 (C1428AQK), Buenos Aires, Argentina
- Gustavo Sevlever
- Department of Neuropathology, FLENI, Montañeses 2325 (C1428AQK), Buenos Aires, Argentina
- Bernardo Celda
- Department of Physical Chemistry, Universitat de Valencia, Dr. Moliner sn., 46100, Burjassot, Valencia, Spain; CIBER BBN, ISC-III, Valencia, Spain
- Horacio Martinetto
- Department of Neuropathology, FLENI, Montañeses 2325 (C1428AQK), Buenos Aires, Argentina
|
709
|
Menze BH, Kelm BM, Masuch R, Himmelreich U, Bachert P, Petrich W, Hamprecht FA. A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 2009; 10:213. [PMID: 19591666 PMCID: PMC2724423 DOI: 10.1186/1471-2105-10-213] [Citation(s) in RCA: 444] [Impact Index Per Article: 27.8] [Received: 02/23/2009] [Accepted: 07/10/2009] [Indexed: 11/20/2022]
Abstract
Background Regularized regression methods such as principal component or partial least squares regression perform well in learning tasks on high-dimensional spectral data, but cannot explicitly eliminate irrelevant features. The random forest classifier with its associated Gini feature importance, on the other hand, allows for explicit feature elimination, but may not be optimally adapted to spectral data because its constituent classification trees are based on orthogonal splits in feature space. Results We propose to combine the best of both approaches, and evaluated the joint use of feature selection based on recursive feature elimination using the Gini importance of random forests together with regularized classification methods on spectral data sets from medical diagnostics, chemotaxonomy, biomedical analytics, food science, and synthetically modified spectral data. Here, feature selection using the Gini feature importance with regularized classification by discriminant partial least squares regression performed as well as or better than filtering according to different univariate statistical tests, or using regression coefficients in a backward feature elimination. It outperformed the direct application of the random forest classifier, or the direct application of the regularized classifiers on the full set of features. Conclusion The Gini importance of the random forest provided a superior means of measuring feature relevance on spectral data, but on an optimal subset of features the regularized classifiers might be preferable to the random forest classifier, in spite of their limitation to modeling linear dependencies only. A feature selection based on Gini importance, however, may precede a regularized linear classification to identify this optimal subset of features, and to earn the double benefit of dimensionality reduction and the elimination of noise from the classification task.
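For readers unfamiliar with the Gini importance referred to in this abstract, the sketch below shows the quantity it is built from: the decrease in Gini impurity produced by a single tree split. A feature's Gini importance is then the sum of these decreases over every node that splits on that feature, accumulated across the trees of the forest. The function names and the toy labels are illustrative assumptions, not code from the paper.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_c p_c^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(parent, left, right):
    """Impurity decrease when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Toy node with two classes, split perfectly by a hypothetical threshold.
parent = [0, 0, 1, 1]
print(gini(parent))                           # impurity before the split
print(gini_decrease(parent, [0, 0], [1, 1]))  # decrease credited to the split feature
```

In a recursive feature elimination loop, the features with the smallest accumulated decrease would be dropped and the forest refit.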
Affiliation(s)
- Bjoern H Menze
- Interdisciplinary Center for Scientific Computing (IWR), University of Heidelberg, Heidelberg, Germany.
|
710
|
Combining dissimilarities in a Hyper Reproducing Kernel Hilbert Space for complex human cancer prediction. J Biomed Biotechnol 2009; 2009:906865. [PMID: 19584909 PMCID: PMC2699662 DOI: 10.1155/2009/906865] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 01/16/2009] [Accepted: 03/24/2009] [Indexed: 11/18/2022]
Abstract
DNA microarrays provide rich profiles that are used in cancer prediction by considering gene expression levels across a collection of related samples. Support Vector Machines (SVM) have been applied to the classification of cancer samples with encouraging results. However, they rely on Euclidean distances, which fail to accurately reflect the proximities among sample profiles. Non-Euclidean dissimilarities therefore provide additional information that should be considered to reduce misclassification errors. In this paper, we incorporate a linear combination of non-Euclidean dissimilarities into the ν-SVM algorithm. The weights of the combination are learnt in a Hyper Reproducing Kernel Hilbert Space (HRKHS) using a Semidefinite Programming algorithm. This approach allows us to incorporate a smoothing term that penalizes the complexity of the family of distances and avoids overfitting. The experimental results suggest that the proposed method helps to reduce misclassification errors in several human cancer problems.
|
711
|
Ooi CH, Chetty M, Teng SW. Degree of differential prioritization: prediction for multiclass molecular classification. IEEE Eng Med Biol Mag 2009; 28:45-51. [PMID: 19622424 DOI: 10.1109/memb.2009.932917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/28/2023]
Affiliation(s)
- Chia Huey Ooi
- Duke-NUS Graduate Medical School Singapore, 2 Jalan Bukit Merah, Singapore.
|
712
|
Suzuki I, Takenouchi T, Ohira M, Oba S, Ishii S. Robust model selection for classification of microarrays. Cancer Inform 2009; 7:141-57. [PMID: 19718450 PMCID: PMC2730179 DOI: 10.4137/cin.s2704] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/05/2022]
Abstract
Recently, microarray-based cancer diagnosis systems have been increasingly investigated. However, cost reduction and reliability assurance of such diagnosis systems remain problems in real clinical settings. To reduce the cost, we need a supervised classifier involving the smallest number of genes, as long as the classifier is sufficiently reliable. To achieve a reliable classifier, we should assess candidate classifiers and select the best one. In the selection process of the best classifier, however, the assessment criterion necessarily has large variance because of the limited number of samples and non-negligible observation noise. Therefore, even if a classifier with a very small number of genes exhibited the smallest leave-one-out cross-validation (LOO) error rate, it would not necessarily be reliable because classifiers based on a small number of genes tend to show large variance. We propose a robust model selection criterion, the min-max criterion, based on a resampling bootstrap simulation to assess the variance of estimation of classification error rates. We applied our assessment framework to four published real gene expression datasets and one synthetic dataset. We found that a state-of-the-art procedure, weighted voting classifiers with LOO criterion, had a non-negligible risk of selecting extremely poor classifiers and, on the other hand, that the new min-max criterion could eliminate that risk. These findings suggest that our criterion presents a safer procedure to design a practical cancer diagnosis system.
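The abstract does not spell out the min-max criterion, so the following is only a rough sketch of one plausible reading: bootstrap each candidate classifier's per-sample error indicators, record the worst (maximum) resampled error rate, and pick the candidate whose worst case is smallest. The function names, the candidate labels and the error vectors are all hypothetical, and the paper's actual criterion may differ.

```python
import random


def bootstrap_error_rates(errors, n_boot=200, seed=0):
    """Resample 0/1 error indicators with replacement; return one error rate per replicate."""
    rng = random.Random(seed)
    n = len(errors)
    return [sum(rng.choice(errors) for _ in range(n)) / n for _ in range(n_boot)]


def min_max_select(candidates, n_boot=200, seed=0):
    """Pick the candidate whose maximum bootstrap error rate is smallest (assumed reading)."""
    worst = {
        name: max(bootstrap_error_rates(errs, n_boot, seed))
        for name, errs in candidates.items()
    }
    best = min(worst, key=worst.get)
    return best, worst


# Hypothetical per-sample errors (1 = misclassified) for two candidate classifiers.
candidates = {
    "3-gene": [0] * 17 + [1] * 3,
    "20-gene": [0] * 18 + [1] * 2,
}
best, worst = min_max_select(candidates)
print(best, worst)
```

Unlike choosing by the single LOO error rate, a rule of this kind penalizes candidates whose estimated error swings widely under resampling.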
Affiliation(s)
- Ikumi Suzuki
- Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Nara, Japan
|
713
|
Abstract
We describe a new stochastic search algorithm for linear regression models called the bounded mode stochastic search (BMSS). We make use of BMSS to perform variable selection and classification as well as to construct sparse dependency networks. Furthermore, we show how to determine genetic networks from genomewide data that involve any combination of continuous and discrete variables. We illustrate our methodology with several real-world data sets.
Affiliation(s)
- Adrian Dobra
- Department of Statistics and Department of Biobehavioral Nursing and Health Systems, University of Washington Seattle, WA 98195, USA.
|
714
|
Kumar KK, Pugalenthi G, Suganthan PN. DNA-Prot: Identification of DNA Binding Proteins from Protein Sequence Information using Random Forest. J Biomol Struct Dyn 2009; 26:679-86. [DOI: 10.1080/07391102.2009.10507281] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.4] [Indexed: 10/28/2022]
|
715
|
Van Dyke AL, Cote ML, Wenzlaff AS, Chen W, Abrams J, Land S, Giroux CN, Schwartz AG. Cytokine and cytokine receptor single-nucleotide polymorphisms predict risk for non-small cell lung cancer among women. Cancer Epidemiol Biomarkers Prev 2009; 18:1829-40. [PMID: 19505916 PMCID: PMC3771080 DOI: 10.1158/1055-9965.epi-08-0962] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.5] [Indexed: 01/16/2023]
Abstract
Studies on the relationships between inflammatory pathway genes and lung cancer risk have not included African-Americans and have only included a handful of genes. In a population-based case-control study on 198 African-American and 744 Caucasian women, we examined the association between 70 cytokine and cytokine receptor single-nucleotide polymorphisms (SNPs) and risk of non-small cell lung cancer (NSCLC). Unconditional logistic regression was used to estimate odds ratios and 95% confidence intervals in a dominant model adjusting for major risk factors for lung cancer. Separate analyses were conducted by race and by smoking history and history of chronic obstructive pulmonary disease among Caucasians. Random forest analysis was conducted by race. On logistic regression analysis, IL6 (interleukin 6), IL7R, IL15, TNF (tumor necrosis factor), and IL10 SNPs were associated with risk of non-small cell lung cancer among African-Americans; IL7R and IL10 SNPs were also associated with risk of lung cancer among Caucasians. Although random forest analysis showed IL7R and IL10 SNPs as being associated with risk for lung cancer among African-Americans, it also identified a TNFRSF10A SNP as an important predictor. On random forest analysis, an IL1A SNP was identified as an important predictor of lung cancer among Caucasian women. Inflammatory SNPs differentially predicted risk for NSCLC according to race, as well as based on smoking history and history of chronic obstructive pulmonary disease among Caucasian women. Pathway analysis results are presented. Inflammatory pathway genotypes may serve to define a high risk group; further exploration of these genes in minority populations is warranted.
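As a reminder of what an odds ratio with a 95% confidence interval measures, here is a minimal unadjusted version computed from a 2x2 table with the standard log-odds (Woolf) interval. This is a simplification: the study itself used unconditional logistic regression with covariate adjustment, which this sketch does not reproduce. The function name and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald CI from a 2x2 table.

    a: carriers among cases      b: carriers among controls
    c: non-carriers among cases  d: non-carriers among controls
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts under a dominant model (carrier = at least one variant allele).
print(odds_ratio_ci(20, 10, 10, 20))
```

An interval that excludes 1 is what would be reported as an association between the SNP and disease risk.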
Affiliation(s)
- Alison L Van Dyke
- Department of Obstetrics and Gynecology, Wayne State University School of Medicine, Detroit, Michigan, USA.
|
716
|
Abstract
Background Because screening mammography for breast cancer is less effective for premenopausal women, we investigated the feasibility of a diagnostic blood test using serum proteins. Methods This study used a set of 98 serum proteins and chose diagnostically relevant subsets via various feature-selection techniques. Because of significant noise in the data set, we applied iterated Bayesian model averaging to account for model selection uncertainty and to improve generalization performance. We assessed generalization performance using leave-one-out cross-validation (LOOCV) and receiver operating characteristic (ROC) curve analysis. Results The classifiers were able to distinguish normal tissue from breast cancer with a classification performance of AUC = 0.82 ± 0.04 with the proteins MIF, MMP-9, and MPO. The classifiers distinguished normal tissue from benign lesions similarly at AUC = 0.80 ± 0.05. However, the serum proteins of benign and malignant lesions were indistinguishable (AUC = 0.55 ± 0.06). The classification tasks of normal vs. cancer and normal vs. benign selected the same top feature: MIF, which suggests that the biomarkers indicated inflammatory response rather than cancer. Conclusion Overall, the selected serum proteins showed moderate ability for detecting lesions. However, they are probably more indicative of secondary effects such as inflammation rather than specific for malignancy.
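The evaluation scheme named in this abstract, leave-one-out cross-validation followed by ROC analysis, can be sketched as follows. The classifier here is a deliberately simple one-dimensional nearest-centroid rule standing in for the iterated Bayesian model averaging used in the study, and the protein levels and labels are made-up toy values.

```python
from statistics import mean


def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def loocv_scores(xs, ys):
    """Score each sample with centroids fitted on all the other samples."""
    scores = []
    for i in range(len(xs)):
        train = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j != i]
        pos_centroid = mean(x for x, y in train if y == 1)
        neg_centroid = mean(x for x, y in train if y == 0)
        # Higher score = closer to the positive centroid than to the negative one.
        scores.append(abs(xs[i] - neg_centroid) - abs(xs[i] - pos_centroid))
    return scores


# Hypothetical serum levels of one protein; 1 = cancer, 0 = normal.
protein_level = [1.0, 1.2, 0.9, 1.6, 2.0, 2.3, 1.1, 1.9]
has_cancer = [0, 0, 0, 0, 1, 1, 1, 1]
auc = roc_auc(loocv_scores(protein_level, has_cancer), has_cancer)
print(auc)
```

Scoring every sample with a model that never saw it is what makes the resulting AUC an estimate of generalization rather than of fit.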
|
717
|
Kreike B, Halfwerk H, Armstrong N, Bult P, Foekens JA, Veltkamp SC, Nuyten DSA, Bartelink H, van de Vijver MJ. Local recurrence after breast-conserving therapy in relation to gene expression patterns in a large series of patients. Clin Cancer Res 2009; 15:4181-90. [PMID: 19470741 DOI: 10.1158/1078-0432.ccr-08-2644] [Citation(s) in RCA: 75] [Impact Index Per Article: 4.7] [Indexed: 11/16/2022]
Abstract
PURPOSE The majority of patients with early-stage breast cancer are treated with breast-conserving therapy (BCT). Several clinical risk factors are associated with local recurrence (LR) after BCT but are unable to explain all instances of LR after BCT. Here, gene expression microarrays are used to identify novel risk factors for LR after BCT. EXPERIMENTAL DESIGN Gene expression profiles of 56 primary invasive breast carcinomas from patients who developed a LR after BCT were compared with profiles of 109 tumors from patients who did not develop a LR after BCT. Both unsupervised and supervised methods of classification were used to separate patients into groups corresponding to disease outcome. In addition, for 15 patients, the gene expression profile in the recurrence was compared with that of the primary tumor. RESULTS The two main clusters found by hierarchical cluster analysis of all 165 primary invasive breast carcinomas revealed no association with LR. Predefined gene sets (molecular subtypes and "chromosomal instability" signature) are associated with LR (P = 0.0002 and 0.003, respectively). Significance analysis of microarrays revealed an association between LR and cell proliferation, not captured by histologic grading. Class prediction analysis constructed a gene classifier, which was successfully validated, cross-platform, on an independent data set of 161 patients (log-rank P = 0.041). In multivariate analysis, young age was the only independent predictor of LR. CONCLUSIONS We have constructed and cross-platform validated a gene expression profile predictive for LR after BCT, which is characterized by genes involved in cell proliferation but not a surrogate for high histologic grade.
Affiliation(s)
- Bas Kreike
- Division of Radiation Oncology, The Netherlands Cancer Institute, Amsterdam, the Netherlands
|
718
|
Kim SY. Effects of sample size on robustness and prediction accuracy of a prognostic gene signature. BMC Bioinformatics 2009; 10:147. [PMID: 19445687 PMCID: PMC2689196 DOI: 10.1186/1471-2105-10-147] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.8] [Received: 07/04/2008] [Accepted: 05/16/2009] [Indexed: 12/22/2022]
Abstract
BACKGROUND Little overlap between independently developed gene signatures and poor inter-study applicability of gene signatures are two of the major concerns raised in the development of microarray-based prognostic gene signatures. One recent study suggested that thousands of samples are needed to generate a robust prognostic gene signature. RESULTS A data set of 1,372 samples was generated by combining eight breast cancer gene expression data sets produced using the same microarray platform and, using this data set, the effects of varying sample sizes on several performance measures of a prognostic gene signature were investigated. The overlap between independently developed gene signatures increased linearly with more samples, attaining an average overlap of 16.56% with 600 samples. The concordance between outcomes predicted by different gene signatures also increased with more samples, up to 94.61% with 300 samples. The accuracy of outcome prediction also increased with more samples. Finally, analysis using only Estrogen Receptor-positive (ER+) patients attained higher prediction accuracy than analysis using both ER+ and ER-negative patients, suggesting that sub-type specific analysis can lead to the development of better prognostic gene signatures. CONCLUSION Increasing sample sizes generated a gene signature with better stability, better concordance in outcome prediction, and better prediction accuracy. However, the degree of performance improvement with increased sample size differed between the degree of overlap and the degree of concordance in outcome prediction, suggesting that the sample size required for a study should be determined according to the specific aims of the study.
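Two of the quantities reported in this abstract, signature overlap and concordance of predicted outcomes, are easy to confuse, so a small sketch of each follows. The exact definitions used in the paper are not given in the abstract; the formulas below (shared genes relative to the smaller signature, and the share of samples given the same label) are assumptions, and the gene names are placeholders.

```python
def percent_overlap(signature_a, signature_b):
    """Percentage of genes shared by two signatures, relative to the smaller one."""
    a, b = set(signature_a), set(signature_b)
    if not a or not b:
        raise ValueError("signatures must be non-empty")
    return 100.0 * len(a & b) / min(len(a), len(b))


def percent_concordance(predictions_a, predictions_b):
    """Percentage of samples that two classifiers assign to the same outcome."""
    if len(predictions_a) != len(predictions_b) or not predictions_a:
        raise ValueError("prediction lists must be non-empty and of equal length")
    agree = sum(p == q for p, q in zip(predictions_a, predictions_b))
    return 100.0 * agree / len(predictions_a)


# Placeholder gene identifiers and outcome labels (1 = poor prognosis).
print(percent_overlap(["G1", "G2", "G3", "G4"], ["G3", "G4", "G5", "G6"]))
print(percent_concordance([1, 0, 1, 1], [1, 0, 0, 1]))
```

The example makes the abstract's point concrete: two signatures can share few genes and still agree on most patients.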
Affiliation(s)
- Seon-Young Kim
- Medical Genomics Research Center, KRIBB, Yuseong-Gu, Daejeon, Republic of Korea.
|
719
|
Is bagging effective in the classification of small-sample genomic and proteomic data? EURASIP J Bioinform Syst Biol 2009:158368. [PMID: 19390645 PMCID: PMC3171418 DOI: 10.1155/2009/158368] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 08/01/2008] [Revised: 12/04/2008] [Accepted: 01/19/2009] [Indexed: 11/18/2022]
Abstract
There has been considerable interest recently in the application of bagging in the classification of both gene-expression data and protein-abundance mass spectrometry data. The approach is often justified by the improvement it produces on the performance of unstable, overfitting classification rules under small-sample situations. However, the question of real practical interest is whether the ensemble scheme will improve performance of those classifiers sufficiently to beat the performance of single stable, nonoverfitting classifiers, in the case of small-sample genomic and proteomic data sets. To investigate that question, we conducted a detailed empirical study, using publicly-available data sets from published genomic and proteomic studies. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, nonoverfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, as expected, the ensemble method did not improve the performance of these classifiers significantly. Representative experimental results are presented and discussed in this work.
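For context, bagging as discussed in this abstract means fitting the same base classifier on many bootstrap resamples of the training set and combining the fitted models by majority vote. The sketch below uses a one-feature decision stump as the base learner purely to stay short; the study itself bagged CART trees, neural networks and other rules. All names and data values are illustrative assumptions.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Return (threshold, label_at_or_below, label_above) with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for below, above in ((0, 1), (1, 0)):
            errors = sum((below if x <= t else above) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, below, above)
    return best[1:]


def predict_stump(stump, x):
    threshold, below, above = stump
    return below if x <= threshold else above


def fit_bagged(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagged(models, x):
    """Majority vote over the bootstrap-fitted stumps."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy one-feature training set with two well-separated classes.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = fit_bagged(xs, ys)
print(predict_bagged(models, 0.0), predict_bagged(models, 1.0))
```

Averaging over resamples reduces the variance of an unstable base rule, which is the effect the study weighs against simply using a stable classifier in the first place.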
|
720
|
Zheng CH, Huang DS, Zhang L, Kong XZ. Tumor clustering using nonnegative matrix factorization with gene selection. IEEE Trans Inf Technol Biomed 2009; 13:599-607. [PMID: 19369170 DOI: 10.1109/titb.2009.2018115] [Citation(s) in RCA: 116] [Impact Index Per Article: 7.3] [Indexed: 11/06/2022]
Abstract
Tumor clustering is becoming a powerful method in cancer class discovery. Nonnegative matrix factorization (NMF) has shown advantages over other conventional clustering techniques. Nonetheless, there is still considerable room for improving the performance of NMF. To this end, in this paper, gene selection and explicitly enforcing sparseness are introduced into the factorization process. Particularly, independent component analysis is employed to select a subset of genes so that the effect of irrelevant or noisy genes can be reduced. The NMF and its extensions, sparse NMF and NMF with sparseness constraint, are then used for tumor clustering on the selected genes. A series of elaborate experiments are performed by varying the number of clusters and the number of selected genes to evaluate the cooperation between different gene selection settings and NMF-based clustering. Finally, the experiments on three representative gene expression datasets demonstrated that the proposed scheme can achieve better clustering results.
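The factorization this abstract builds on can be stated compactly. The notation below is the conventional one for NMF and is assumed here rather than quoted from the paper: $V$ is the nonnegative genes-by-samples expression matrix restricted to the selected genes, $W$ holds $k$ metagenes, and $H$ holds the per-sample coefficients. The updates shown are the standard multiplicative rules for the Frobenius objective; the sparse variants mentioned in the abstract add penalty or constraint terms that are not shown.

```latex
\min_{W \ge 0,\; H \ge 0} \; \lVert V - WH \rVert_F^2,
\qquad V \in \mathbb{R}_{\ge 0}^{m \times n},\;
W \in \mathbb{R}_{\ge 0}^{m \times k},\;
H \in \mathbb{R}_{\ge 0}^{k \times n}

H \leftarrow H \odot \frac{W^{\top} V}{W^{\top} W H},
\qquad
W \leftarrow W \odot \frac{V H^{\top}}{W H H^{\top}}

\text{cluster}(j) = \arg\max_{i \in \{1,\dots,k\}} H_{ij}
```

Here $\odot$ and the fraction bars denote element-wise multiplication and division, and the last line is the usual way of reading a clustering off the coefficient matrix: sample $j$ goes to the metagene with the largest coefficient.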
Affiliation(s)
- Chun-Hou Zheng
- Intelligent Computing Laboratory, Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China.
|
721
|
Baek S, Tsai CA, Chen JJ. Development of biomarker classifiers from high-dimensional data. Brief Bioinform 2009; 10:537-46. [PMID: 19346320 DOI: 10.1093/bib/bbp016] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.9] [Indexed: 11/13/2022]
Abstract
Recent development of high-throughput technology has accelerated interest in the development of molecular biomarker classifiers for safety assessment, disease diagnostics and prognostics, and prediction of response for patient assignment. This article reviews and evaluates some important aspects and key issues in the development of biomarker classifiers. Development of a biomarker classifier for high-throughput data involves two components: (i) model building and (ii) performance assessment. This article focuses on feature selection in model building and cross validation for performance assessment. A 'frequency' approach to feature selection is presented and compared to the 'conventional' approach in terms of the predictive accuracy and stability of the selected feature set. The two approaches are compared based on four biomarker classifiers, each with a different feature selection method and well-known classification algorithm. In each of the four classifiers the feature predictor set selected by the frequency approach is more stable than the feature set selected by the conventional approach.
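The abstract names a 'frequency' approach to feature selection without describing it. One plausible reading, sketched below as an assumption rather than as the authors' procedure, is to repeat the feature ranking on each cross-validation training split and keep the features that are selected most often. The ranking rule (absolute difference of class means), the function names and the toy data are all illustrative.

```python
import random
from collections import Counter
from statistics import mean


def top_k_by_mean_difference(X, y, k):
    """Rank features by |mean(class 1) - mean(class 0)| and return the top-k indices."""
    scored = []
    for j in range(len(X[0])):
        m1 = mean(row[j] for row, label in zip(X, y) if label == 1)
        m0 = mean(row[j] for row, label in zip(X, y) if label == 0)
        scored.append((abs(m1 - m0), j))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scored[:k]]


def frequency_selection(X, y, k, n_folds=5, seed=0):
    """Count how often each feature is in the top-k across training folds; keep the k most frequent."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    counts = Counter()
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        counts.update(top_k_by_mean_difference([X[i] for i in train], [y[i] for i in train], k))
    ranked = sorted(counts, key=lambda j: (-counts[j], j))
    return ranked[:k], counts


# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [[5.0, 0.1, 0.3], [5.2, 0.4, 0.2], [4.9, 0.2, 0.5], [5.1, 0.3, 0.1], [5.3, 0.5, 0.4],
     [1.0, 0.2, 0.3], [1.1, 0.4, 0.1], [0.9, 0.1, 0.4], [1.2, 0.3, 0.2], [0.8, 0.5, 0.5]]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
selected, counts = frequency_selection(X, y, k=1)
print(selected, dict(counts))
```

A feature that only ranks highly on one particular split gets a low count, which is the intuition behind the stability advantage the abstract reports.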
Affiliation(s)
- Songjoon Baek
- National Center for Toxicological Research, U.S. Food and Drug Administration, USA.
|
722
|
Chan YB, Hall P. Scale adjustments for classifiers in high-dimensional, low sample size settings. Biometrika 2009. [DOI: 10.1093/biomet/asp007] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Indexed: 11/14/2022]
|
723
|
Yukinawa N, Oba S, Kato K, Ishii S. Optimal aggregation of binary classifiers for multiclass cancer diagnosis using gene expression profiles. IEEE/ACM Trans Comput Biol Bioinform 2009; 6:333-343. [PMID: 19407356 DOI: 10.1109/tcbb.2007.70239] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 05/27/2023]
Abstract
Multiclass classification is one of the fundamental tasks in bioinformatics and typically arises in cancer diagnosis studies by gene expression profiling. There have been many studies of aggregating binary classifiers to construct a multiclass classifier based on one-versus-the-rest (1R), one-versus-one (11), or other coding strategies, as well as some comparison studies between them. However, the studies found that the best coding depends on each situation. Therefore, a new problem, which we call the "optimal coding problem," has arisen: how can we determine which coding is the optimal one in each situation? To approach this optimal coding problem, we propose a novel framework for constructing a multiclass classifier, in which each binary classifier to be aggregated has a weight value to be optimally tuned based on the observed data. Although there is no a priori answer to the optimal coding problem, our weight tuning method can be a consistent answer to the problem. We apply this method to various classification problems including a synthesized data set and some cancer diagnosis data sets from gene expression profiling. The results demonstrate that, in most situations, our method can improve classification accuracy over simple voting heuristics and is better than or comparable to state-of-the-art multiclass predictors.
Affiliation(s)
- Naoto Yukinawa
- Graduate School of Information Sciences, Nara Institute of Science and Technology, Ikoma, Nara, Japan.
|
724
|
Paul TK, Iba H. Prediction of cancer class with majority voting genetic programming classifier using gene expression data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2009; 6:353-367. [PMID: 19407358 DOI: 10.1109/tcbb.2007.70245] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/27/2023]
Abstract
To better understand different types of cancer and to find possible disease biomarkers, many researchers have recently been analyzing gene expression data with various machine learning techniques. However, because the number of training samples is very small compared to the huge number of genes, and because of class imbalance, most of these methods suffer from overfitting. In this paper, we present a majority voting genetic programming classifier (MVGPC) for the classification of microarray data. Instead of a single rule or a single set of rules, we evolve multiple rules with genetic programming (GP) and then apply those rules to test samples to determine their labels by majority voting. In experiments on four public cancer data sets, including multiclass data sets, the test accuracies of MVGPC were better than those of other methods, including AdaBoost with GP. Moreover, some of the more frequently occurring genes in the classification rules are known to be associated with the types of cancer studied in this paper.
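The voting step described in this abstract can be sketched as follows. The three lambda rules are hypothetical stand-ins for rules evolved by genetic programming, and the gene names and thresholds are invented for illustration; only the majority vote itself comes from the abstract.

```python
from collections import Counter


def majority_vote(rules, sample):
    """Return the label predicted by most rules for one sample."""
    votes = Counter(rule(sample) for rule in rules)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for evolved GP rules over expression levels.
rules = [
    lambda s: "tumor" if s["g1"] - s["g2"] > 0 else "normal",
    lambda s: "tumor" if s["g3"] > 1.5 else "normal",
    lambda s: "tumor" if s["g1"] * s["g4"] > 2.0 else "normal",
]

sample = {"g1": 2.0, "g2": 1.0, "g3": 1.0, "g4": 1.5}
label = majority_vote(rules, sample)
```

Two of the three rules vote "tumor" for this sample, so that label wins. Using an odd number of rules avoids ties in the two-class case; a multiclass setting would need an explicit tie-breaking policy.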
Affiliation(s)
- Topon Kumar Paul
- System Engineering Laboratory, Corporate Research & Development Center, Toshiba Corporation, Kawasaki-shi, Kanagawa, Japan.
|
725
|
Chuang LY, Ke CH, Chang HW, Yang CH. A Two-Stage Feature Selection Method for Gene Expression Data. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2009; 13:127-37. [DOI: 10.1089/omi.2008.0083] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Affiliation(s)
- Li-Yeh Chuang
- Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan, Republic of China
| | - Chao-Hsuan Ke
- Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, Republic of China
| | - Hsueh-Wei Chang
- Faculty of Biomedical Science and Environmental Biology, Kaohsiung Medical University, Taiwan, Republic of China
- Graduate Institute of Natural Products, College of Pharmacy, Kaohsiung Medical University, Kaohsiung, Taiwan, Republic of China
- Center of Excellence for Environmental Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan, Republic of China
| | - Cheng-Hong Yang
- Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, Republic of China
|
726
|
Liu LYD, Chen CY, Chen MJM, Tsai MS, Lee CHS, Phang TL, Chang LY, Kuo WH, Hwa HL, Lien HC, Jung SM, Lin YS, Chang KJ, Hsieh FJ. Statistical identification of gene association by CID in application of constructing ER regulatory network. BMC Bioinformatics 2009; 10:85. [PMID: 19292896 PMCID: PMC2679734 DOI: 10.1186/1471-2105-10-85] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2008] [Accepted: 03/17/2009] [Indexed: 02/01/2023] Open
Abstract
Background A variety of high-throughput techniques are now available for constructing comprehensive gene regulatory networks in systems biology. In this study, we report a new statistical approach for facilitating in silico inference of regulatory network structure. The new measure of association, coefficient of intrinsic dependence (CID), is model-free and can be applied to both continuous and categorical distributions. When given two variables X and Y, CID answers whether Y is dependent on X by examining the conditional distribution of Y given X. In this paper, we apply CID to analyze the regulatory relationships between transcription factors (TFs) (X) and their downstream genes (Y) based on clinical data. More specifically, we use estrogen receptor α (ERα) as the variable X, and the analyses are based on 48 clinical breast cancer gene expression arrays (48A). Results The analytical utility of CID was evaluated in comparison with four commonly used statistical methods, Galton-Pearson's correlation coefficient (GPCC), Student's t-test (STT), coefficient of determination (CoD), and mutual information (MI). When being compared to GPCC, CoD, and MI, CID reveals its preferential ability to discover the regulatory association where distribution of the mRNA expression levels on X and Y does not fit linear models. On the other hand, when CID is used to measure the association of a continuous variable (Y) against a discrete variable (X), it shows similar performance as compared to STT, and appears to outperform CoD and MI. In addition, this study established a two-layer transcriptional regulatory network to exemplify the usage of CID, in combination with GPCC, in deciphering gene networks based on gene expression profiles from patient arrays. Conclusion CID is shown to provide useful information for identifying associations between genes and transcription factors of interest in patient arrays. 
When coupled with the relationships detected by GPCC, the associations predicted by CID are applicable to the construction of transcriptional regulatory networks. This study shows how information from different data sources and learning algorithms can be integrated to investigate whether relevant regulatory mechanisms identified in cell models can also be partially re-identified in clinical samples of breast cancers. Availability The implementation of CID in R code can be freely downloaded from .
Affiliation(s)
- Li-Yu D Liu
- Department of Agronomy, Biometry Division, National Taiwan University, Taipei, Taiwan.
|
727
|
Sparse representation for classification of tumors using gene expression data. J Biomed Biotechnol 2009; 2009:403689. [PMID: 19300522 PMCID: PMC2655631 DOI: 10.1155/2009/403689] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2008] [Accepted: 01/12/2009] [Indexed: 11/17/2022] Open
Abstract
Personalized drug design requires classifying cancer patients as accurately as possible. With advances in genome sequencing and microarray technology, a large amount of gene expression data has been and will continue to be produced from cancer patients. Such cancer-related gene expression data allow us to classify tumors at the genome-wide level. However, cancer-related gene expression datasets typically have far more genes (features) than samples (patients), which poses a challenge for tumor classification. In this paper, a new method is proposed for cancer diagnosis using gene expression data by casting the classification problem as finding sparse representations of test samples with respect to training samples. The sparse representation is computed by the l1-regularized least squares method. To investigate its performance, the proposed method is applied to six tumor gene expression datasets and compared with various support vector machine (SVM) methods. The experimental results show that the performance of the proposed method is comparable with or better than that of SVMs. In addition, the proposed method is more efficient than SVMs because it needs no model selection.
|
728
|
van Wieringen WN, Kun D, Hampel R, Boulesteix AL. Survival prediction using gene expression data: A review and comparison. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.05.021] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
729
|
Shim J, Sohn I, Kim S, Lee JW, Green PE, Hwang C. Selecting marker genes for cancer classification using supervised weighted kernel clustering and the support vector machine. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.04.028] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
730
|
Khalili A, Huang T, Lin S. A Robust Unified Approach to Analyzing Methylation and Gene Expression Data. Comput Stat Data Anal 2009; 53:1701-1710. [PMID: 20161265 DOI: 10.1016/j.csda.2008.07.010] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Abstract
Microarray technology has made it possible to investigate expression levels, and more recently methylation signatures, of thousands of genes simultaneously, in a biological sample. Since more and more data from different biological systems or technological platforms are being generated at an incredible rate, there is an increasing need to develop statistical methods that are applicable to multiple data types and platforms. Motivated by such a need, a flexible finite mixture model that is applicable to methylation, gene expression, and potentially data from other biological systems, is proposed. Two major thrusts of this approach are to allow for a variable number of components in the mixture to capture non-biological variation and small biases, and to use a robust procedure for parameter estimation and probe classification. The method was applied to the analysis of methylation signatures of three breast cancer cell lines. It was also tested on three sets of expression microarray data to study its power and type I error rates. Comparison with a number of existing methods in the literature yielded very encouraging results; lower type I error rates and comparable/better power were achieved based on the limited study. Furthermore, the method also leads to more biologically interpretable results for the three breast cancer cell lines.
Affiliation(s)
- Abbas Khalili
- Department of Statistics, The Ohio State University, Columbus, OH 43210, United States
|
731
|
Xu P, Brock GN, Parrish RS. Modified linear discriminant analysis approaches for classification of high-dimensional microarray data. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.02.005] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
732
|
Yang TY. Simple Bayesian binary framework for discovering significant genes and classifying cancer diagnosis. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.04.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
733
|
Annest A, Bumgarner RE, Raftery AE, Yeung KY. Iterative Bayesian Model Averaging: a method for the application of survival analysis to high-dimensional microarray data. BMC Bioinformatics 2009; 10:72. [PMID: 19245714 PMCID: PMC2657791 DOI: 10.1186/1471-2105-10-72] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2008] [Accepted: 02/26/2009] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Microarray technology is increasingly used to identify potential biomarkers for cancer prognostics and diagnostics. Previously, we have developed the iterative Bayesian Model Averaging (BMA) algorithm for use in classification. Here, we extend the iterative BMA algorithm for application to survival analysis on high-dimensional microarray data. The main goal in applying survival analysis to microarray data is to determine a highly predictive model of patients' time to event (such as death, relapse, or metastasis) using a small number of selected genes. Our multivariate procedure combines the effectiveness of multiple contending models by calculating the weighted average of their posterior probability distributions. Our results demonstrate that our iterative BMA algorithm for survival analysis achieves high prediction accuracy while consistently selecting a small and cost-effective number of predictor genes. RESULTS We applied the iterative BMA algorithm to two cancer datasets: breast cancer and diffuse large B-cell lymphoma (DLBCL) data. On the breast cancer data, the algorithm selected a total of 15 predictor genes across 84 contending models from the training data. The maximum likelihood estimates of the selected genes and the posterior probabilities of the selected models from the training data were used to divide patients in the test (or validation) dataset into high- and low-risk categories. Using the genes and models determined from the training data, we assigned patients from the test data into highly distinct risk groups (as indicated by a p-value of 7.26e-05 from the log-rank test). Moreover, we achieved comparable results using only the 5 top selected genes with 100% posterior probabilities. On the DLBCL data, our iterative BMA procedure selected a total of 25 genes across 3 contending models from the training data. Once again, we assigned the patients in the validation set to significantly distinct risk groups (p-value = 0.00139). 
CONCLUSION The strength of the iterative BMA algorithm for survival analysis lies in its ability to account for model uncertainty. The results from this study demonstrate that our procedure selects a small number of genes while eclipsing other methods in predictive performance, making it a highly accurate and cost-effective prognostic tool in the clinical setting.
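The core averaging step of Bayesian model averaging can be illustrated with a minimal sketch. It is not the authors' iterative BMA implementation: the function name `bma_average` and the toy numbers are assumptions, and the per-model predictions and posterior model probabilities are taken as given rather than estimated.

```python
def bma_average(predictions, posteriors):
    """Average per-model predictions, weighted by posterior model probability."""
    if len(predictions) != len(posteriors):
        raise ValueError("need one posterior probability per model")
    total = sum(posteriors)
    if total <= 0:
        raise ValueError("posterior probabilities must sum to a positive value")
    # Dividing by the total renormalizes weights that do not sum exactly to 1.
    return sum(p * w for p, w in zip(predictions, posteriors)) / total


# Hypothetical risk predictions from three contending models for one patient,
# with hypothetical posterior model probabilities.
risk = bma_average([0.9, 0.6, 0.2], [0.5, 0.3, 0.2])
```

The averaged risk for this patient is 0.9·0.5 + 0.6·0.3 + 0.2·0.2 = 0.67. Because every contending model contributes in proportion to its posterior probability, the prediction reflects model uncertainty instead of committing to a single selected model.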
Affiliation(s)
- Amalia Annest
- Institute of Technology/Computing and Software Systems, Box 358426, University of Washington, Tacoma, WA 98402, USA
| | - Roger E Bumgarner
- Department of Microbiology, Box 358070, University of Washington, Seattle, WA 98195, USA
| | - Adrian E Raftery
- Department of Statistics, Box 354320, University of Washington, Seattle, WA 98195, USA
| | - Ka Yee Yeung
- Department of Microbiology, Box 358070, University of Washington, Seattle, WA 98195, USA
|
734
|
André F, Michiels S, Dessen P, Scott V, Suciu V, Uzan C, Lazar V, Lacroix L, Vassal G, Spielmann M, Vielh P, Delaloge S. Exonic expression profiling of breast cancer and benign lesions: a retrospective analysis. Lancet Oncol 2009; 10:381-90. [PMID: 19249242 DOI: 10.1016/s1470-2045(09)70024-5] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
BACKGROUND Gene-expression arrays have generated molecular predictors of relapse and drug sensitivity in breast cancer. We aimed to identify exons differently expressed in malignant and benign breast lesions and to generate a molecular classifier for breast-cancer diagnosis. METHODS 165 breast samples were obtained by fine-needle aspiration. Complementary DNA was hybridised on splice array. A nearest centroid prediction rule was developed to classify lesions as malignant or benign on a training set, and its performance was assessed on an independent validation set. A two-way ANOVA model identified probe sets with differential expression in malignant and benign lesions while adjusting for scan dates. FINDINGS 120 breast cancers and 45 benign lesions were included in the study. A molecular classifier for breast-cancer diagnosis with 1228 probe sets was generated from the training set (n=94). This signature accurately classified all samples (100% accuracy, 95% CI 96-100%). In the validation set (n=71), the molecular predictor accurately classified 68 of 71 tumours (96%, 88-99%). When the 165 samples were taken into account, 37 858 exon probe sets (5.4%) and 3733 genes (7.0%) were differently expressed in malignant and benign lesions (threshold: adjusted p<0.05). Genes involved in spliceosome assembly were significantly overexpressed in malignant disease (permutation p=0.002). In the same population of 165 samples, 956 exon probe sets presented both higher intensity and higher splice index in breast cancer than in benign lesions, although located on unchanged genes. INTERPRETATION Many exons are differently expressed by breast cancer and benign lesions, and alternative transcripts contribute to the molecular characteristics of breast malignancy. Development of molecular classifiers for breast-cancer diagnosis with fine-needle aspiration should be possible.
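A nearest centroid prediction rule of the kind mentioned in the methods can be sketched in a few lines. This is a generic version under stated assumptions (Euclidean distance, no feature shrinkage or scaling, two invented features per sample); the abstract does not give the study's exact rule or preprocessing.

```python
import math


def fit_centroids(X, y):
    """Compute the per-class mean vector from training rows X and labels y."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))


# Hypothetical training set: two probe-set intensities per sample.
X_train = [[1.0, 2.0], [1.2, 1.8], [4.0, 5.0], [4.2, 5.2]]
y_train = ["benign", "benign", "malignant", "malignant"]

centroids = fit_centroids(X_train, y_train)
prediction = predict(centroids, [3.9, 5.0])
```

As in the study design, the centroids would be fitted on the training set only and then applied unchanged to an independent validation set.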
Affiliation(s)
- Fabrice André
- Breast Cancer Unit, Department of Medical Oncology, Institut Gustave Roussy France, Villejuif, France
|
735
|
García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A. A Robust Maximal F-Ratio Statistic to Detect Clusters Structure. COMMUN STAT-THEOR M 2009. [DOI: 10.1080/03610920802287297] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
Affiliation(s)
- L. A. García-Escudero
- Departamento de Estadística e Investigación Operativa, Universidad de Valladolid, Valladolid, Spain
| | - A. Gordaliza
- Departamento de Estadística e Investigación Operativa, Universidad de Valladolid, Valladolid, Spain
| | - C. Matrán
- Departamento de Estadística e Investigación Operativa, Universidad de Valladolid, Valladolid, Spain
| | - A. Mayo-Iscar
- Departamento de Estadística e Investigación Operativa, Universidad de Valladolid, Valladolid, Spain
|
736
|
How to improve postgenomic knowledge discovery using imputation. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2009:717136. [PMID: 19223972 PMCID: PMC3171441 DOI: 10.1155/2009/717136] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/28/2008] [Revised: 09/08/2008] [Accepted: 11/04/2008] [Indexed: 11/18/2022]
Abstract
While microarrays make it feasible to rapidly investigate many complex biological problems, their multistep fabrication has the proclivity for error at every stage. The standard tactic has been to either ignore or regard erroneous gene readings as missing values, though this assumption can exert a major influence upon postgenomic knowledge discovery methods like gene selection and gene regulatory network (GRN) reconstruction. This has been the catalyst for a raft of new flexible imputation algorithms including local least square impute and the recent heuristic collateral missing value imputation, which exploit the biological transactional behaviour of functionally correlated genes to afford accurate missing value estimation. This paper examines the influence of missing value imputation techniques upon postgenomic knowledge inference methods with results for various algorithms consistently corroborating that instead of ignoring missing values, recycling microarray data by flexible and robust imputation can provide substantial performance benefits for subsequent downstream procedures.
|
737
|
Sylvester RJ. Combining a molecular profile with a clinical and pathological profile: biostatistical considerations. ACTA ACUST UNITED AC 2009:185-90. [PMID: 18815933 DOI: 10.1080/03008880802283847] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
Abstract
The use of molecular markers and gene expression profiling provides a promising approach for improving the predictive accuracy of current prognostic indices for predicting which patients with non-muscle-invasive bladder cancer will progress to muscle-invasive disease. There are many statistical pitfalls in establishing the benefit of a multigene expression classifier during its development. First, there are issues related to the identification of the individual genes and the false discovery rate, the instability of the genes identified and their combination into a classifier. Secondly, the classifier should be validated, preferably on an independent data set, to show its reproducibility. Next, it is necessary to show that adding the classifier to an existing model based on the most important clinical and pathological factors improves the predictive accuracy of the model. This cannot be determined based on the classifier's hazard ratio or p-value in a multivariate model, but should be assessed based on an improvement in statistics such as the area under the curve and the concordance index. Finally, nomograms are superior to stage and risk group classifications for predicting outcome, but the model predicting the outcome must be well calibrated. It is important for investigators to be aware of these pitfalls in order to develop statistically valid classifiers that will truly improve our ability to predict a patient's risk of progression.
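The concordance index named in this abstract can be computed as in the sketch below, a simplified version of Harrell's C for right-censored data. It assumes higher scores mean higher risk, skips pairs with tied event times, and uses invented toy data; the two score lists stand for a hypothetical model with and without a molecular classifier.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ranked correctly by risk score.

    A pair (i, j) is comparable when patient i has an observed event
    strictly before patient j's follow-up time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times, event indicators (1 = progression observed),
# and risk scores from two hypothetical models.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_with_classifier = concordance_index(times, events, [0.9, 0.7, 0.3, 0.2])
c_clinical_only = concordance_index(times, events, [0.9, 0.2, 0.3, 0.7])
```

With these toy numbers the first model ranks all five comparable pairs correctly (C = 1.0) and the second ranks three of five correctly (C = 0.6). Comparing such values on validation data is the kind of assessment the abstract recommends over a hazard ratio or p-value.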
|
738
|
Marchini S, Mariani P, Chiorino G, Marrazzo E, Bonomi R, Fruscio R, Clivio L, Garbi A, Torri V, Cinquini M, Dell'Anna T, Apolone G, Broggini M, D'Incalci M. Analysis of gene expression in early-stage ovarian cancer. Clin Cancer Res 2009; 14:7850-60. [PMID: 19047114 DOI: 10.1158/1078-0432.ccr-08-0523] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
PURPOSE Gene expression profile was analyzed in 68 stage I and 15 borderline ovarian cancers to determine if different clinical features of stage I ovarian cancer such as histotype, grade, and survival are related to differential gene expression. EXPERIMENTAL DESIGN Tumors were obtained directly at surgery and immediately frozen in liquid nitrogen until analysis. Glass arrays containing 16,000 genes were used in a dual-color assay labeling protocol. RESULTS Unsupervised analysis identified eight major patient partitions, one of which was statistically associated to overall survival, grading, and histotype and another with grading and histotype. Supervised analysis allowed detection of gene profiles clearly associated to histotype or to degree of differentiation. No difference was found between borderline and grade 1 tumors. As to recurrence, a subset of genes able to differentiate relapsers from nonrelapsers was identified. Among these, cyclin E and minichromosome maintenance protein 5 were found particularly relevant, as their expression was inversely correlated to progression-free survival (P = 0.00033 and 0.017, respectively). CONCLUSIONS Specific molecular signatures define different histotypes and prognosis of stage I ovarian cancer. Mucinous and clear cells histotypes can be distinguished from the others regardless of tumor grade. Cyclin E and minichromosome maintenance protein 5, whose expression was found previously to be related to a bad prognosis of advanced ovarian cancer, appear to be potential prognostic markers in stage I ovarian cancer too, independent of other pathologic and clinical variables.
Affiliation(s)
- Sergio Marchini
- Department of Oncology, Istituto di Ricerche Farmacologiche Mario Negri, Milan, Italy.
|
739
|
Huerta EB, Duval B, Hao JK. Fuzzy logic for elimination of redundant information of microarray data. GENOMICS PROTEOMICS & BIOINFORMATICS 2009; 6:61-73. [PMID: 18973862 PMCID: PMC5054105 DOI: 10.1016/s1672-0229(08)60021-2] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
Gene subset selection is essential for classification and analysis of microarray data. However, gene selection is known to be a very difficult task since gene expression data not only have high dimensionality, but also contain redundant information and noise. To cope with these difficulties, this paper introduces a fuzzy logic based pre-processing approach composed of two main steps. First, we use fuzzy inference rules to transform the gene expression levels of a given dataset into fuzzy values. Then we apply a similarity relation to these fuzzy values to define fuzzy equivalence groups, each group containing strongly similar genes. Dimension reduction is achieved by considering for each group of similar genes a single representative based on mutual information. To assess the usefulness of this approach, extensive experiments were carried out on three well-known public datasets with a combined classification model using three statistical filters and three classifiers.
|
740
|
Gene expression data classification using consensus independent component analysis. GENOMICS PROTEOMICS & BIOINFORMATICS 2009; 6:74-82. [PMID: 18973863 PMCID: PMC5054104 DOI: 10.1016/s1672-0229(08)60022-4] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/14/2023]
Abstract
We propose a new method for tumor classification from gene expression data, which consists of three main steps. First, the original DNA microarray gene expression data are modeled by independent component analysis (ICA). Second, the most discriminant eigenassays extracted by ICA are selected by the sequential floating forward selection technique. Finally, a support vector machine is used to classify the modeled data. To show the validity of the proposed method, we applied it to classify three DNA microarray datasets involving various human normal and tumor tissue samples. The experimental results show that the method is efficient and feasible.
|
741
|
Abstract
Microarray data based tumor diagnosis is a very interesting topic in bioinformatics. One of the key problems is the discovery and analysis of informative genes of a tumor. Although there are many elaborate approaches to this problem, it is still difficult to select a reasonable set of informative genes for tumor diagnosis only with microarray data. In this paper, we classify the genes expressed through microarray data into a number of clusters via the distance sensitive rival penalized competitive learning (DSRPCL) algorithm and then detect the informative gene cluster or set with the help of support vector machine (SVM). Moreover, the critical or powerful informative genes can be found through further classifications and detections on the obtained informative gene clusters. It is well demonstrated by experiments on the colon, leukemia, and breast cancer datasets that our proposed DSRPCL-SVM approach leads to a reasonable selection of informative genes for tumor diagnosis.
|
742
|
Chen PC, Huang SY, Chen WJ, Hsiao CK. A new regularized least squares support vector regression for gene selection. BMC Bioinformatics 2009; 10:44. [PMID: 19187562 PMCID: PMC2669483 DOI: 10.1186/1471-2105-10-44] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2008] [Accepted: 02/03/2009] [Indexed: 11/28/2022] Open
Abstract
Background Selection of influential genes with microarray data often faces the difficulties of a large number of genes and a relatively small group of subjects. In addition to the curse of dimensionality, many gene selection methods weight the contribution from each individual subject equally. This equal-contribution assumption cannot account for the possible dependence among subjects who associate similarly to the disease, and may restrict the selection of influential genes. Results A novel approach to gene selection is proposed based on kernel similarities and kernel weights. We do not assume uniformity for subject contribution. Weights are calculated via regularized least squares support vector regression (RLS-SVR) of class levels on kernel similarities and are used to weight subject contribution. The cumulative sums of weighted expression levels are then ranked to select responsible genes. These procedures also work for multiclass classification. We demonstrate this algorithm on acute leukemia, colon cancer, small, round blue cell tumors of childhood, breast cancer, and lung cancer studies, using kernel Fisher discriminant analysis and support vector machines as classifiers. Other procedures are compared as well. Conclusion This approach is easy to implement and fast in computation for both binary and multiclass problems. The gene set provided by the RLS-SVR weight-based approach contains fewer genes and achieves higher accuracy than other procedures.
Affiliation(s)
- Pei-Chun Chen
- Bioinformatics and Biostatistics Core Laboratory, National Taiwan University, Taipei, Taiwan, Republic of China.
|
743
|
Simultaneous cancer classification and gene selection with Bayesian nearest neighbor method: An integrated approach. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.10.012] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
744
|
Tao D, Li X, Wu X, Maybank SJ. Geometric mean for subspace selection. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2009; 31:260-274. [PMID: 19110492 DOI: 10.1109/tpami.2008.70] [Citation(s) in RCA: 170] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/27/2023]
Abstract
Subspace selection approaches are powerful tools in pattern classification and data visualization. One of the most important subspace approaches is the linear dimensionality reduction step in Fisher's linear discriminant analysis (FLDA), which has been successfully employed in many fields such as biometrics, bioinformatics, and multimedia information management. However, the linear dimensionality reduction step in FLDA has a critical drawback: for a classification task with c classes, if the dimension of the projected subspace is strictly lower than c - 1, the projection to a subspace tends to merge those classes that are close together in the original feature space. If separate classes are sampled from Gaussian distributions, all with identical covariance matrices, then the linear dimensionality reduction step in FLDA maximizes the mean value of the Kullback-Leibler (KL) divergences between different classes. Based on this viewpoint, the geometric mean for subspace selection is studied in this paper. Three criteria are analyzed: 1) maximization of the geometric mean of the KL divergences, 2) maximization of the geometric mean of the normalized KL divergences, and 3) the combination of 1 and 2. Preliminary experimental results based on synthetic data, the UCI Machine Learning Repository, and handwritten digits show that the third criterion is a potential discriminative subspace selection method, which significantly reduces the class separation problem in comparison with the linear dimensionality reduction step in FLDA and several of its representative extensions.
Affiliation(s)
- Dacheng Tao
- School of Computer Engineering, Nanyang Technological University, Singapore.
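The contrast between the arithmetic and geometric mean criteria in the abstract above can be written out as a short worked sketch. The notation (projection matrix W, class means \mu_i, shared covariance \Sigma, class priors q_i) is assumed here for illustration, and the unweighted form of the geometric mean criterion is a simplification that may differ from the paper's general formulation.

```latex
% KL divergence between classes i and j after projection by W,
% for Gaussian classes with a shared covariance matrix \Sigma:
D_W(p_i \,\|\, p_j)
  = \tfrac{1}{2}\,(\mu_i - \mu_j)^{\top} W
    \left(W^{\top} \Sigma W\right)^{-1}
    W^{\top} (\mu_i - \mu_j)

% Arithmetic mean criterion (equivalent to the FLDA projection step):
W^{\ast}_{\mathrm{AM}}
  = \arg\max_{W} \sum_{i \neq j} q_i\, q_j\, D_W(p_i \,\|\, p_j)

% Geometric mean criterion (criterion 1, shown unweighted):
W^{\ast}_{\mathrm{GM}}
  = \arg\max_{W} \prod_{i \neq j} D_W(p_i \,\|\, p_j)
  = \arg\max_{W} \sum_{i \neq j} \log D_W(p_i \,\|\, p_j)
```

Because \log D_W tends to minus infinity as any single divergence approaches zero, the geometric mean objective penalizes a projection that merges even one pair of classes, whereas the arithmetic mean can remain large when a few well-separated pairs dominate the sum.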
|
745
|
Abstract
The promise of gene expression profiling using microarray technology has inspired much new hope for finding genes involved in complications resulting from burn injury. It has become clear that complications resulting from burn injury involve the collective action of many genes. Therefore, genetic dissection of burn injury should be carried out in a global context. Gene expression microarrays (GEMA) provide such global information on the transcription activities of essentially all genes simultaneously. It is hoped that this promising technology can be applied to samples drawn from large-scale, well-defined clinical studies and help us untangle the web of pathways leading to complications resulting from burn injury and to the development of more effective therapies for treating burn injury. However, the extremely high dimensionality and noise inherent in GEMA data pose great challenges to identifying molecular signatures involved in burn injury. In this article, we discuss the technical challenges associated with experimental design, data analysis, and modeling gene regulatory networks. We note that while it is too early to tell how much of the enormous potential of GEMA will be realized, its success will likely depend most critically on careful experimental designs and the ability of bioinformatics to rise to the challenge of mining high-dimensional GEMA data and correlating it with clinical information.
|
746
|
Herold D, Lutter D, Schachtner R, Tome AM, Schmitz G, Lang EW. Comparison of unsupervised and supervised gene selection methods. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2009; 2008:5212-5. [PMID: 19163892 DOI: 10.1109/iembs.2008.4650389] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/06/2022]
Abstract
Modern machine learning methods based on matrix decomposition techniques like Independent Component Analysis (ICA) provide new and efficient analysis tools that are currently being explored for analyzing gene expression profiles. These exploratory feature extraction techniques yield informative expression modes (ICA) which are considered indicative of underlying regulatory processes. Their most strongly expressed genes represent marker genes for classification of the tissue samples under investigation. Comparison with supervised gene selection methods based on statistical scores or support vector machines corroborates these findings. The method is applied to macrophages loaded/de-loaded with chemically modified low density lipids.
Affiliation(s)
- D Herold
- Institute for Biophysics, CIML Group, University of Regensburg, D-93040, Germany
|
747
|
Abebe A, Nudurupati SV. Rank-Based Classification Using Robust Discriminant Functions. COMMUN STAT-SIMUL C 2009. [DOI: 10.1080/03610910802446993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
748
|
Hamid JS, Hu P, Roslin NM, Ling V, Greenwood CMT, Beyene J. Data integration in genetics and genomics: methods and challenges. Human Genomics and Proteomics: HGP 2009; 2009. [PMID: 20948564 PMCID: PMC2950414 DOI: 10.4061/2009/869093] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.8] [Received: 09/25/2008] [Accepted: 12/01/2008] [Indexed: 01/18/2023]
Abstract
Due to rapid technological advances, various types of genomic and proteomic data with different sizes, formats, and structures have become available. Among them are gene expression, single nucleotide polymorphism, copy number variation, and protein-protein/gene-gene interactions. Each of these distinct data types provides a different, partly independent and complementary, view of the whole genome. However, understanding functions of genes, proteins, and other aspects of the genome requires more information than is provided by each of the datasets. Integrating data from different sources is, therefore, an important part of current research in genomics and proteomics. Data integration also plays important roles in combining clinical, environmental, and demographic data with high-throughput genomic data. Nevertheless, the concept of data integration is not well defined in the literature and it may mean different things to different researchers. In this paper, we first propose a conceptual framework for integrating genetic, genomic, and proteomic data. The framework captures fundamental aspects of data integration and is developed around the key steps in genetic, genomic, and proteomic data fusion. Second, we provide a review of some of the most commonly used current methods and approaches for combining genomic data, with a focus on the statistical aspects.
Affiliation(s)
- Jemila S Hamid
- Biostatistics Methodology Unit, The Hospital for Sick Children Research Institute, 555 University Avenue, Toronto, ON, Canada M5G 1X8
|
749
|
Zhou H, Pan W, Shen X. Penalized model-based clustering with unconstrained covariance matrices. Electron J Stat 2009; 3:1473-1496. [PMID: 20463857 DOI: 10.1214/09-ejs487] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.6] [Indexed: 11/19/2022]
Abstract
Clustering is one of the most useful tools for high-dimensional analysis, e.g., for microarray data. It becomes challenging in the presence of a large number of noise variables, which may mask underlying clustering structures. Therefore, noise removal through variable selection is necessary. One effective way is regularization for simultaneous parameter estimation and variable selection in model-based clustering. However, existing methods focus on regularizing the mean parameters representing centers of clusters, ignoring dependencies among variables within clusters, leading to incorrect orientations or shapes of the resulting clusters. In this article, we propose a regularized Gaussian mixture model permitting a treatment of general covariance matrices, taking various dependencies into account. At the same time, this approach shrinks the means and covariance matrices, achieving better clustering and variable selection. To overcome one technical challenge in estimating possibly large covariance matrices, we derive an E-M algorithm utilizing the graphical lasso (Friedman et al., 2007) for parameter estimation. Numerical examples, including applications to microarray gene expression data, demonstrate the utility of the proposed method.
Affiliation(s)
- Hui Zhou
- Division of Biostatistics, School of Public Health, University of Minnesota
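One way to read the regularized Gaussian mixture described in the abstract above is as a penalized log-likelihood with L1 penalties on both the cluster means and the precision matrices. The form below is a sketch under that assumption; the symbols (\pi_k, \mu_k, \Sigma_k, W_k, \lambda_1, \lambda_2) are notation introduced here, and the exact penalty used in the paper may differ.

```latex
% Penalized log-likelihood for a K-component Gaussian mixture,
% with W_k = \Sigma_k^{-1} the precision matrix of cluster k:
\log L_P(\Theta)
  = \sum_{i=1}^{n} \log \sum_{k=1}^{K}
      \pi_k\, \phi\!\left(x_i;\, \mu_k,\, \Sigma_k\right)
    \;-\; \lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} \lvert \mu_{kj} \rvert
    \;-\; \lambda_2 \sum_{k=1}^{K} \sum_{j,l=1}^{p} \lvert W_{k;jl} \rvert

% In the M-step, with a weighted sample covariance S_k for cluster k,
% the update for W_k takes the shape of a graphical lasso problem
% (\lambda_2' denotes a rescaled penalty level):
\widehat{W}_k
  = \arg\max_{W \succ 0}\;
    \log\det W \;-\; \operatorname{tr}(S_k W)
    \;-\; \lambda_2' \sum_{j,l} \lvert W_{jl} \rvert
```

Under this reading, a variable whose means are shrunk to a common value across clusters and whose precision entries show no cluster-specific structure carries no clustering information, which is how the penalties perform variable selection.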
|
750
|
Abstract
The development of a successful classifier from multiple predictors (analytes) is a multistage process, typically complicated by the paucity of data samples relative to the number of available predictors. Choosing an adequate validation strategy is key for drawing sound conclusions about the usefulness of the classifier. Other important decisions have to be made regarding the type of prediction model and training algorithm to be used, as well as the way in which the markers are selected. This chapter describes the principles of classifier development and underlines the most common pitfalls. A simulated dataset is used to illustrate the main concepts involved in supervised classification.
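A pitfall often discussed in this setting is selecting markers on the full dataset before cross-validating, which leaks test information into training. The sketch below is an illustration under stated assumptions and is not taken from the chapter: the standardized mean-difference score, the nearest-centroid classifier, and the helper names `top_k_genes`, `nearest_centroid_predict`, and `cv_accuracy` are all hypothetical choices. The point it shows is that marker selection is repeated inside each training fold.

```python
import numpy as np


def top_k_genes(X, y, k):
    """Rank predictors by |mean difference| scaled by the pooled within-class std."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    s = np.sqrt((X[y == 0].var(axis=0) + X[y == 1].var(axis=0)) / 2.0) + 1e-12
    return np.argsort(-np.abs(m1 - m0) / s)[:k]


def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the closer training centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)


def cv_accuracy(X, y, k_genes=3, n_folds=5, seed=0):
    """K-fold cross-validated accuracy with marker selection inside each fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    correct = 0
    for test_idx in np.array_split(idx, n_folds):
        train_idx = np.setdiff1d(idx, test_idx)
        # Selection sees only the training fold, never the held-out samples.
        genes = top_k_genes(X[train_idx], y[train_idx], k_genes)
        pred = nearest_centroid_predict(
            X[train_idx][:, genes], y[train_idx], X[test_idx][:, genes]
        )
        correct += int((pred == y[test_idx]).sum())
    return correct / len(y)
```

Here `X` is a samples-by-predictors array and `y` holds 0/1 class labels. Moving the `top_k_genes` call outside the loop would reproduce the leakage and tend to inflate the accuracy estimate, especially when predictors far outnumber samples.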
|