651
Zhu Z, Ong YS, Zurada JM. Identification of full and partial class relevant genes. IEEE/ACM Trans Comput Biol Bioinform 2010; 7:263-277. [PMID: 20431146 DOI: 10.1109/tcbb.2008.105] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 05/29/2023]
Abstract
Multiclass cancer classification on microarray data has made it feasible to diagnose all of the common malignancies in parallel. Using multiclass cancer feature selection approaches, it is now possible to identify genes relevant to a set of cancer types. However, besides identifying the relevant genes for the set of all cancer types, it would be more informative to biologists if the relevance of each gene to a specific cancer or subset of cancer types could be revealed. In this paper, we introduce two new definitions of multiclass relevancy features, i.e., full class relevant (FCR) and partial class relevant (PCR) features. In particular, FCR features are genes that serve as candidate biomarkers for discriminating all cancer types, whereas PCR features are genes that distinguish subsets of cancer types. Subsequently, a Markov blanket embedded memetic algorithm is proposed for the simultaneous identification of both FCR and PCR genes. Results obtained on commonly used synthetic and real-world microarray data sets show that the proposed approach converges to valid FCR and PCR genes that would assist biologists in their research work. The identification of both FCR and PCR genes is found to improve classification accuracy on many microarray data sets. A further comparison with existing state-of-the-art feature selection algorithms also reveals the effectiveness and efficiency of the proposed approach.
Affiliation(s)
- Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, 345 Administration Building, Shenzhen, China.
652
Jukic DM, Rao UNM, Kelly L, Skaf JS, Drogowski LM, Kirkwood JM, Panelli MC. MicroRNA profiling analysis of differences between the melanoma of young adults and older adults. J Transl Med 2010; 8:27. [PMID: 20302635 PMCID: PMC2855523 DOI: 10.1186/1479-5876-8-27] [Citation(s) in RCA: 85] [Impact Index Per Article: 5.7] [Received: 12/07/2009] [Accepted: 03/19/2010] [Indexed: 12/20/2022]
Abstract
BACKGROUND This study represents the first attempt to perform a profiling analysis of the intergenerational differences in the microRNAs (miRNAs) of primary cutaneous melanocytic neoplasms in young adult and older age groups. The data emphasize the importance of these master regulators in the transcriptional machinery of melanocytic neoplasms and suggest that differential levels of expression of these miRs may contribute to differences in phenotypic and pathologic presentation of melanocytic neoplasms at different ages. METHODS An exploratory miRNA analysis of 666 miRs by low density microRNA arrays was conducted on formalin fixed and paraffin embedded tissues (FFPE) from 10 older adults and 10 young adults including conventional melanoma and melanocytic neoplasms of uncertain biological significance. Age-matched benign melanocytic nevi were used as controls. RESULTS Primary melanoma in patients older than 60 years was characterized by the increased expression of miRs regulating the TLR-MyD88-NF-kappaB pathway (hsa-miR-199a), RAS/RAB22A pathway (hsa-miR-204); growth differentiation and migration (hsa-miR-337), epithelial mesenchymal transition (EMT) (let-7b, hsa-miR-10b/10b*), invasion and metastasis (hsa-miR-10b/10b*), hsa-miR-30a/e*, hsa-miR-29c*; cellular matrix components (hsa-miR-29c*); invasion-cytokinesis (hsa-miR-99b*) compared to melanoma of younger patients. MiR-211 was dramatically downregulated compared to nevi controls, decreased with increasing age and was among the miRs linked to metastatic processes. Melanoma in young adult patients had increased expression of hsa-miR-449a and decreased expression of hsa-miR-146b, hsa-miR-214*. MiR-30a* in clinical stages I-II adult and pediatric melanoma could predict classification of melanoma tissue in the two extremes of age groups.
Although the number of cases is small, positive lymph node status in the two age groups was characterized by the statistically significant expression of hsa-miR-30a* and hsa-miR-204 (F-test, p-value < 0.001). CONCLUSIONS Our findings, although preliminary, support the notion that the differential biology of melanoma at the extremes of age is driven, in part, by deregulation of microRNA expression and by fine tuning of miRs that are already known to regulate cell cycle, inflammation, Epithelial-Mesenchymal Transition (EMT)/stroma and more specifically genes known to be altered in melanoma. Our analysis reveals that miR expression differences create unique patterns of frequently affected biological processes that clearly distinguish old age from young age melanomas. This is a novel characterization of the miRnomes of melanocytic neoplasms at two extremes of age and identifies potential diagnostic and clinico-pathologic biomarkers that may serve as novel miR-based targeted modalities in melanoma diagnosis and treatment.
Affiliation(s)
- Drazen M Jukic
- University of Pittsburgh Cancer Institute, Division of Hematology-Oncology Hillman Cancer Center, Pittsburgh, Pennsylvania, USA.
653
Abstract
The performance of many repeated tasks improves with experience and practice. This improvement tends to be rapid initially and then decreases. The term "learning curve" is often used to describe the phenomenon. In supervised machine learning, the performance of classification algorithms often increases with the number of observations used to train the algorithm. We use progressively larger samples of observations to train the algorithm and then plot performance against the number of training observations. This yields the familiar negatively accelerating learning curve. To quantify the learning curve, we fit inverse power law models to the progressively sampled data. We fit such learning curves to four large clinical cancer genomic datasets, using three classifiers (diagonal linear discriminant analysis, K-nearest-neighbor with three neighbors, and support vector machines) and four values for the number of top genes included (5, 50, 500, 5,000). The inverse power law models fit the progressively sampled data reasonably well and showed considerable diversity when multiple classifiers are applied to the same data. Some classifiers showed rapid and continued increase in performance as the number of training samples increased, while others showed little if any improvement. Assessing classifier efficiency is particularly important in genomic studies since samples are so expensive to obtain. It is important to employ an algorithm that uses the predictive information efficiently, but with a modest number of training samples (>50), learning curves can be used to assess the predictive efficiency of classification algorithms.
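The curve-fitting step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes the three-parameter form err(n) = a + b * n^(-c) (the abstract does not state the exact parameterization) and uses a grid search over c with closed-form least squares for a and b. The function name and the synthetic data are hypothetical.

```python
def fit_inverse_power_law(sizes, errors, c_grid=None):
    """Fit err(n) ~ a + b * n**(-c); returns (a, b, c, sse).

    Assumed model form. For each candidate c the model is linear in
    (a, b), so a and b come from ordinary least squares; the c with
    the smallest sum of squared errors wins.
    """
    if c_grid is None:
        c_grid = [k / 100 for k in range(1, 301)]  # c in 0.01 .. 3.00
    m = len(sizes)
    y_mean = sum(errors) / m
    best = None
    for c in c_grid:
        x = [n ** (-c) for n in sizes]
        x_mean = sum(x) / m
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        if sxx == 0:
            continue  # degenerate design, skip this c
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, errors))
        b = sxy / sxx
        a = y_mean - b * x_mean
        sse = sum((a + b * xi - yi) ** 2 for xi, yi in zip(x, errors))
        if best is None or sse < best[3]:
            best = (a, b, c, sse)
    return best


# Synthetic example: error rates generated from a known curve (not real data).
sizes = [10, 20, 40, 80, 160, 320]
errors = [0.1 + 0.8 * n ** -0.5 for n in sizes]
a, b, c, sse = fit_inverse_power_law(sizes, errors)
```

In this parameterization, a is the asymptotic error as the training set grows, b scales the initial gap, and c is the learning rate; comparing fitted (a, b, c) across classifiers is one way to read the "diversity" the abstract mentions.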
Affiliation(s)
- Kenneth R Hess
- Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, TX, 77030, USA.
654
Sboner A, Demichelis F, Calza S, Pawitan Y, Setlur SR, Hoshida Y, Perner S, Adami HO, Fall K, Mucci LA, Kantoff PW, Stampfer M, Andersson SO, Varenhorst E, Johansson JE, Gerstein MB, Golub TR, Rubin MA, Andrén O. Molecular sampling of prostate cancer: a dilemma for predicting disease progression. BMC Med Genomics 2010; 3:8. [PMID: 20233430 PMCID: PMC2855514 DOI: 10.1186/1755-8794-3-8] [Citation(s) in RCA: 182] [Impact Index Per Article: 12.1] [Received: 11/05/2009] [Accepted: 03/16/2010] [Indexed: 01/16/2023]
Abstract
BACKGROUND Current prostate cancer prognostic models are based on pre-treatment prostate specific antigen (PSA) levels, biopsy Gleason score, and clinical staging but in practice are inadequate to accurately predict disease progression. Hence, we sought to develop a molecular panel for prostate cancer progression by reasoning that molecular profiles might further improve current clinical models. METHODS We analyzed a Swedish Watchful Waiting cohort with up to 30 years of clinical follow up using a novel method for gene expression profiling. This cDNA-mediated annealing, selection, ligation, and extension (DASL) method enabled the use of formalin-fixed paraffin-embedded transurethral resection of prostate (TURP) samples taken at the time of the initial diagnosis. We determined the expression profiles of 6100 genes for 281 men divided into two extreme groups: men who died of prostate cancer and men who survived more than 10 years without metastases (lethals and indolents, respectively). Several statistical and machine learning models using clinical and molecular features were evaluated for their ability to distinguish lethal from indolent cases. RESULTS Surprisingly, none of the predictive models using molecular profiles significantly improved over models using clinical variables only. Additional computational analysis confirmed that molecular heterogeneity within both the lethal and indolent classes is widespread in prostate cancer as compared to other types of tumors. CONCLUSIONS The determination of the molecularly dominant tumor nodule may be limited by sampling at time of initial diagnosis, may not be present at time of initial diagnosis, or may occur as the disease progresses, making the development of molecular biomarkers for prostate cancer progression challenging.
Affiliation(s)
- Andrea Sboner
- Department of Pathology and Laboratory Medicine, Weill Cornell Medical Center, New York, NY, USA
655
Joseph SJ, Robbins KR, Zhang W, Rekaya R. Comparison of two output-coding strategies for multi-class tumor classification using gene expression data and Latent Variable Model as binary classifier. Cancer Inform 2010; 9:39-48. [PMID: 20458360 PMCID: PMC2865770 DOI: 10.4137/cin.s3827] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022]
Abstract
Multi-class cancer classification based on microarray data is described. A generalized output-coding scheme based on One Versus One (OVO) combined with a Latent Variable Model (LVM) is used. Results from the proposed OVO output-coding strategy are compared with the results obtained from the generalized One Versus All (OVA) method, and the efficiency of each for multi-class tumor classification has been studied. This comparative study was done using two microarray gene expression datasets: the Global Cancer Map (GCM) dataset and a brain cancer (BC) dataset. Primary feature selection was based on fold change and penalized t-statistics. Evaluation was conducted with varying feature numbers. The OVO coding strategy worked quite well with the BC data, while OVO and OVA results seemed to be similar for the GCM data. The selection of output coding methods for combining binary classifiers for multi-class tumor classification depends on the number of tumor types considered, the discrepancies between the tumor samples used for training, as well as the heterogeneity of expression within the cancer subtypes used as training data.
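The two output-coding strategies compared in this abstract can be illustrated with a small sketch. The binary learner here is a nearest-centroid rule used purely as a stand-in; it is not the Latent Variable Model of the paper, and all function names and the toy data are hypothetical.

```python
from itertools import combinations


def centroid(rows):
    """Feature-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train_binary(pos, neg):
    """Stand-in binary learner (assumption, not the paper's LVM).

    Returns a score function that is positive when x lies closer to
    the positive-class centroid than to the negative-class centroid.
    """
    cp, cn = centroid(pos), centroid(neg)
    return lambda x: sq_dist(x, cn) - sq_dist(x, cp)


def fit_ovo(X, y):
    """One Versus One: one binary model per unordered pair of classes."""
    classes = sorted(set(y))
    models = {}
    for a, b in combinations(classes, 2):
        pos = [x for x, t in zip(X, y) if t == a]
        neg = [x for x, t in zip(X, y) if t == b]
        models[(a, b)] = train_binary(pos, neg)
    return classes, models


def predict_ovo(fitted, x):
    """Majority vote over pairwise models; ties go to the first class in sorted order."""
    classes, models = fitted
    votes = {c: 0 for c in classes}
    for (a, b), score in models.items():
        votes[a if score(x) > 0 else b] += 1
    return max(classes, key=lambda c: votes[c])


def fit_ova(X, y):
    """One Versus All: one binary model per class against the rest."""
    classes = sorted(set(y))
    models = {}
    for c in classes:
        pos = [x for x, t in zip(X, y) if t == c]
        neg = [x for x, t in zip(X, y) if t != c]
        models[c] = train_binary(pos, neg)
    return classes, models


def predict_ova(fitted, x):
    """Pick the class whose one-vs-rest model gives the highest score."""
    classes, models = fitted
    return max(classes, key=lambda c: models[c](x))


# Toy data: three well-separated clusters (not real expression data).
X = [[0, 0], [1, 0], [0, 1], [10, 0], [11, 0], [10, 1], [0, 10], [1, 10], [0, 11]]
y = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
ovo, ova = fit_ovo(X, y), fit_ova(X, y)
```

With K classes, OVO trains K(K-1)/2 models on pairwise subsets while OVA trains K models on the full data, which is one reason their relative performance can depend on the number of tumor types and on class heterogeneity.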
Affiliation(s)
- Sandeep J Joseph
- Rhodes Centre for Animal and Dairy Science, University of Georgia, Athens, GA 30605, USA
656
Pang H, Tong T, Zhao H. Shrinkage-based diagonal discriminant analysis and its applications in high-dimensional data. Biometrics 2010; 65:1021-9. [PMID: 19302409 DOI: 10.1111/j.1541-0420.2009.01200.x] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 11/27/2022]
Abstract
High-dimensional data such as microarrays have brought us new statistical challenges. For example, using a large number of genes to classify samples based on a small number of microarrays remains a difficult problem. Diagonal discriminant analysis, support vector machines, and k-nearest neighbor have been suggested as among the best methods for small sample size situations, but none was found to be superior to others. In this article, we propose an improved diagonal discriminant approach through shrinkage and regularization of the variances. The performance of our new approach along with the existing methods is studied through simulations and applications to real data. These studies show that the proposed shrinkage-based and regularization diagonal discriminant methods have lower misclassification rates than existing methods in many cases.
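The idea of shrinking variances inside a diagonal discriminant rule can be sketched as below. This is an illustration under stated assumptions, not the estimator proposed in the paper: each pooled per-gene variance is pulled toward the average variance by a fixed weight `lam`, which is a simple convex-combination shrinkage chosen for clarity. Function names and the toy data are hypothetical.

```python
import math


def fit_shrunken_dlda(X, y, lam=0.5):
    """Diagonal LDA with per-feature variances shrunk toward their mean.

    Assumed shrinkage: var_j <- (1 - lam) * var_j + lam * mean(var).
    Requires more samples than classes so the pooled variance is defined.
    """
    classes = sorted(set(y))
    n, p = len(X), len(X[0])
    means, priors = {}, {}
    for c in classes:
        rows = [x for x, t in zip(X, y) if t == c]
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]
        priors[c] = len(rows) / n
    # Pooled within-class variance for each feature.
    pooled = [0.0] * p
    for x, t in zip(X, y):
        for j in range(p):
            pooled[j] += (x[j] - means[t][j]) ** 2
    pooled = [v / (n - len(classes)) for v in pooled]
    target = sum(pooled) / p
    shrunk = [(1 - lam) * v + lam * target for v in pooled]
    return classes, means, priors, shrunk


def predict_shrunken_dlda(model, x):
    """Assign x to the class with the largest diagonal discriminant score."""
    classes, means, priors, var = model

    def score(c):
        d = sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, means[c], var))
        return -0.5 * d + math.log(priors[c])

    return max(classes, key=score)


# Toy data: the second feature is constant, so its raw variance is zero.
X = [[0.0, 1.0], [1.0, 1.0], [0.5, 1.0], [5.0, 1.0], [6.0, 1.0], [5.5, 1.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_shrunken_dlda(X, y, lam=0.5)
```

The toy data shows one practical motivation: with few samples a per-gene variance estimate can be zero or very small, which makes the unshrunk rule unstable or undefined, whereas the shrunk variance stays positive as long as at least one feature varies and `lam` is greater than zero.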
Affiliation(s)
- Herbert Pang
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina 27705, USA.
657
Bandyopadhyay N, Kahveci T, Goodison S, Sun Y, Ranka S. Pathway-Based Feature Selection Algorithm for Cancer Microarray Data. Adv Bioinformatics 2010; 2009:532989. [PMID: 20204186 PMCID: PMC2831238 DOI: 10.1155/2009/532989] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 08/24/2009] [Accepted: 11/30/2009] [Indexed: 01/09/2023]
Abstract
Classification of cancers based on gene expression produces better accuracy than classification based on clinical markers. Feature selection improves the accuracy of these classification algorithms by reducing the chance of overfitting that arises from the large number of features. We develop a new feature selection method called Biological Pathway-based Feature Selection (BPFS) for microarray data. Unlike most of the existing methods, our method integrates signaling and gene regulatory pathways with gene expression data to minimize the chance of overfitting and to improve the test accuracy. Thus, BPFS selects a biologically meaningful feature set that is minimally redundant. Our experiments on published breast cancer datasets demonstrate that all of the top 20 genes found by our method are associated with cancer. Furthermore, the classification accuracy of our signature is up to 18% better than that of van't Veer's 70-gene signature, and up to 8% better than that of the best published feature selection method, I-RELIEF.
Affiliation(s)
- Nirmalya Bandyopadhyay
- Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
- Tamer Kahveci
- Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
- Steve Goodison
- Anderson Cancer Center Orlando, Cancer Research Institute Orlando, FL 32827, USA
- Y. Sun
- Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL 32611, USA
- Sanjay Ranka
- Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
658
Das M, Reichman JR, Haberer G, Welzl G, Aceituno FF, Mader MT, Watrud LS, Pfleeger TG, Gutiérrez RA, Schäffner AR, Olszyk DM. A composite transcriptional signature differentiates responses towards closely related herbicides in Arabidopsis thaliana and Brassica napus. Plant Mol Biol 2010; 72:545-56. [PMID: 20043233 PMCID: PMC2816244 DOI: 10.1007/s11103-009-9590-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 09/23/2009] [Accepted: 12/10/2009] [Indexed: 05/04/2023]
Abstract
In this study, genome-wide expression profiling based on Affymetrix ATH1 arrays was used to identify discriminating responses of Arabidopsis thaliana to five herbicides, which contain active ingredients targeting two different branches of amino acid biosynthesis. One herbicide contained glyphosate, which targets 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS), while the other four herbicides contain different acetolactate synthase (ALS) inhibiting compounds. In contrast to the herbicide containing glyphosate, which affected only a few transcripts, many effects of the ALS inhibiting herbicides were revealed based on transcriptional changes related to ribosome biogenesis and translation, secondary metabolism, cell wall modification and growth. The expression pattern of a set of 101 genes provided a specific, composite signature that was distinct from other major stress responses and differentiated among herbicides targeting the same enzyme (ALS) or containing the same chemical class of active ingredient (sulfonylurea). A set of homologous genes could be identified in Brassica napus that exhibited a similar expression pattern and correctly distinguished exposure to the five herbicides. Our results show the ability of a limited number of genes to classify and differentiate responses to closely related herbicides in A. thaliana and B. napus and the transferability of a complex transcriptional signature across species.
Affiliation(s)
- Malay Das
- National Health and Environmental Effects Research Laboratory, Western Ecology Division, U.S. Environmental Protection Agency, Office of Research and Development, Corvallis, OR 97333 USA
- Jay R. Reichman
- National Health and Environmental Effects Research Laboratory, Western Ecology Division, U.S. Environmental Protection Agency, Office of Research and Development, Corvallis, OR 97333 USA
- Georg Haberer
- Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, 85764 Neuherberg, Germany
- Gerhard Welzl
- Institute of Developmental Genetics, Helmholtz Zentrum München, German Research Center for Environmental Health, 85764 Neuherberg, Germany
- Felipe F. Aceituno
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Santiago, Chile
- Michael T. Mader
- Institute of Stem Cell Research, Helmholtz Zentrum München, German Research Center for Environmental Health, 85764 Neuherberg, Germany
- Lidia S. Watrud
- National Health and Environmental Effects Research Laboratory, Western Ecology Division, U.S. Environmental Protection Agency, Office of Research and Development, Corvallis, OR 97333 USA
- Thomas G. Pfleeger
- National Health and Environmental Effects Research Laboratory, Western Ecology Division, U.S. Environmental Protection Agency, Office of Research and Development, Corvallis, OR 97333 USA
- Rodrigo A. Gutiérrez
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Santiago, Chile
- Anton R. Schäffner
- Institute of Biochemical Plant Pathology, Helmholtz Zentrum München, German Research Center for Environmental Health, 85764 Neuherberg, Germany
- David M. Olszyk
- National Health and Environmental Effects Research Laboratory, Western Ecology Division, U.S. Environmental Protection Agency, Office of Research and Development, Corvallis, OR 97333 USA
659
Qiao X, Zhang HH, Liu Y, Todd MJ, Marron JS. Weighted Distance Weighted Discrimination and Its Asymptotic Properties. J Am Stat Assoc 2010; 105:401-414. [PMID: 21152360 PMCID: PMC2996856 DOI: 10.1198/jasa.2010.tm08487] [Citation(s) in RCA: 70] [Impact Index Per Article: 4.7] [Indexed: 11/21/2022]
Abstract
While Distance Weighted Discrimination (DWD) is an appealing approach to classification in high dimensions, it was designed for balanced datasets. In the case of unequal costs, biased sampling, or unbalanced data, there are major improvements available, using appropriately weighted versions of DWD (wDWD). A major contribution of this paper is the development of optimal weighting schemes for various nonstandard classification problems. In addition, we discuss several alternative criteria and propose an adaptive weighting scheme (awDWD) and demonstrate its advantages over nonadaptive weighting schemes under some situations. The second major contribution is a theoretical study of weighted DWD. Both high-dimensional low sample-size asymptotics and Fisher consistency of DWD are studied. The performance of weighted DWD is evaluated using simulated examples and two real data examples. The theoretical results are also confirmed by simulations.
Affiliation(s)
- Xingye Qiao
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599
- Hao Helen Zhang
- Department of Statistics, North Carolina State University, Raleigh, NC 27695
- Yufeng Liu
- Department of Statistics and Operations Research, Carolina Center for Genome Sciences, University of North Carolina, Chapel Hill, NC 27599
- Michael J. Todd
- School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14853
- J. S. Marron
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599
660
Use of DNA-damaging agents and RNA pooling to assess expression profiles associated with BRCA1 and BRCA2 mutation status in familial breast cancer patients. PLoS Genet 2010; 6:e1000850. [PMID: 20174566 PMCID: PMC2824809 DOI: 10.1371/journal.pgen.1000850] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/27/2009] [Accepted: 01/19/2010] [Indexed: 01/12/2023]
Abstract
A large number of rare sequence variants of unknown clinical significance have been identified in the breast cancer susceptibility genes, BRCA1 and BRCA2. Laboratory-based methods that can distinguish between carriers of pathogenic mutations and non-carriers are likely to have utility for the classification of these sequence variants. To identify predictors of pathogenic mutation status in familial breast cancer patients, we explored the use of gene expression arrays to assess the effect of two DNA–damaging agents (irradiation and mitomycin C) on cellular response in relation to BRCA1 and BRCA2 mutation status. A range of regimes was used to treat 27 lymphoblastoid cell-lines (LCLs) derived from affected women in high-risk breast cancer families (nine BRCA1, nine BRCA2, and nine non-BRCA1/2 or BRCAX individuals) and nine LCLs from healthy individuals. Using an RNA–pooling strategy, we found that treating LCLs with 1.2 µM mitomycin C and measuring the gene expression profiles 1 hour post-treatment had the greatest potential to discriminate BRCA1, BRCA2, and BRCAX mutation status. A classifier was built using the expression profile of nine QRT–PCR validated genes that were associated with BRCA1, BRCA2, and BRCAX status in RNA pools. These nine genes could distinguish BRCA1 from BRCA2 carriers with 83% accuracy in individual samples, but three-way analysis for BRCA1, BRCA2, and BRCAX had a maximum of 59% prediction accuracy. Our results suggest that, compared to BRCA1 and BRCA2 mutation carriers, non-BRCA1/2 (BRCAX) individuals are genetically heterogeneous. This study also demonstrates the effectiveness of RNA pools to compare the expression profiles of cell-lines from BRCA1, BRCA2, and BRCAX cases after treatment with irradiation and mitomycin C as a method to prioritize treatment regimes for detailed downstream expression analysis. 
A large number of rare sequence variants of unknown clinical significance have been identified in the breast cancer susceptibility genes, BRCA1 and BRCA2. Laboratory methods to identify which of these variants are mutations would have utility for counseling and clinical decision making when identified in patients with a family history of breast cancer. We used DNA–damaging agents to disturb gene expression profiles of cell-lines derived from blood of patients, and we compared patterns from women with BRCA1 and BRCA2 mutations to those from women in familial breast cancer families without such mutations. Using a pooling strategy, which allowed us to compare several treatments at one time, we identified which treatment caused the greatest difference in gene-expression changes between patient groups and used this treatment method for further study. We were able to accurately classify BRCA1 and BRCA2 samples, and our results supported other reported findings that suggested familial breast cancer patients without BRCA1/2 mutations are genetically heterogeneous. We demonstrate a useful strategy to identify treatments that induce gene expression differences associated with BRCA1/2 mutation status. This strategy may aid the development of a molecular-based tool to screen individuals from multi-case breast cancer families for the presence of pathogenic mutations.
661
Mapping multi-class cancers and clinical outcomes prediction for multiple classifications of microarray gene expression data. Korean J Chem Eng 2010. [DOI: 10.1007/s11814-009-0161-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/19/2022]
662
Mattila P, Renkonen J, Toppila-Salmi S, Parviainen V, Joenväärä S, Alff-Tuomala S, Nicorici D, Renkonen R. Time-series nasal epithelial transcriptomics during natural pollen exposure in healthy subjects and allergic patients. Allergy 2010; 65:175-83. [PMID: 19804444 DOI: 10.1111/j.1398-9995.2009.02181.x] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.9] [Indexed: 11/30/2022]
Abstract
BACKGROUND The role of epithelium has recently awakened interest in the studies of type I hypersensitivity. OBJECTIVE We analysed the nasal epithelial transcriptomic response to natural birch pollen exposure in a time series manner. METHODS Human nasal epithelial cell swabs were collected from birch pollen allergic patients and healthy controls in the winter season. In addition, four specimens at weekly intervals were collected from the same subjects during natural birch pollen exposure in spring and transcriptomic analyses were performed. RESULTS The nasal epithelium of healthy subjects responded vigorously to allergen exposure. The immune response was a dominating category of this response. Notably, the healthy subjects did not display any clinical symptoms despite this response detected by transcriptomic analysis. Concomitantly, the epithelium of allergic subjects also responded, but with a different set of responders. In allergic patients the regulation of dyneins, the molecular motors of intracellular transport, dominated. This further supports our previous hypothesis that birch pollen exposure results in an active uptake of allergen into the epithelium only in allergic subjects but not in healthy controls. CONCLUSION We showed that birch pollen allergen causes a defence response in healthy subjects, but not in allergic subjects. Instead, allergic patients actively transport pollen allergen through the epithelium to tissue mast cells. Our study showed that new hypotheses can arise from the application of discovery driven methodologies. To understand complex multifactorial diseases, such as type I hypersensitivity, hypotheses of this kind might be worth further analysis.
663
Yanofsky CM, Bickel DR. Validation of differential gene expression algorithms: application comparing fold-change estimation to hypothesis testing. BMC Bioinformatics 2010; 11:63. [PMID: 20109217 PMCID: PMC3224549 DOI: 10.1186/1471-2105-11-63] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 02/13/2009] [Accepted: 01/28/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND Sustained research on the problem of determining which genes are differentially expressed on the basis of microarray data has yielded a plethora of statistical algorithms, each justified by theory, simulation, or ad hoc validation and yet differing in practical results from equally justified algorithms. Recently, a concordance method that measures agreement among gene lists has been introduced to assess various aspects of differential gene expression detection. This method has the advantage of basing its assessment solely on the results of real data analyses, but as it requires examining gene lists of given sizes, it may be unstable. RESULTS Two methodologies for assessing predictive error are described: a cross-validation method and a posterior predictive method. As a nonparametric method of estimating prediction error from observed expression levels, cross validation provides an empirical approach to assessing algorithms for detecting differential gene expression that is fully justified for large numbers of biological replicates. Because it leverages the knowledge that only a small portion of genes are differentially expressed, the posterior predictive method is expected to provide more reliable estimates of algorithm performance, allaying concerns about limited biological replication. In practice, the posterior predictive method can assess when its approximations are valid and when they are inaccurate. Under conditions in which its approximations are valid, it corroborates the results of cross validation. Both comparison methodologies are applicable to both single-channel and dual-channel microarrays. For the data sets considered, estimating prediction error by cross validation demonstrates that empirical Bayes methods based on hierarchical models tend to outperform algorithms based on selecting genes by their fold changes or by non-hierarchical model-selection criteria. (The latter two approaches have comparable performance.)
The posterior predictive assessment corroborates these findings. CONCLUSIONS Algorithms for detecting differential gene expression may be compared by estimating each algorithm's error in predicting expression ratios, whether such ratios are defined across microarray channels or between two independent groups. According to two distinct estimators of prediction error, algorithms using hierarchical models outperform the other algorithms of the study. The fact that fold-change shrinkage performed as well as conventional model selection criteria calls for investigating algorithms that combine the strengths of significance testing and fold-change estimation.
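The cross-validation idea in this abstract, scoring an algorithm by its error in predicting held-out expression ratios, can be sketched generically. The following is a plain leave-one-out scheme over replicates, assumed for illustration and not the paper's exact procedure; the two estimators (a raw mean log ratio and a toy version that halves it) are hypothetical stand-ins for the algorithms compared in the study.

```python
def loo_prediction_error(genes, estimator):
    """Mean squared leave-one-out error in predicting a held-out log ratio.

    genes: list of per-gene replicate lists (each with at least 2 values).
    estimator: function mapping the training replicates to a prediction.
    """
    total, count = 0.0, 0
    for replicates in genes:
        for i, held_out in enumerate(replicates):
            training = replicates[:i] + replicates[i + 1:]
            total += (estimator(training) - held_out) ** 2
            count += 1
    return total / count


def mean_log_ratio(values):
    """Unshrunk fold-change estimate: the average log ratio."""
    return sum(values) / len(values)


def shrunken_log_ratio(values, weight=0.5):
    """Toy shrinkage toward zero (assumed fixed weight, for illustration only)."""
    return weight * mean_log_ratio(values)


# Synthetic log ratios (not real data): one null gene, one strongly changed gene.
genes = [[0.3, -0.3, 0.1, -0.1], [2.0, 2.2, 1.8, 2.0]]
err_mean = loo_prediction_error(genes, mean_log_ratio)
err_shrunk = loo_prediction_error(genes, shrunken_log_ratio)
```

On a gene with no true change, shrinking toward zero lowers the held-out error, while on a strongly changed gene a fixed shrinkage weight raises it. Data-adaptive shrinkage, as in hierarchical models, is one way to balance the two cases.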
Affiliation(s)
- Corey M Yanofsky
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology, and Immunology, University of Ottawa, Ottawa, Ontario, Canada
664
Kim SK, Kim EJ, Leem SH, Ha YS, Kim YJ, Kim WJ. Identification of S100A8-correlated genes for prediction of disease progression in non-muscle invasive bladder cancer. BMC Cancer 2010; 10:21. [PMID: 20096140 PMCID: PMC2828413 DOI: 10.1186/1471-2407-10-21] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 09/09/2009] [Accepted: 01/25/2010] [Indexed: 11/12/2022]
Abstract
Background S100 calcium binding protein A8 (S100A8) has been implicated as a prognostic indicator in several types of cancer. However, previous studies are limited in their ability to predict the clinical behavior of the cancer. Here, we sought to identify a molecular signature based on S100A8 expression and to assess its usefulness as a prognostic indicator of disease progression in non-muscle invasive bladder cancer (NMIBC). Methods We used 103 primary NMIBC specimens for microarray gene expression profiling. The median follow-up period for all patients was 57.6 months (range: 3.2 to 137.0 months). Various statistical methods, including the leave-one-out cross validation method, were applied to identify a gene expression signature able to predict the likelihood of progression. The prognostic value of the gene expression signature was validated in an independent cohort (n = 302). Results Kaplan-Meier estimates revealed significant differences in disease progression associated with the expression signature of S100A8-correlated genes (log-rank test, P < 0.001). Multivariate Cox regression analysis revealed that the expression signature of S100A8-correlated genes was a strong predictor of disease progression (hazard ratio = 15.225, 95% confidence interval = 1.746 to 133.52, P = 0.014). We validated our results in an independent cohort and confirmed that this signature produced consistent prediction patterns. Finally, gene network analyses of the signature revealed that S100A8, IL1B, and S100A9 could be important mediators of the progression of NMIBC. Conclusions The prognostic molecular signature defined by S100A8-correlated genes represents a promising diagnostic tool for the identification of NMIBC patients that have a high risk of progression to muscle invasive bladder cancer.
Affiliation(s)
- Seon-Kyu Kim
- Department of Urology, College of Medicine, Chungbuk National University, Cheongju, Chungbuk, South Korea
|
665
|
Yang Y, Kort EJ, Ebrahimi N, Zhang Z, Teh BT. Dual KS: Defining Gene Sets with Tissue Set Enrichment Analysis. Cancer Inform 2010; 9:1-9. [PMID: 20148167 PMCID: PMC2816930 DOI: 10.4137/cin.s2892] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Gene set enrichment analysis (GSEA) is an analytic approach which simultaneously reduces the dimensionality of microarray data and enables ready inference of the biological meaning of observed gene expression patterns. Here we invert the GSEA process to identify class-specific gene signatures. Because our approach uses the Kolmogorov-Smirnov approach both to define class-specific signatures and to classify samples using those signatures, we have termed this methodology "Dual-KS" (DKS). RESULTS The optimum gene signature identified by the DKS algorithm was smaller than those identified by the other methods to which it was compared in 5 out of 10 datasets. The estimated error rate of DKS using the optimum gene signature was smaller than the estimated error rate of the random forest method in 4 out of the 10 datasets, and was equivalent in two additional datasets. DKS performance relative to other benchmarked algorithms was similar to its performance relative to random forests. CONCLUSIONS DKS is an efficient analytic methodology that can identify highly parsimonious gene signatures useful for classification in the context of microarray studies. The algorithm is available as the dualKS package for R as part of the Bioconductor project.
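As a rough illustration of the Kolmogorov-Smirnov scoring idea that GSEA-style methods rely on, the sketch below computes an unweighted running-sum enrichment score for a gene set in a ranked gene list. It is not the dualKS implementation; the function name `ks_enrichment_score` and the gene labels are hypothetical, and the unweighted form of the statistic is an assumption.

```python
def ks_enrichment_score(ranked_genes, gene_set):
    """Unweighted KS-style enrichment score (illustrative assumption).

    Walks down the ranked list and returns the largest value of
    (fraction of set genes seen so far) - (fraction of other genes seen so far).
    """
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene_set must contain some, but not all, ranked genes")

    hits_seen = 0
    misses_seen = 0
    best = 0.0
    for gene in ranked_genes:
        if gene in gene_set:
            hits_seen += 1
        else:
            misses_seen += 1
        best = max(best, hits_seen / n_hits - misses_seen / n_miss)
    return best


# Hypothetical genes, ordered from most to least up-regulated in one sample.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]

top_score = ks_enrichment_score(ranked, {"g1", "g2"})     # set sits at the top: 1.0
spread_score = ks_enrichment_score(ranked, {"g1", "g6"})  # set is split: 0.5
```

A signature whose genes cluster at the top of a sample's ranking scores close to 1, which is the property a KS-based classifier can use to assign the sample to the class with the best-scoring signature.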
Affiliation(s)
- Yarong Yang
- These authors contributed equally to this work
|
666
|
Yang P, Zhou BB, Zhang Z, Zomaya AY. A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data. BMC Bioinformatics 2010; 11 Suppl 1:S5. [PMID: 20122224 PMCID: PMC3009522 DOI: 10.1186/1471-2105-11-s1-s5] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2023] Open
Abstract
Background Feature selection techniques are critical to the analysis of high dimensional datasets. This is especially true in gene selection from microarray data, which commonly have an extremely high feature-to-sample ratio. In addition to the essential objectives such as reducing data noise, reducing data redundancy, improving sample classification accuracy, and improving model generalization, feature selection also helps biologists to focus on the selected genes to further validate their biological hypotheses. Results In this paper we describe an improved hybrid system for gene selection. It is based on a recently proposed genetic ensemble (GE) system. To enhance the generalization property of the selected genes or gene subsets and to overcome the overfitting problem of the GE system, we devised a mapping strategy to fuse the goodness information of each gene provided by multiple filtering algorithms. This information is then used for the initialization and mutation operations of the genetic ensemble system. Conclusion We used four benchmark microarray datasets (including both binary-class and multi-class classification problems) for proof of concept and model evaluation. The experimental results indicate that the proposed multi-filter enhanced genetic ensemble (MF-GE) system is able to improve sample classification accuracy, generate more compact gene subsets, and converge to the selection results more quickly. The MF-GE system is very flexible as various combinations of multiple filters and classifiers can be incorporated based on the data characteristics and the user preferences.
Affiliation(s)
- Pengyi Yang
- School of Information Technologies (J12), The University of Sydney, NSW 2006, Australia.
|
667
|
López-Pintado S, Romo J, Torrente A. Robust depth-based tools for the analysis of gene expression data. Biostatistics 2010; 11:254-64. [PMID: 20064844 DOI: 10.1093/biostatistics/kxp056] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Microarray experiments provide data on the expression levels of thousands of genes and, therefore, statistical methods applicable to the analysis of such high-dimensional data are needed. In this paper, we propose robust nonparametric tools for the description and analysis of microarray data based on the concept of functional depth, which measures the centrality of an observation within a sample. We show that this concept can be easily adapted to high-dimensional observations and, in particular, to gene expression data. This allows the development of the following depth-based inference tools: (1) a scale curve for measuring and visualizing the dispersion of a set of points, (2) a rank test for deciding if 2 groups of multidimensional observations come from the same population, and (3) supervised classification techniques for assigning a new sample to one of G given groups. We apply these methods to microarray data, and to simulated data including contaminated models, and show that they are robust, efficient, and competitive with other procedures proposed in the literature, outperforming them in some situations.
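To make the notion of depth as "centrality within a sample" concrete, here is a minimal sketch of one depth of this kind, a modified band depth built from bands of two curves. The abstract does not say which depth is used, so this choice, the function name `modified_band_depth`, and the toy profiles are assumptions made purely for illustration.

```python
from itertools import combinations


def modified_band_depth(curves):
    """Modified band depth with bands formed by pairs of curves (assumed variant).

    For each curve, average over all pairs of sample curves the proportion of
    coordinates at which the curve lies inside the band spanned by the pair.
    """
    n_points = len(curves[0])
    pairs = list(combinations(range(len(curves)), 2))
    depths = []
    for x in curves:
        total = 0.0
        for i, j in pairs:
            inside = sum(
                min(curves[i][t], curves[j][t]) <= x[t] <= max(curves[i][t], curves[j][t])
                for t in range(n_points)
            )
            total += inside / n_points
        depths.append(total / len(pairs))
    return depths


# Five hypothetical expression profiles measured on three genes.
profiles = [[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
depths = modified_band_depth(profiles)   # approximately [0.4, 0.7, 0.8, 0.7, 0.4]
deepest = depths.index(max(depths))      # index 2, the central profile
```

Ordering observations from the deepest outward gives the center-outward ranking on which tools such as scale curves, rank tests, and depth-based classifiers can be built.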
Affiliation(s)
- Sara López-Pintado
- Departamento de Economía, Métodos Cuantitativos e Historia Económica, Universidad Pablo de Olavide, Seville, Spain.
|
668
|
Popovici V, Chen W, Gallas BG, Hatzis C, Shi W, Samuelson FW, Nikolsky Y, Tsyganova M, Ishkin A, Nikolskaya T, Hess KR, Valero V, Booser D, Delorenzi M, Hortobagyi GN, Shi L, Symmans WF, Pusztai L. Effect of training-sample size and classification difficulty on the accuracy of genomic predictors. Breast Cancer Res 2010; 12:R5. [PMID: 20064235 PMCID: PMC2880423 DOI: 10.1186/bcr2468] [Citation(s) in RCA: 139] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2009] [Revised: 12/18/2009] [Accepted: 01/11/2010] [Indexed: 12/31/2022] Open
Abstract
Introduction As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. Methods We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. Results A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. Conclusions We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
Affiliation(s)
- Vlad Popovici
- Bioinformatics Core Facility, Swiss Institute of Bioinformatics, Génopode Building, Quartier Sorge, Lausanne CH-1015, Switzerland
|
669
|
Yao B, Li S. ANMM4CBR: a case-based reasoning method for gene expression data classification. Algorithms Mol Biol 2010; 5:14. [PMID: 20051140 PMCID: PMC2843690 DOI: 10.1186/1748-7188-5-14] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2009] [Accepted: 01/06/2010] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND Accurate classification of microarray data is critical for successful clinical diagnosis and treatment. The "curse of dimensionality" problem and noise in the data, however, undermine the performance of many algorithms. METHOD In order to obtain a robust classifier, a novel Additive Nonparametric Margin Maximum for Case-Based Reasoning (ANMM4CBR) method is proposed in this article. ANMM4CBR employs a case-based reasoning (CBR) method for classification. CBR is a suitable paradigm for microarray analysis, where the rules that define the domain knowledge are difficult to obtain because usually only a small number of training samples are available. Moreover, in order to select the most informative genes, we propose to perform feature selection via additively optimizing a nonparametric margin maximum criterion, which is defined based on gene pre-selection and sample clustering. Our feature selection method is very robust to noise in the data. RESULTS The effectiveness of our method is demonstrated on both simulated and real data sets. We show that the ANMM4CBR method performs better than some state-of-the-art methods such as support vector machine (SVM) and k nearest neighbor (kNN), especially when the data contain a high level of noise. AVAILABILITY The source code is attached as an additional file of this paper.
Affiliation(s)
- Bangpeng Yao
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, PR China
- Shao Li
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, PR China
|
670
|
Wu PS, Müller HG. Functional embedding for the classification of gene expression profiles. Bioinformatics 2010; 26:509-17. [PMID: 20053838 DOI: 10.1093/bioinformatics/btp711] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Low-sample-size (n), high-dimensional (large p) data with n << p are commonly encountered in genomics and statistical genetics. Ill-conditioning of the variance-covariance matrix for such data renders the traditional multivariate data analytical approaches unattractive. On the other hand, functional data analysis (FDA) approaches are designed for infinite-dimensional data and therefore may have potential for the analysis of large p data. We herein propose a functional embedding (FEM) technique, which exploits the interface between multivariate and functional data, aiming at borrowing strength across the sample through FDA techniques in order to resolve the difficulties caused by the high dimension p. RESULTS Using pairwise dissimilarities among predictor variables, one obtains a univariate configuration of these covariates. This is interpreted as variable ordination that defines the domain of a suitable function space, thus leading to the FEM of the high-dimensional data. The embedding may then be followed by functional logistic regression for the classification of high-dimensional multivariate data as an example for downstream analysis. The resulting functional classification is evaluated on several published gene expression array datasets and a mass spectrometric dataset, and is shown to compare favorably with various methods that have been employed previously for the classification of these high-dimensional gene expression profiles.
Affiliation(s)
- Ping-Shi Wu
- Department of Mathematics, Lehigh University, Bethlehem, PA 18015, USA.
|
671
|
Schaefer G, Nakashima T. Data Mining of Gene Expression Data by Fuzzy and Hybrid Fuzzy Methods. IEEE Transactions on Information Technology in Biomedicine 2010; 14:23-9. [PMID: 19846381 DOI: 10.1109/titb.2009.2033590] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Affiliation(s)
- Gerald Schaefer
- Department of Computer Science, Loughborough University, Loughborough LE11 3TU, UK.
|
672
|
Zhu S, Wang D, Yu K, Li T, Gong Y. Feature selection for gene expression using model-based entropy. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2010; 7:25-36. [PMID: 20150666 DOI: 10.1109/tcbb.2008.35] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminate biological samples of different types. Using machine learning techniques, traditional gene selection based on empirical mutual information suffers from the data sparseness issue due to the small number of samples. To overcome the sparseness issue, we propose a model-based approach to estimate the entropy of class variables on the model, instead of on the data themselves. Here, we use multivariate normal distributions to fit the data, because multivariate normal distributions have maximum entropy among all real-valued distributions with a specified mean and standard deviation and are widely used to approximate various distributions. Given that the data follow a multivariate normal distribution, since the conditional distribution of class variables given the selected features is a normal distribution, its entropy can be computed with the log-determinant of its covariance matrix. Because of the large number of genes, the computation of all possible log-determinants is not efficient. We propose several algorithms to largely reduce the computational cost. The experiments on seven gene data sets and the comparison with five other approaches show the accuracy of the multivariate Gaussian generative model for feature selection, and the efficiency of our algorithms.
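The log-determinant computation mentioned in the abstract can be sketched as follows. For a k-dimensional normal distribution with covariance Σ, the differential entropy is 0.5 * (k * log(2πe) + log det Σ), and the conditional covariance of the target given the selected features is the Schur complement of the feature block. The helper names and the toy covariance matrix are illustrative assumptions, not the authors' code, and the efficient search algorithms of the paper are not reproduced here.

```python
import numpy as np


def gaussian_entropy(cov):
    """Differential entropy in nats of a multivariate normal with covariance cov."""
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    k = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    return 0.5 * (k * np.log(2 * np.pi * np.e) + logdet)


def conditional_entropy(cov, target_idx, feature_idx):
    """Entropy of the target variables given the features under a joint Gaussian.

    The conditional covariance is the Schur complement
    S_tt - S_tf S_ff^{-1} S_ft of the feature block.
    """
    cov = np.asarray(cov, dtype=float)
    s_tt = cov[np.ix_(target_idx, target_idx)]
    s_tf = cov[np.ix_(target_idx, feature_idx)]
    s_ff = cov[np.ix_(feature_idx, feature_idx)]
    cond_cov = s_tt - s_tf @ np.linalg.solve(s_ff, s_tf.T)
    return gaussian_entropy(cond_cov)


# Hypothetical joint covariance: variable 0 plays the role of the class variable,
# variable 1 is an informative gene, variable 2 is an unrelated gene.
cov = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

h_marginal = gaussian_entropy(cov[:1, :1])
h_given_informative = conditional_entropy(cov, [0], [1])
h_given_noise = conditional_entropy(cov, [0], [2])
```

Under these assumptions, conditioning on the informative gene lowers the entropy of the class variable while conditioning on the unrelated gene leaves it unchanged, which is the signal a selection criterion of this kind can rank genes by.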
Affiliation(s)
- Shenghuo Zhu
- NEC Laboratories America, 10080 North Wolfe Road, Cupertino, CA 95014, USA.
|
673
|
Linear Discriminant Analysis with more Variables than Observations: A not so Naive Approach. STUDIES IN CLASSIFICATION, DATA ANALYSIS, AND KNOWLEDGE ORGANIZATION 2010. [DOI: 10.1007/978-3-642-10745-0_24] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/10/2023]
|
674
|
Sun M. Linear Programming Approaches for Multiple-Class Discriminant and Classification Analysis. International Journal of Strategic Decision Sciences 2010. [DOI: 10.4018/jsds.2010103004] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
New linear programming approaches are proposed as nonparametric procedures for multiple-class discriminant and classification analysis. A new MSD model minimizing the sum of the classification errors is formulated to construct discriminant functions. This model has desirable properties because it is versatile and is immune to the pathologies of some of the earlier mathematical programming models for two-class classification. It is also purely systematic and algorithmic, and requires no ad hoc or trial-and-error judgment from the user. Furthermore, it can be used as the basis to develop other models, such as a multiple-class support vector machine and a mixed integer programming model, for discrimination and classification. An MMD model minimizing the maximum of the classification errors, although of very limited use, is also studied. These models may also be considered as generalizations of mathematical programming formulations for two-class classification. By the same approach, other mathematical programming formulations for two-class classification can be easily generalized to multiple-class formulations. Results on standard as well as randomly generated test datasets show that the MSD model is very effective in generating powerful discriminant functions.
|
675
|
Deng X, Campagne F. Introduction to the development and validation of predictive biomarker models from high-throughput data sets. Methods Mol Biol 2010; 620:435-470. [PMID: 20652515 DOI: 10.1007/978-1-60761-580-4_15] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]
Abstract
High-throughput technologies can routinely assay biological or clinical samples and produce wide data sets where each sample is associated with tens of thousands of measurements. Such data sets can be mined to discover biomarkers and develop statistical models capable of predicting an endpoint of interest from data measured in the samples. The field of biomarker model development combines methods from statistics and machine learning to develop and evaluate predictive biomarker models. In this chapter, we discuss the computational steps involved in the development of biomarker models designed to predict information about individual samples and review approaches often used to implement each step. A practical example of biomarker model development in a large gene expression data set is presented. This example leverages BDVal, a suite of biomarker model development programs developed as an open-source project (see http://bdval.org/).
Affiliation(s)
- Xutao Deng
- HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Medical College of Cornell University, New York, NY, USA
|
676
|
Oliveri P, Casolino MC, Forina M. Chemometric brains for artificial tongues. ADVANCES IN FOOD AND NUTRITION RESEARCH 2010; 61:57-117. [PMID: 21092902 DOI: 10.1016/b978-0-12-374468-5.00002-7] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
Recent years have shown a significant trend toward the exploitation of rapid and economical analytical devices able to provide multiple kinds of information about samples. Among these, the so-called artificial tongues represent effective tools which allow a global sample characterization comparable to a fingerprint. Born as taste sensors for food evaluation, such devices proved to be useful for a wider range of purposes. In this review, a critical overview of artificial tongue applications over the last decade is outlined. In particular, the focus is on the chemometric techniques, which allow the extraction of valuable information from nonspecific data. The basic steps of signal processing and pattern recognition are discussed and the principal chemometric techniques are described in detail, highlighting benefits and drawbacks of each one. Furthermore, some novel methods recently introduced and particularly suitable for artificial tongue data are presented.
Affiliation(s)
- Paolo Oliveri
- Department of Drug and Food Chemistry and Technology, University of Genoa, Genoa, Italy.
|
677
|
Boulesteix AL, Strobl C. Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction. BMC Med Res Methodol 2009; 9:85. [PMID: 20025773 PMCID: PMC2813849 DOI: 10.1186/1471-2288-9-85] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2008] [Accepted: 12/21/2009] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias. METHODS In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure. RESULTS We assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly. CONCLUSIONS The median minimal error rate over the investigated classifiers was as low as 31% and 41% based on permuted uninformative predictors from studies on colon cancer and prostate cancer, respectively. We conclude that the strategy to present only the optimal result is not acceptable because it yields a substantial bias in error rate estimation, and suggest alternative approaches for properly reporting classification accuracy.
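The selection bias described above can be reproduced in miniature. The sketch below is not the study's protocol: it uses simulated Gaussian noise with randomly assigned labels instead of permuted real microarray data, a single nearest-centroid classifier, and seven hypothetical variants that differ only in the number of genes selected inside each training fold. All names and sizes are assumptions.

```python
import random
import statistics


def cv_error_nearest_centroid(X, y, n_genes, n_folds=5):
    """k-fold CV error of a nearest-centroid classifier.

    Genes are ranked by absolute difference of class means, and the ranking is
    recomputed inside every training fold so that selection itself is validated.
    """
    n, p = len(y), len(X[0])
    errors = 0
    for fold in range(n_folds):
        test = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        centroid = {}
        for c in (0, 1):
            rows = [X[i] for i in train if y[i] == c]
            centroid[c] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        genes = sorted(range(p), key=lambda j: -abs(centroid[0][j] - centroid[1][j]))[:n_genes]
        for i in test:
            dist = {c: sum((X[i][j] - centroid[c][j]) ** 2 for j in genes) for c in (0, 1)}
            predicted = 0 if dist[0] <= dist[1] else 1
            errors += predicted != y[i]
    return errors / n


rng = random.Random(0)
n_samples, n_total_genes = 40, 100
# Pure noise: no gene carries any information about the labels.
X = [[rng.gauss(0, 1) for _ in range(n_total_genes)] for _ in range(n_samples)]
y = [0] * 20 + [1] * 20
rng.shuffle(y)

variants = [1, 2, 5, 10, 20, 50, 100]  # hypothetical "number of genes" settings
errors = [cv_error_nearest_centroid(X, y, k) for k in variants]
min_error = min(errors)                  # what gets reported if only the best is shown
median_error = statistics.median(errors)  # a more honest summary of the variants
```

Because the labels are independent of the data, the true error of every variant is 50%; reporting only `min_error` therefore understates the error whenever the variants disagree, and the gap tends to widen as more variants are tried.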
Affiliation(s)
- Anne-Laure Boulesteix
- Department of Statistics, University of Munich, Ludwigstr 33, D-80539 Munich, Germany
- Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindenerstr 1, D-81677 Munich, Germany
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr 15, D-81377 Munich, Germany
- Carolin Strobl
- Department of Statistics, University of Munich, Ludwigstr 33, D-80539 Munich, Germany
|
678
|
Takeno A, Takemasa I, Seno S, Yamasaki M, Motoori M, Miyata H, Nakajima K, Takiguchi S, Fujiwara Y, Nishida T, Okayama T, Matsubara K, Takenaka Y, Matsuda H, Monden M, Mori M, Doki Y. Gene Expression Profile Prospectively Predicts Peritoneal Relapse After Curative Surgery of Gastric Cancer. Ann Surg Oncol 2009; 17:1033-42. [DOI: 10.1245/s10434-009-0854-1] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2009] [Indexed: 01/24/2023]
|
679
|
Huang D, Quan Y, He M, Zhou B. Comparison of linear discriminant analysis methods for the classification of cancer based on gene expression data. JOURNAL OF EXPERIMENTAL & CLINICAL CANCER RESEARCH : CR 2009; 28:149. [PMID: 20003274 PMCID: PMC2800110 DOI: 10.1186/1756-9966-28-149] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/08/2009] [Accepted: 12/10/2009] [Indexed: 11/14/2022]
Abstract
Background More studies based on gene expression data have been reported in great detail; however, one major challenge for the methodologists is the choice of classification methods. The main purpose of this research was to compare the performance of linear discriminant analysis (LDA) and its modification methods for the classification of cancer based on gene expression data. Methods The classification performance of linear discriminant analysis (LDA) and its modification methods was evaluated by applying these methods to six public cancer gene expression datasets. These methods included linear discriminant analysis (LDA), prediction analysis for microarrays (PAM), shrinkage centroid regularized discriminant analysis (SCRDA), shrinkage linear discriminant analysis (SLDA) and shrinkage diagonal discriminant analysis (SDDA). The procedures were performed by software R 2.80. Results PAM picked out fewer feature genes than the other methods from most datasets, the exception being the Brain dataset. For the two methods of shrinkage discriminant analysis, SLDA selected more genes than SDDA from most datasets, the exception being the 2-class lung cancer dataset. When comparing SLDA with SCRDA, SLDA selected more genes than SCRDA from the 2-class lung cancer, SRBCT and Brain datasets; the result was the opposite for the remaining datasets. The average test error of the LDA modification methods was lower than that of the LDA method. Conclusions The classification performance of LDA modification methods was superior to that of traditional LDA with respect to the average error, and there was no significant difference between these modification methods.
Affiliation(s)
- Desheng Huang
- Department of Mathematics, College of Basic Medical Sciences, China Medical University, and Computer Center, Affiliated Shenjing Hospital, Shenyang, China.
|
680
|
Kandaswamy KK, Pugalenthi G, Hartmann E, Kalies KU, Möller S, Suganthan PN, Martinetz T. SPRED: A machine learning approach for the identification of classical and non-classical secretory proteins in mammalian genomes. Biochem Biophys Res Commun 2009; 391:1306-11. [PMID: 19995554 DOI: 10.1016/j.bbrc.2009.12.019] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2009] [Accepted: 12/03/2009] [Indexed: 10/20/2022]
Abstract
Eukaryotic protein secretion generally occurs via the classical secretory pathway that traverses the ER and Golgi apparatus. Secreted proteins usually contain a signal sequence with all the essential information required to target them for secretion. However, some proteins like fibroblast growth factors (FGF-1, FGF-2), interleukins (IL-1 alpha, IL-1 beta), galectins and thioredoxin are exported by an alternative pathway. This is known as leaderless or non-classical secretion and works without a signal sequence. Most computational methods for the identification of secretory proteins use the signal peptide as indicator and are therefore not able to identify substrates of non-classical secretion. In this work, we report a random forest method, SPRED, to identify secretory proteins from protein sequences irrespective of N-terminal signal peptides, thus allowing also correct classification of non-classical secretory proteins. Training was performed on a dataset containing 600 extracellular proteins and 600 cytoplasmic and/or nuclear proteins. The algorithm was tested on 180 extracellular proteins and 1380 cytoplasmic and/or nuclear proteins. We obtained 85.92% accuracy from training and 82.18% accuracy from testing. Since SPRED does not use N-terminal signals, it can detect non-classical secreted proteins by filtering those secreted proteins with an N-terminal signal by using SignalP. SPRED predicted 15 out of 19 experimentally verified non-classical secretory proteins. By scanning the entire human proteome we identified 566 protein sequences potentially undergoing non-classical secretion. The dataset and standalone version of the SPRED software are available at http://www.inb.uni-luebeck.de/tools-demos/spred/spred.
|
681
|
|
682
|
Advances in clinical trial designs for predictive biomarker discovery and validation. CURRENT BREAST CANCER REPORTS 2009. [DOI: 10.1007/s12609-009-0030-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
683
|
Sontrop HMJ, Moerland PD, van den Ham R, Reinders MJT, Verhaegh WFJ. A comprehensive sensitivity analysis of microarray breast cancer classification under feature variability. BMC Bioinformatics 2009; 10:389. [PMID: 19941644 PMCID: PMC2789744 DOI: 10.1186/1471-2105-10-389] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2009] [Accepted: 11/26/2009] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Large discrepancies in signature composition and outcome concordance have been observed between different microarray breast cancer expression profiling studies. This is often ascribed to differences in array platform as well as biological variability. We conjecture that other reasons for the observed discrepancies are the measurement error associated with each feature and the choice of preprocessing method. Microarray data are known to be subject to technical variation and the confidence intervals around individual point estimates of expression levels can be wide. Furthermore, the estimated expression values also vary depending on the selected preprocessing scheme. In microarray breast cancer classification studies, however, these two forms of feature variability are almost always ignored and hence their exact role is unclear. RESULTS We have performed a comprehensive sensitivity analysis of microarray breast cancer classification under the two types of feature variability mentioned above. We used data from six state-of-the-art preprocessing methods, using a compendium consisting of eight different datasets, involving 1131 hybridizations, containing data from both one- and two-color array technology. For a wide range of classifiers, we performed a joint study on performance, concordance and stability. In the stability analysis we explicitly tested classifiers for their noise tolerance by using perturbed expression profiles that are based on uncertainty information directly related to the preprocessing methods. Our results indicate that signature composition is strongly influenced by feature variability, even if the array platform and the stratification of patient samples are identical. In addition, we show that there is often a high level of discordance between individual class assignments for signatures constructed on data coming from different preprocessing schemes, even if the actual signature composition is identical. CONCLUSION Feature variability can have a strong impact on breast cancer signature composition, as well as the classification of individual patient samples. We therefore strongly recommend that feature variability be considered in analyzing data from microarray breast cancer expression profiling experiments.
|
684
|
Ai-Jun Y, Xin-Yuan S. Bayesian variable selection for disease classification using gene expression data. Bioinformatics 2009; 26:215-22. [DOI: 10.1093/bioinformatics/btp638] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
|
685
|
Hu P, Greenwood CMT, Beyene J. Using the ratio of means as the effect size measure in combining results of microarray experiments. BMC SYSTEMS BIOLOGY 2009; 3:106. [PMID: 19891778 PMCID: PMC2784452 DOI: 10.1186/1752-0509-3-106] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/26/2009] [Accepted: 11/05/2009] [Indexed: 12/19/2022]
Abstract
Background Development of efficient analytic methodologies for combining microarray results is a major challenge in gene expression analysis. The widely used effect size models are thought to provide an efficient modeling framework for this purpose, where the measures of association for each study and each gene are combined, weighted by the standard errors. A significant disadvantage of this strategy is that the quality of different data sets may be highly variable, but this information is usually neglected during the integration. Moreover, it is widely known that the estimated standard deviations are probably unstable in the commonly used effect size measures (such as standardized mean difference) when sample sizes in each group are small. Results We propose a re-parameterization of the traditional mean difference based effect measure by using the log ratio of means as an effect size measure for each gene in each study. The estimated effect sizes for all studies were then combined under two modeling frameworks: the quality-unweighted random effects models and the quality-weighted random effects models. We defined the quality measure as a function of the detection p-value, which indicates whether a transcript is reliably detected or not on the Affymetrix gene chip. The new effect size measure is evaluated and compared under the quality-weighted and quality-unweighted data integration frameworks using simulated data sets, and also in several data sets of prostate cancer patients and controls. We focus on identifying differentially expressed biomarkers for prediction of cancer outcomes. 
Conclusion Our results show that the proposed effect size measure (log ratio of means) has better power to identify differentially expressed genes, and that the detected genes have better performance in predicting cancer outcomes than the commonly used effect size measure, the standardized mean difference (SMD), under both quality-weighted and quality-unweighted data integration frameworks. The new effect size measure and the quality-weighted microarray data integration framework provide efficient ways to combine microarray results.
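As a rough illustration of the two effect size measures compared in this abstract, the sketch below computes the log ratio of means and a standardized mean difference for one gene in one study. The function names, the pooled-standard-deviation form of the SMD, and the toy intensity values are assumptions made for illustration; the paper's exact estimators and its quality weighting are not reproduced here.

```python
import math
from statistics import mean, variance


def log_ratio_of_means(cases, controls):
    """Log of the ratio of group means; assumes strictly positive intensities."""
    return math.log(mean(cases) / mean(controls))


def standardized_mean_difference(cases, controls):
    """Mean difference divided by the pooled sample SD (one common SMD form)."""
    n1, n2 = len(cases), len(controls)
    pooled_var = ((n1 - 1) * variance(cases) + (n2 - 1) * variance(controls)) / (n1 + n2 - 2)
    return (mean(cases) - mean(controls)) / math.sqrt(pooled_var)


# Hypothetical expression intensities for a single gene (not from the paper).
cases = [200.0, 220.0, 180.0]
controls = [100.0, 110.0, 90.0]

lrom = log_ratio_of_means(cases, controls)
smd = standardized_mean_difference(cases, controls)
```

The SMD depends on an estimated standard deviation, which is the quantity the abstract describes as unstable when group sizes are small; the log ratio of means uses only the two group means.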
Affiliation(s)
- Pingzhao Hu
- The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON, Canada.
|
686
|
Mary-Huard T, Robin S. Tailored aggregation for classification. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2009; 31:2098-2105. [PMID: 19762936 DOI: 10.1109/tpami.2009.55] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
Compression and variable selection are two classical strategies to deal with large-dimension data sets in classification. We propose an alternative strategy, called aggregation, which consists of a clustering step of redundant variables and a compression step within each group. We develop a statistical framework to define tailored aggregation methods that can be combined with selection methods to build reliable classifiers that benefit from the information contained in redundant variables. Two algorithms are proposed for ordered and nonordered variables, respectively. Applications to the kNN and CART algorithms are presented.
|
687
|
Abstract
DNA microarrays are powerful tools for studying biological mechanisms and for developing prognostic and predictive classifiers for identifying the patients who require treatment and are the best candidates for specific treatments. Because microarrays produce so much data from each specimen, they offer great opportunities for discovery and great dangers of producing misleading claims. Microarray based studies require clear objectives for selecting cases and appropriate analysis methods. Effective analysis of microarray data, where the number of measured variables is orders of magnitude greater than the number of cases, requires specialized statistical methods which have recently been developed. Recent literature reviews indicate that serious problems of analysis exist in a substantial proportion of publications. This manuscript attempts to provide a non-technical summary of the key principles of statistical design and analysis for studies that utilize microarray expression profiling.
Affiliation(s)
- Richard Simon
- Biometric Research Branch, Division of Cancer Treatment & Diagnosis, National Cancer Institute, 9000 Rockville Pike, Bethesda, MD 20892-7434, USA.
|
688
|
Romualdi C, Giuliani A, Millino C, Celegato B, Benigni R, Lanfranchi G. Correlation between gene expression and clinical data through linear and nonlinear principal components analyses: muscular dystrophies as case studies. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2009; 13:173-84. [PMID: 19405797 DOI: 10.1089/omi.2009.0003] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
The large dimension of microarray data and the complex dependence structure among genes make data analysis extremely challenging. In the last decade several statistical techniques have been proposed to tackle genome-wide expression data; however, clinical and molecular data associated with pathologies have often been considered as separate dimensions of the same phenomenon, especially when clinical variables lie on a multidimensional space. A better comprehension of the relationships between clinical and molecular data can be obtained if both data types are combined and integrated. In this work we adopt a multidimensional correlation strategy together with linear and nonlinear principal component analyses to integrate genetic and clinical information obtained from two sets of dystrophic patients. With this approach we decompose different aspects of clinical manifestations and correlate these features with the corresponding patterns of differential gene expression.
Affiliation(s)
- Chiara Romualdi
- CRIBI Biotechnology Centre and Dipartimento di Biologia, Università degli Studi di Padova, Padova, Italy.
|
689
|
Abstract
Background Microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes in a single experiment. However, the large number of genes greatly increases the challenges of analyzing, comprehending and interpreting the resulting mass of data. Selecting a subset of important genes is necessary to address the challenge. Gene selection has been investigated extensively over the last decade. Most selection procedures, however, are not sufficient for accurate inference of underlying biology, because biological significance does not necessarily have to be statistically significant. Additional biological knowledge needs to be integrated into the gene selection procedure. Results We propose a general framework for gene ranking. We construct a bipartite graph from the Gene Ontology (GO) and gene expression data. The graph describes the relationship between genes and their associated molecular functions. Under a species condition, edge weights of the graph are assigned to be gene expression level. Such a graph provides a mathematical means to represent both species-independent and species-dependent biological information. We also develop a new ranking algorithm to analyze the weighted graph via a kernelized spatial depth (KSD) approach. Consequently, the importance of gene and molecular function can be simultaneously ranked by a real-valued measure, KSD, which incorporates the global and local structure of the graph. Over-expressed and under-regulated genes also can be separately ranked. Conclusion The gene-function bigraph integrates molecular function annotations into gene expression data. The relevance of genes is described in the graph (through a common function). The proposed method provides an exploratory framework for gene data analysis.
Affiliation(s)
- Cuilan Gao
- Department of Mathematics, The University of Mississippi, University, MS 38677, USA
|
690
|
Wei Z, Wang K, Qu HQ, Zhang H, Bradfield J, Kim C, Frackleton E, Hou C, Glessner JT, Chiavacci R, Stanley C, Monos D, Grant SFA, Polychronakos C, Hakonarson H. From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes. PLoS Genet 2009; 5:e1000678. [PMID: 19816555 PMCID: PMC2748686 DOI: 10.1371/journal.pgen.1000678] [Citation(s) in RCA: 141] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2009] [Accepted: 09/06/2009] [Indexed: 01/22/2023] Open
Abstract
Genome-wide association studies (GWAS) have been fruitful in identifying disease susceptibility loci for common and complex diseases. A remaining question is whether we can quantify individual disease risk based on genotype data, in order to facilitate personalized prevention and treatment for complex diseases. Previous studies have typically failed to achieve satisfactory performance, primarily due to the use of only a limited number of confirmed susceptibility loci. Here we propose that sophisticated machine-learning approaches with a large ensemble of markers may improve the performance of disease risk assessment. We applied a Support Vector Machine (SVM) algorithm on a GWAS dataset generated on the Affymetrix genotyping platform for type 1 diabetes (T1D) and optimized a risk assessment model with hundreds of markers. We subsequently tested this model on an independent Illumina-genotyped dataset with imputed genotypes (1,008 cases and 1,000 controls), as well as a separate Affymetrix-genotyped dataset (1,529 cases and 1,458 controls), resulting in area under ROC curve (AUC) of approximately 0.84 in both datasets. In contrast, poor performance was achieved when limited to dozens of known susceptibility loci in the SVM model or logistic regression model. Our study suggests that improved disease risk assessment can be achieved by using algorithms that take into account interactions between a large ensemble of markers. We are optimistic that genotype-based disease risk assessment may be feasible for diseases where a notable proportion of the risk has already been captured by SNP arrays.
Affiliation(s)
- Zhi Wei
- Department of Computer Science, New Jersey Institute of Technology, Newark, New Jersey, United States of America
- Kai Wang
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Hui-Qi Qu
- Departments of Pediatrics and Human Genetics, McGill University, Montreal, Québec, Canada
- Haitao Zhang
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Jonathan Bradfield
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Cecilia Kim
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Edward Frackleton
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Cuiping Hou
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Joseph T. Glessner
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Rosetta Chiavacci
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Charles Stanley
- Division of Endocrinology, Department of Pediatrics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Dimitri Monos
- Department of Pathology and Laboratory Medicine, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Struan F. A. Grant
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Division of Genetics, Department of Pediatrics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Hakon Hakonarson
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Division of Genetics, Department of Pediatrics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
|
691
|
Chakraborty S. Bayesian binary kernel probit model for microarray based cancer classification and gene selection. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2009.05.007] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
692
|
Niijima S, Okuno Y. Laplacian linear discriminant analysis approach to unsupervised feature selection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2009; 6:605-614. [PMID: 19875859 DOI: 10.1109/tcbb.2007.70257] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
Until recently, numerous feature selection techniques have been proposed and found wide applications in genomics and proteomics. For instance, feature/gene selection has proven to be useful for biomarker discovery from microarray and mass spectrometry data. While supervised feature selection has been explored extensively, there are only a few unsupervised methods that can be applied to exploratory data analysis. In this paper, we address the problem of unsupervised feature selection. First, we extend Laplacian linear discriminant analysis (LLDA) to unsupervised cases. Second, we propose a novel algorithm for computing LLDA, which is efficient in the case of high dimensionality and small sample size as in microarray data. Finally, an unsupervised feature selection method, called LLDA-based Recursive Feature Elimination (LLDA-RFE), is proposed. We apply LLDA-RFE to several public data sets of cancer microarrays and compare its performance with those of Laplacian score and SVD-entropy, two state-of-the-art unsupervised methods, and with that of Fisher score, a supervised filter method. Our results demonstrate that LLDA-RFE outperforms Laplacian score and shows favorable performance against SVD-entropy. It performs even better than Fisher score for some of the data sets, despite the fact that LLDA-RFE is fully unsupervised.
Affiliation(s)
- Satoshi Niijima
- Department of PharmacoInformatics, Center for Integrative Education of Pharmacy Frontier, Graduate School of Pharmaceutical Sciences, Kyoto University, 46-29 Yoshida Shimoadachi-cho, Sakyo-ku, Kyoto 606-8501, Japan.
|
693
|
Duval B, Hao JK. Advances in metaheuristics for gene selection and classification of microarray data. Brief Bioinform 2009; 11:127-41. [PMID: 19789265 DOI: 10.1093/bib/bbp035] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Affiliation(s)
- Béatrice Duval
- University of Angers, 2 Boulevard Lavoisier, 49045 Angers Cedex 01, France
|
694
|
Cheng Q, Cheng J. Sparsity optimization method for multivariate feature screening for gene expression analysis. J Comput Biol 2009; 16:1241-52. [PMID: 19772435 DOI: 10.1089/cmb.2008.0034] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Constructing features from high-dimensional gene expression data is a critically important task for monitoring and predicting patients' diseases, or for knowledge discovery in computational molecular biology. The features need to capture the essential characteristics of the data to be maximally distinguishable. Moreover, the essential features usually lie in small or extremely low-dimensional subspaces, and it is crucial to find them for knowledge discovery and pattern classification. We present a computational method for extracting small or even extremely low-dimensional subspaces for multivariate feature screening and gene expression analysis using sparse optimization techniques. After we transform the feature screening problem into a convex optimization problem, we develop an efficient primal-dual interior-point method expressly for solving large-scale problems. The effectiveness of our method is confirmed by our experimental results. The computer programs will be publicly available.
Affiliation(s)
- Qiang Cheng
- Computer Science Department, Southern Illinois University , Carbondale, Illinois, USA.
|
695
|
Development of a voltammetric electronic tongue for discrimination of edible oils. Anal Bioanal Chem 2009; 395:1135-43. [DOI: 10.1007/s00216-009-3070-8] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2009] [Revised: 07/28/2009] [Accepted: 08/14/2009] [Indexed: 10/20/2022]
|
696
|
Sewak MS, Reddy NP, Duan ZH. Gene expression based leukemia sub-classification using committee neural networks. Bioinform Biol Insights 2009; 3:89-98. [PMID: 20140065 PMCID: PMC2808175 DOI: 10.4137/bbi.s2908] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022] Open
Abstract
Analysis of gene expression data provides an objective and efficient technique for sub-classification of leukemia. The purpose of the present study was to design committee neural network based classification systems to subcategorize leukemia gene expression data. In the study, a binary classification system was considered to differentiate acute lymphoblastic leukemia from acute myeloid leukemia. A ternary classification system which classifies leukemia expression data into three subclasses including B-cell acute lymphoblastic leukemia, T-cell acute lymphoblastic leukemia and acute myeloid leukemia was also developed. In each classification system gene expression profiles of leukemia patients were first subjected to a sequence of simple preprocessing steps. This resulted in filtering out approximately 95 percent of the non-informative genes. The remaining 5 percent of the informative genes were used to train a set of artificial neural networks with different parameters and architectures. The networks that gave the best results during initial testing were recruited into a committee. The committee decision was by majority voting. The committee neural network system was later evaluated using data not used in training. The binary classification system classified microarray gene expression profiles into two categories with 100 percent accuracy and the ternary system correctly predicted the three subclasses of leukemia in over 97 percent of the cases.
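The majority-voting step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the class labels, and the tie-breaking rule (the label seen first wins) are assumptions, and the trained networks are replaced by a plain list of their predicted labels.

```python
from collections import Counter


def committee_vote(predictions):
    """Return the label predicted by the most committee members.

    Ties go to the label that appears first in `predictions`
    (an assumed rule; the abstract does not say how ties are handled).
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


# Hypothetical outputs of three committee members for one patient sample.
member_outputs = ["ALL", "AML", "ALL"]
decision = committee_vote(member_outputs)
```

Using an odd number of members in the binary case avoids ties altogether, so the tie-breaking assumption only matters for the ternary system.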
Affiliation(s)
- Mihir S Sewak
- Department of Biomedical Engineering, University of Akron, Akron, OH 44325-0302
|
697
|
Abstract
Classical prediction methods such as Fisher's linear discriminant function were designed for small-scale problems, where the number of predictors N is much smaller than the number of observations n. Modern scientific devices often reverse this situation. A microarray analysis, for example, might include n = 100 subjects measured on N = 10,000 genes, each of which is a potential predictor. This paper proposes an empirical Bayes approach to large-scale prediction, where the optimum Bayes prediction rule is estimated employing the data from all the predictors. Microarray examples are used to illustrate the method. The results show a close connection with the shrunken centroids algorithm of Tibshirani et al. (2002), a frequentist regularization approach to large-scale prediction, and also with false discovery rate theory.
|
698
|
Hall P, Titterington DM, Xue JH. Tilting methods for assessing the influence of components in a classifier. J R Stat Soc Series B Stat Methodol 2009. [DOI: 10.1111/j.1467-9868.2009.00701.x] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
699
|
Abstract
Several sparseness penalties have been suggested for delivery of good predictive performance in automatic variable selection within the framework of regularization. All assume that the true model is sparse. We propose a penalty, a convex combination of the L1- and L∞-norms, that adapts to a variety of situations including sparseness and nonsparseness, grouping and nongrouping. The proposed penalty performs grouping and adaptive regularization. In addition, we introduce a novel homotopy algorithm utilizing subgradients for developing regularization solution surfaces involving multiple regularizers. This permits efficient computation and adaptive tuning. Numerical experiments are conducted using simulation. In simulated and real examples, the proposed penalty compares well against popular alternatives.
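To make the form of such a penalty concrete, the sketch below evaluates a convex combination of the L1- and L∞-norms for a coefficient vector. The parameterization (an overall strength `lam` and a mixing weight `alpha` in [0, 1]) and the function name are assumptions chosen for illustration; the abstract does not give the paper's exact notation, and the homotopy algorithm is not shown.

```python
def l1_linf_penalty(beta, lam, alpha):
    """Evaluate lam * ((1 - alpha) * ||beta||_1 + alpha * ||beta||_inf).

    alpha = 0 gives a pure L1 (lasso-type) penalty and alpha = 1 a pure
    L-infinity penalty; this parameterization is an assumed one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    abs_beta = [abs(b) for b in beta]
    l1_norm = sum(abs_beta)
    linf_norm = max(abs_beta, default=0.0)
    return lam * ((1.0 - alpha) * l1_norm + alpha * linf_norm)


# Hypothetical coefficient vector.
penalty = l1_linf_penalty([1.0, -2.0, 0.5], lam=2.0, alpha=0.25)
```

The L1 term encourages sparse solutions, while the L∞ term tends to pull the largest coefficients toward a common magnitude, which is one way to read the grouping behavior mentioned in the abstract.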
Affiliation(s)
- S Wu
- School of Statistics , University of Minnesota , 313 Ford Hall, 224 Church Street S. E., Minneapolis, Minnesota 55455 , U.S.A.
|
700
|
Jonker MJ, Bruning O, van Iterson M, Schaap MM, van der Hoeven TV, Vrieling H, Beems RB, de Vries A, van Steeg H, Breit TM, Luijten M. Finding transcriptomics biomarkers for in vivo identification of (non-)genotoxic carcinogens using wild-type and Xpa/p53 mutant mouse models. Carcinogenesis 2009; 30:1805-12. [DOI: 10.1093/carcin/bgp190] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023] Open
|