401
|
De Boever P, Wens B, Forcheh AC, Reynders H, Nelen V, Kleinjans J, Van Larebeke N, Verbeke G, Valkenborg D, Schoeters G. Characterization of the peripheral blood transcriptome in a repeated measures design using a panel of healthy individuals. Genomics 2013; 103:31-9. [PMID: 24321174 DOI: 10.1016/j.ygeno.2013.11.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 01/09/2013] [Revised: 06/14/2013] [Accepted: 11/29/2013] [Indexed: 01/01/2023]
Abstract
A repeated measures microarray design with 22 healthy, non-smoking volunteers (aged 32 ± 5 years) was set up to study transcriptome profiles in whole blood samples. The results indicate that repeatable data can be obtained with high within-subject correlation. Probes that could discriminate between individuals are associated with immune and inflammatory functions. When investigating possible time trends in the microarray data, we found no differential expression within a sampling period (within-season effect). Differential expression was observed between sampling seasons, and the data suggest a weak response of genes related to immune system functioning. Finally, a high number of probes showed significant season-specific expression variability within subjects. Expression variability increased in springtime, and the probe list was associated with immune system functioning. Our study suggests that the blood transcriptome of healthy individuals is reproducible over a period of several months.
Affiliation(s)
- Patrick De Boever
- Flemish Institute for Technological Research, Unit Environmental Risk and Health, Belgium; Hasselt University, Centre for Environmental Sciences, Belgium
- Britt Wens
- Flemish Institute for Technological Research, Unit Environmental Risk and Health, Belgium
- Anyiawung Chiara Forcheh
- Catholic University of Leuven, Interuniversity Institute for Biostatistics and Statistical Bioinformatics, Belgium
- Hans Reynders
- Flemish Government, Environment, Nature and Energy Department, Belgium
- Vera Nelen
- Provincial Institute for Hygiene, Belgium
- Jos Kleinjans
- Maastricht University, Department of Toxicogenomics, The Netherlands
- Nicolas Van Larebeke
- Ghent University, Study Centre for Carcinogenesis and Primary Prevention of Cancer, Belgium
- Geert Verbeke
- Catholic University of Leuven, Interuniversity Institute for Biostatistics and Statistical Bioinformatics, Belgium
- Dirk Valkenborg
- Flemish Institute for Technological Research, Unit Environmental Risk and Health, Belgium; Hasselt University, Interuniversity Institute for Biostatistics and Statistical Bioinformatics, Belgium
- Greet Schoeters
- Flemish Institute for Technological Research, Unit Environmental Risk and Health, Belgium; University of Antwerp, Department of Biomedical Sciences, Belgium; University of Southern Denmark, Department of Environmental Medicine, Denmark
|
402
|
Reese SE, Archer KJ, Therneau TM, Atkinson EJ, Vachon CM, de Andrade M, Kocher JPA, Eckel-Passow JE. A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis. Bioinformatics 2013; 29:2877-83. [PMID: 23958724 PMCID: PMC3810845 DOI: 10.1093/bioinformatics/btt480] [Citation(s) in RCA: 92] [Impact Index Per Article: 7.7] [Received: 11/16/2012] [Revised: 07/03/2013] [Accepted: 08/14/2013] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION: Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data. RESULTS: We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies. CONCLUSION: We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well. AVAILABILITY AND IMPLEMENTATION: The gPCA R package (available via CRAN) provides functionality and data to perform the methods in this article. CONTACT: reesese@vcu.edu
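The guided-PCA idea in the abstract above can be illustrated with a short Python sketch. This is not the gPCA R package's code: the function names `gpca_delta` and `gpca_permutation_test` are hypothetical, and the statistic shown (variance of the data projected on the first batch-guided component, divided by the variance on the first ordinary principal component, with a permutation p-value) is an assumed reading of the approach the abstract describes.

```python
import numpy as np


def gpca_delta(X, batch):
    """Ratio of variance along the batch-guided first component to the
    variance along the ordinary (unguided) first principal component.

    X is an (n_samples, n_features) array; batch holds one label per sample.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    Xc = X - X.mean(axis=0)  # center each feature
    labels = np.unique(batch)
    # n_samples x n_batches indicator matrix of batch membership
    Y = (batch[:, None] == labels[None, :]).astype(float)
    # Unguided PCA: right singular vectors of the centered data
    _, _, vt_unguided = np.linalg.svd(Xc, full_matrices=False)
    # Guided PCA: right singular vectors of Y'X (batch-aggregated data)
    _, _, vt_guided = np.linalg.svd(Y.T @ Xc, full_matrices=False)
    var_unguided = np.var(Xc @ vt_unguided[0], ddof=1)
    var_guided = np.var(Xc @ vt_guided[0], ddof=1)
    return var_guided / var_unguided


def gpca_permutation_test(X, batch, n_perm=200, seed=0):
    """Permutation p-value for the ratio: batch labels are shuffled to
    build a reference distribution under 'no batch effect'."""
    rng = np.random.default_rng(seed)
    batch = np.asarray(batch)
    observed = gpca_delta(X, batch)
    permuted = np.array(
        [gpca_delta(X, rng.permutation(batch)) for _ in range(n_perm)]
    )
    p_value = (np.sum(permuted >= observed) + 1) / (n_perm + 1)
    return observed, p_value
```

A ratio close to 1 means the direction that best separates the batches is also close to the direction of largest overall variance; the permutation step gives a rough sense of whether that ratio is larger than expected by chance.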
Affiliation(s)
- Sarah E Reese
- Department of Biostatistics, Biostatistics Shared Resource Core, VCU Massey Cancer Center, Virginia Commonwealth University, Richmond, VA 23284, USA, Division of Biomedical Statistics and Informatics and Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA
|
403
|
Ma C, Dong X, Li R, Liu L. A computational study identifies HIV progression-related genes using mRMR and shortest path tracing. PLoS One 2013; 8:e78057. [PMID: 24244287 PMCID: PMC3823927 DOI: 10.1371/journal.pone.0078057] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 07/25/2013] [Accepted: 09/13/2013] [Indexed: 01/18/2023] Open
Abstract
Since statistical relationships between HIV load and CD4+ T cell loss have been demonstrated to be weak, searching for host factors contributing to the pathogenesis of HIV infection becomes a key point for both understanding the disease pathology and developing treatments. We applied the Maximum Relevance Minimum Redundancy (mRMR) algorithm to a set of microarray data generated from the CD4+ T cells of viremic non-progressors (VNPs) and rapid progressors (RPs) to identify host factors associated with the different responses to HIV infection. Using the mRMR algorithm, 147 genes were identified. Furthermore, we constructed a weighted molecular interaction network from the existing protein-protein interaction data in the STRING database and identified 1331 genes on the shortest paths among the genes identified with mRMR. Functional analysis shows that functions relating to apoptosis play important roles during the pathogenesis of HIV infection. These results bring new insights into HIV progression.
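As an illustration of the mRMR step named in the abstract above, here is a minimal Python sketch of greedy Maximum Relevance Minimum Redundancy selection. It assumes discretized features and the common "difference" criterion (relevance minus mean redundancy); the abstract does not say which mRMR variant or discretization the authors used, and the function names `mutual_information` and `mrmr_select` are hypothetical.

```python
import numpy as np


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    a_vals, a_idx = np.unique(np.ravel(a), return_inverse=True)
    b_vals, b_idx = np.unique(np.ravel(b), return_inverse=True)
    # Joint frequency table, normalized to a probability table
    joint = np.zeros((a_vals.size, b_vals.size))
    np.add.at(joint, (a_idx, b_idx), 1.0)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)  # marginal of a (column vector)
    pb = joint.sum(axis=0, keepdims=True)  # marginal of b (row vector)
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (pa @ pb)[nonzero])))


def mrmr_select(X, y, k):
    """Greedily pick k feature indices: each step adds the feature with the
    largest (relevance to y) minus (mean redundancy with selected features)."""
    X = np.asarray(X)
    n_features = X.shape[1]
    relevance = np.array(
        [mutual_information(X[:, j], y) for j in range(n_features)]
    )
    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    while len(selected) < min(k, n_features):
        best_score, best_j = -np.inf, None
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean(
                [mutual_information(X[:, j], X[:, s]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
    return selected
```

The redundancy term is what distinguishes mRMR from a plain relevance ranking: a feature that duplicates an already selected one is penalized even if it is individually informative.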
Affiliation(s)
- Chengcheng Ma
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China
- University of Chinese Academy of Sciences, Beijing, P.R. China
- Xiao Dong
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China
- University of Chinese Academy of Sciences, Beijing, P.R. China
- Shanghai Center for Bioinformation Technology, Shanghai, P.R. China
- Rudong Li
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China
- University of Chinese Academy of Sciences, Beijing, P.R. China
- Lei Liu
- Institutes for Biomedical Sciences, Fudan University, Shanghai, P.R. China
|
404
|
Moon H, Lopez KL, Lin GI, Chen JJ. Sex-specific genomic biomarkers for individualized treatment of life-threatening diseases. Dis Markers 2013; 35:661-7. [PMID: 24302811 PMCID: PMC3834650 DOI: 10.1155/2013/393020] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/30/2013] [Revised: 10/07/2013] [Accepted: 10/20/2013] [Indexed: 11/18/2022]
Abstract
Numerous studies have demonstrated sex differences in reactions to the same drug treatment, steering away from the traditional view of one-size-fits-all medicine. A premise of this study is that the sex of a patient influences differences in disease characteristics and risk factors. In this study, we aim to exploit gene-expression data to obtain better sex-specific biomarkers. We propose a procedure to isolate a set of important genes as sex-specific genomic biomarkers, which may enable more effective patient treatment. A set of sex-specific genes is obtained by a variable importance ranking using a combination of cross-validation methods. The proposed procedure is applied to three gene-expression datasets.
MESH Headings
- Adolescent
- Algorithms
- Biomarkers/metabolism
- Female
- Gene Expression Profiling
- Genetic Markers
- Genome, Human
- Humans
- Leukemia, Lymphocytic, Chronic, B-Cell/genetics
- Leukemia, Lymphocytic, Chronic, B-Cell/metabolism
- Leukemia, Lymphocytic, Chronic, B-Cell/therapy
- Leukemia, Myeloid, Acute/genetics
- Leukemia, Myeloid, Acute/metabolism
- Leukemia, Myeloid, Acute/therapy
- Male
- Melanoma/genetics
- Melanoma/metabolism
- Melanoma/therapy
- Models, Genetic
- Precision Medicine
- Prognosis
- Risk Factors
- Sex Characteristics
- Skin Neoplasms/genetics
- Skin Neoplasms/metabolism
- Skin Neoplasms/therapy
- Transcriptome
Affiliation(s)
- Hojin Moon
- Department of Mathematics and Statistics, California State University, 1250 Bellflower Boulevard, Long Beach, CA 90840-1001, USA
- Karen L. Lopez
- Department of Mathematics and Statistics, California State University, 1250 Bellflower Boulevard, Long Beach, CA 90840-1001, USA
- Grace I. Lin
- Department of Computer Science, University of California, Santa Cruz, CA 95064, USA
- James J. Chen
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, FDA, Jefferson, AR 72079, USA
- Graduate Institute of Biostatistics and Biostatistics Center, China Medical University, Taichung, Taiwan
|
405
|
A distance-based, misclassification rate adjusted classifier for multiclass, high-dimensional data. Ann Inst Stat Math 2013. [DOI: 10.1007/s10463-013-0435-8] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Indexed: 10/26/2022]
|
406
|
McShane LM, Cavenagh MM, Lively TG, Eberhard DA, Bigbee WL, Williams PM, Mesirov JP, Polley MYC, Kim KY, Tricoli JV, Taylor JMG, Shuman DJ, Simon RM, Doroshow JH, Conley BA. Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration. BMC Med 2013; 11:220. [PMID: 24228635 PMCID: PMC3852338 DOI: 10.1186/1741-7015-11-220] [Citation(s) in RCA: 91] [Impact Index Per Article: 7.6] [Received: 05/28/2013] [Accepted: 08/06/2013] [Indexed: 12/18/2022] Open
Abstract
High-throughput 'omics' technologies that generate molecular profiles for biospecimens have been extensively used in preclinical studies to reveal molecular subtypes and elucidate the biological mechanisms of disease, and in retrospective studies on clinical specimens to develop mathematical models to predict clinical endpoints. Nevertheless, the translation of these technologies into clinical tests that are useful for guiding management decisions for patients has been relatively slow. It can be difficult to determine when the body of evidence for an omics-based test is sufficiently comprehensive and reliable to support claims that it is ready for clinical use, or even that it is ready for definitive evaluation in a clinical trial in which it may be used to direct patient therapy. Reasons for this difficulty include the exploratory and retrospective nature of many of these studies, the complexity of these assays and their application to clinical specimens, and the many potential pitfalls inherent in the development of mathematical predictor models from the very high-dimensional data generated by these omics technologies. Here we present a checklist of criteria to consider when evaluating the body of evidence supporting the clinical use of a predictor to guide patient therapy. Included are issues pertaining to specimen and assay requirements, the soundness of the process for developing predictor models, expectations regarding clinical study design and conduct, and attention to regulatory, ethical, and legal issues. The proposed checklist should serve as a useful guide to investigators preparing proposals for studies involving the use of omics-based tests. The US National Cancer Institute plans to refer to these guidelines for review of proposals for studies involving omics tests, and it is hoped that other sponsors will adopt the checklist as well.
Affiliation(s)
- Lisa M McShane
- Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Room 5W130, MSC 9735, 9609 Medical Center Drive, Bethesda, MD 20892-9735, USA
- Margaret M Cavenagh
- Cancer Diagnosis Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Room 4W432, MSC 9730, 9609 Medical Center Drive, Bethesda, MD 20892, USA
- Tracy G Lively
- Cancer Diagnosis Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Room 4W420, MSC 9730, 9609 Medical Center Drive, Bethesda, MD 20892, USA
- David A Eberhard
- Department of Pathology and Lineberger Comprehensive Cancer Center, Brinkhous-Bullitt Bldg., Campus Box 7525, University of North Carolina, Chapel Hill, NC 27599, USA
- William L Bigbee
- Department of Pathology and University of Pittsburgh Cancer Institute, Hillman Cancer Center, UPCI Research Pavilion, Suite 2.32b, 5117 Centre Avenue, Pittsburgh, PA 15213, USA
- P Mickey Williams
- Frederick National Laboratory for Cancer Research, National Cancer Institute, National Institutes of Health, Bldg. 320, Room 2, 1050 Boyles Street, Frederick, MD 21702, USA
- Jill P Mesirov
- Computational Biology and Bioinformatics, Broad Institute of Massachusetts Institute of Technology and Harvard University, 7 Cambridge Center, Cambridge, MA 02142, USA
- Mei-Yin C Polley
- Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Room 5W638, 9609 Medical Center Drive, Bethesda, MD 20892, USA
- Kelly Y Kim
- Cancer Diagnosis Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Room 4W430, 9609 Medical Center Drive, Bethesda, MD 20892, USA
- James V Tricoli
- Cancer Diagnosis Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Room 3W526, 9609 Medical Center Drive, Bethesda, MD 20892, USA
- Jeremy MG Taylor
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA
- Deborah J Shuman
- Office of the Director, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Room 3A44, 31 Center Drive, Bethesda, MD 20892, USA
- Richard M Simon
- Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Room 5W110, 9609 Medical Center Drive, Bethesda, MD 20892, USA
- James H Doroshow
- Office of the Director, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Room 3A44, 31 Center Drive, Bethesda, MD 20892, USA
- Barbara A Conley
- Cancer Diagnosis Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Room 4W426, 9609 Medical Center Drive, Bethesda, MD 20892, USA
|
407
|
Complexity-reduced implementations of complete and null-space-based linear discriminant analysis. Neural Netw 2013; 46:165-71. [DOI: 10.1016/j.neunet.2013.05.010] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 01/28/2013] [Revised: 05/07/2013] [Accepted: 05/15/2013] [Indexed: 11/21/2022]
|
408
|
Wang C, Cao L, Miao B. Optimal feature selection for sparse linear discriminant analysis and its applications in gene expression data. Comput Stat Data Anal 2013. [DOI: 10.1016/j.csda.2013.04.003] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 11/26/2022]
|
409
|
Conde D, Salvador B, Rueda C, Fernández MA. Performance and estimation of the true error rate of classification rules built with additional information. An application to a cancer trial. Stat Appl Genet Mol Biol 2013; 12:583-602. [PMID: 24025649 DOI: 10.1515/sagmb-2012-0037] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Abstract
Classification rules that incorporate additional information usually present in discrimination problems have received some attention in recent years because they perform better than the usual rules. Fernández, M. A., C. Rueda and B. Salvador (2006): "Incorporating additional information to normal linear discriminant rules," J. Am. Stat. Assoc., 101, 569-577, proved that these rules have lower total misclassification probability than the usual Fisher's rule. In this paper we consider two issues. On the one hand, we compare these rules with those based on shrinkage estimators of the mean proposed by Tong, T., L. Chen and H. Zhao (2012): "Improved mean estimation and its application to diagonal discriminant analysis," Bioinformatics, 28(4): 531-537, with regard to four criteria: total misclassification probability, area under the ROC curve, well-calibratedness and refinement. On the other hand, we consider the estimation of the true error rate, which is a parameter of considerable interest in applications. We prove results on the apparent error rate of the rules that expose the need for new estimators of their true error rate. We propose four such new estimators. Two of them are defined by incorporating the additional information into the leave-one-out bootstrap. The other two are the corresponding cross-validation-after-bootstrap versions. We compare these estimators with the usual ones in a simulation study and in a cancer trial application, showing the good behavior of the rules that incorporate additional information and of the new leave-one-out bootstrap estimators of their true error rate.
|
410
|
Hajiloo M, Rabiee HR, Anooshahpour M. Fuzzy support vector machine: an efficient rule-based classification technique for microarrays. BMC Bioinformatics 2013; 14 Suppl 13:S4. [PMID: 24266942 PMCID: PMC3849760 DOI: 10.1186/1471-2105-14-s13-s4] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The abundance of gene expression microarray data has led to the development of machine learning algorithms for tackling disease diagnosis, disease prognosis, and treatment selection problems. However, these algorithms often produce classifiers with weaknesses in terms of accuracy, robustness, and interpretability. This paper introduces the fuzzy support vector machine, a learning algorithm based on a combination of fuzzy classifiers and kernel machines for microarray classification. RESULTS: Experimental results on public leukemia, prostate, and colon cancer datasets show that the fuzzy support vector machine applied in combination with filter or wrapper feature selection methods develops a robust model with higher accuracy than conventional microarray classification models such as support vector machine, artificial neural network, decision trees, k nearest neighbors, and diagonal linear discriminant analysis. Furthermore, the interpretable rule base inferred from the fuzzy support vector machine helps extract biological knowledge from microarray data. CONCLUSIONS: The fuzzy support vector machine, as a new classification model with high generalization power, robustness, and good interpretability, seems to be a promising tool for gene expression microarray classification.
|
411
|
Yamada T, Hyodo M, Seo T. The Asymptotic Approximation of EPMC for Linear Discriminant Rules Using a Moore-Penrose Inverse Matrix in High Dimension. Commun Stat Theory Methods 2013. [DOI: 10.1080/03610926.2011.628768] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/26/2022]
|
412
|
Duverle DA, Takeuchi I, Murakami-Tonami Y, Kadomatsu K, Tsuda K. Discovering combinatorial interactions in survival data. Bioinformatics 2013; 29:3053-9. [PMID: 24037215 PMCID: PMC3834797 DOI: 10.1093/bioinformatics/btt532] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/25/2022]
Abstract
Motivation: Although several methods exist to relate high-dimensional gene expression data to various clinical phenotypes, finding combinations of features in such input remains a challenge, particularly when fitting complex statistical models such as those used for survival studies. Results: Our proposed method builds on existing ‘regularization path-following’ techniques to produce regression models that can extract arbitrarily complex patterns of input features (such as gene combinations) from large-scale data that relate to a known clinical outcome. Through the use of the data’s structure and itemset mining techniques, we are able to avoid combinatorial complexity issues typically encountered with such methods, and our algorithm performs in similar orders of duration as single-variable versions. Applied to data from various clinical studies of cancer patient survival time, our method was able to produce a number of promising gene-interaction candidates whose tumour-related roles appear confirmed by literature. Availability: An R implementation of the algorithm described in this article can be found at https://github.com/david-duverle/regularisation-path-following Contact: dave.duverle@aist.go.jp Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David A Duverle
- Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan, Department of Computer Science, Nagoya Institute of Technology, Nagoya, Japan, Division of Molecular Oncology, Aichi Cancer Center, Nagoya, Japan and Department of Molecular Biology, Nagoya University Graduate School of Medicine, Nagoya, Japan
|
413
|
|
414
|
Hijazi H, Chan C. A classification framework applied to cancer gene expression profiles. J Healthc Eng 2013; 4:255-83. [PMID: 23778014 DOI: 10.1260/2040-2295.4.2.255] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/03/2022]
Abstract
Classification of cancer based on gene expression has provided insight into possible treatment strategies. Thus, developing machine learning methods that can successfully distinguish among cancer subtypes or normal versus cancer samples is important. This work discusses supervised learning techniques that have been employed to classify cancers. Furthermore, a two-step feature selection method based on an attribute estimation method (e.g., ReliefF) and a genetic algorithm was employed to find a set of genes that can best differentiate between cancer subtypes or normal versus cancer samples. The application of different classification methods (e.g., decision tree, k-nearest neighbor, support vector machine (SVM), bagging, and random forest) to 5 cancer datasets shows that no classification method universally outperforms all the others. However, k-nearest neighbor and linear SVM generally improve the classification performance over other classifiers. Finally, incorporating diverse types of genomic data (e.g., protein-protein interaction data and gene expression) increases the prediction accuracy compared with using gene expression alone.
Affiliation(s)
- Hussein Hijazi
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.
|
415
|
Huang L, Zhang HH, Zeng ZB, Bushel PR. Improved Sparse Multi-Class SVM and Its Application for Gene Selection in Cancer Classification. Cancer Inform 2013; 12:143-53. [PMID: 23966761 PMCID: PMC3740816 DOI: 10.4137/cin.s10212] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 11/05/2022] Open
Abstract
BACKGROUND: Microarray techniques provide promising tools for cancer diagnosis using gene expression profiles. However, molecular diagnosis based on high-throughput platforms presents great challenges due to the overwhelming number of variables versus the small sample size and the complex nature of multi-type tumors. Support vector machines (SVMs) have shown superior performance in cancer classification due to their ability to handle high dimensional low sample size data. The multi-class SVM algorithm of Crammer and Singer provides a natural framework for multi-class learning. Despite its effective performance, the procedure utilizes all variables without selection. In this paper, we propose to improve the procedure by imposing shrinkage penalties in learning to enforce solution sparsity. RESULTS: The original multi-class SVM of Crammer and Singer is effective for multi-class classification but does not conduct variable selection. We improved the method by introducing soft-thresholding type penalties to incorporate variable selection into multi-class classification for high dimensional data. The new methods were applied to simulated data and two cancer gene expression data sets. The results demonstrate that the new methods can select a small number of genes for building accurate multi-class classification rules. Furthermore, the important genes selected by the methods overlap significantly, suggesting general agreement among different variable selection schemes. CONCLUSIONS: High accuracy and sparsity make the new methods attractive for cancer diagnostics with gene expression data and defining targets of therapeutic intervention. AVAILABILITY: The MATLAB source code is available from http://math.arizona.edu/~hzhang/software.html.
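To make the phrase "soft-thresholding type penalties" in the abstract above concrete, the following Python sketch shows the generic soft-thresholding operator and how it zeroes out small coefficients. It is only an illustration of the general operator: the abstract does not give the exact penalties or optimization used by the authors, and the function name `soft_threshold` and the example weight matrix are hypothetical.

```python
import numpy as np


def soft_threshold(w, lam):
    """Shrink every entry of w toward zero by lam; entries with
    absolute value at most lam become exactly zero."""
    w = np.asarray(w, dtype=float)
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)


# Hypothetical weight matrix: rows are genes, columns are tumor classes.
weights = np.array([
    [1.8, -0.2, 0.1],
    [0.3, 0.1, -0.2],
    [-0.4, 2.5, 0.0],
])
shrunk = soft_threshold(weights, lam=0.5)
# A gene is kept only if at least one of its class weights survives shrinkage.
selected_genes = np.flatnonzero(np.any(shrunk != 0.0, axis=1))
```

In this toy matrix the second gene has only small weights, so all of its entries are set to zero and it drops out of the model, which is the sense in which such penalties perform variable selection.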
Affiliation(s)
- Lingkang Huang
- GlaxoSmithKline, Research and Development, Division of Quantitative Sciences, Research Triangle Park, NC 27709, USA. ; Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA. ; Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA
|
416
|
Hossain A, Willan AR, Beyene J. An Improved Method on Wilcoxon Rank Sum Test for Gene Selection from Microarray Experiments. Commun Stat Simul Comput 2013. [DOI: 10.1080/03610918.2012.667479] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/27/2022]
|
417
|
Gusnanto A, Ploner A, Shuweihdi F, Pawitan Y. Partial least squares and logistic regression random-effects estimates for gene selection in supervised classification of gene expression data. J Biomed Inform 2013; 46:697-709. [DOI: 10.1016/j.jbi.2013.05.008] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 01/18/2012] [Revised: 05/24/2013] [Accepted: 05/24/2013] [Indexed: 10/26/2022]
|
418
|
Cui Y, Zheng CH, Yang J, Sha W. Sparse maximum margin discriminant analysis for feature extraction and gene selection on gene expression data. Comput Biol Med 2013; 43:933-41. [DOI: 10.1016/j.compbiomed.2013.04.018] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 12/20/2011] [Revised: 04/25/2013] [Accepted: 04/26/2013] [Indexed: 10/26/2022]
|
419
|
A simulation to analyze feature selection methods utilizing gene ontology for gene expression classification. J Biomed Inform 2013; 46:1044-59. [PMID: 23892294 DOI: 10.1016/j.jbi.2013.07.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 09/26/2012] [Revised: 07/05/2013] [Accepted: 07/21/2013] [Indexed: 01/02/2023]
Abstract
Gene expression profile classification is a pivotal research domain assisting in the transformation from traditional to personalized medicine. A major challenge associated with gene expression data classification is the small number of samples relative to the large number of genes. To address this problem, researchers have devised various feature selection algorithms to reduce the number of genes. Recent studies have been experimenting with the use of semantic similarity between genes in Gene Ontology (GO) as a method to improve feature selection. While there are a few studies that discuss how to use GO for feature selection, there is no simulation study that addresses when to use GO-based feature selection. To investigate this, we developed a novel simulation, which generates binary class datasets, where the differentially expressed genes between two classes have some underlying relationship in GO. This allows us to investigate the effects of various factors such as the relative connectedness of the underlying genes in GO, the mean magnitude of separation between differentially expressed genes denoted by δ, and the number of training samples. Our simulation results suggest that the connectedness in GO of the differentially expressed genes for a biological condition is the primary factor for determining the efficacy of GO-based feature selection. In particular, as the connectedness of differentially expressed genes increases, the classification accuracy improvement increases. To quantify this notion of connectedness, we defined a measure called Biological Condition Annotation Level BCAL(G), where G is a graph of differentially expressed genes.
Our main conclusions with respect to GO-based feature selection are the following: (1) it increases classification accuracy when BCAL(G) ≥ 0.696; (2) it decreases classification accuracy when BCAL(G) ≤ 0.389; (3) it provides marginal accuracy improvement when 0.389 < BCAL(G) < 0.696 and δ < 1; (4) as the number of genes in a biological condition increases beyond 50 and δ ≥ 0.7, the improvement from GO-based feature selection decreases; and (5) we recommend not using GO-based feature selection when a biological condition has fewer than ten genes. Our results are derived from datasets preprocessed using RMA (Robust Multi-array Average), cases where δ is between 0.3 and 2.5, and training sample sizes between 20 and 200; our conclusions are therefore limited to these specifications. Overall, this simulation is innovative and addresses the question of when SoFoCles-style feature selection should be used for classification instead of statistics-based ranking measures.
420
Torrente A, López-Pintado S, Romo J. DepthTools: an R package for a robust analysis of gene expression data. BMC Bioinformatics 2013; 14:237. [PMID: 23885712 PMCID: PMC3750619 DOI: 10.1186/1471-2105-14-237] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 04/24/2013] [Accepted: 07/17/2013] [Indexed: 11/18/2022]
Abstract
Background The use of DNA microarrays and oligonucleotide chips of high density in modern biomedical research provides complex, high dimensional data which have been proven to convey crucial information about gene expression levels and to play an important role in disease diagnosis. Therefore, there is a need for developing new, robust statistical techniques to analyze these data. Results depthTools is an R package for a robust statistical analysis of gene expression data, based on an efficient implementation of a feasible notion of depth, the Modified Band Depth. This software includes several visualization and inference tools successfully applied to high dimensional gene expression data. A user-friendly interface is also provided via an R-commander plugin. Conclusion We illustrate the utility of the depthTools package, which could be used, for instance, to achieve a better understanding of genome-level variation between tumors and to facilitate the development of personalized treatments.
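For readers unfamiliar with the Modified Band Depth named above, the following is a minimal Python sketch of its usual definition with bands formed by pairs of samples: the depth of a sample is the average, over all pairs, of the fraction of coordinates at which the sample lies between the two samples of the pair. It is a brute-force illustration, not the efficient implementation in depthTools, and the function name is an assumption.

```python
from itertools import combinations

import numpy as np


def modified_band_depth(data: np.ndarray) -> np.ndarray:
    """Modified Band Depth (bands of two samples) for each row of `data`.

    data has shape (n_samples, n_features); returns one depth per sample.
    """
    n = data.shape[0]
    depth = np.zeros(n)
    pairs = list(combinations(range(n), 2))
    for i, j in pairs:
        lower = np.minimum(data[i], data[j])
        upper = np.maximum(data[i], data[j])
        # Fraction of coordinates where each sample lies inside the band.
        inside = (data >= lower) & (data <= upper)
        depth += inside.mean(axis=1)
    return depth / len(pairs)
```

Samples with the largest depth are the most central, so ordering by depth gives a center-outward ranking that is less sensitive to outliers than coordinate-wise means.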
Affiliation(s)
- Aurora Torrente
- Functional Genomics Team, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK.
421
Chakraborty S, Datta S, Datta S. svapls: an R package to correct for hidden factors of variability in gene expression studies. BMC Bioinformatics 2013; 14:236. [PMID: 23883280 PMCID: PMC3733742 DOI: 10.1186/1471-2105-14-236] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/15/2013] [Accepted: 07/16/2013] [Indexed: 11/23/2022]
Abstract
Background Hidden variability is a fundamentally important issue in the context of gene expression studies. Collected tissue samples may have a wide variety of hidden effects that may alter their transcriptional landscape significantly. As a result their actual differential expression pattern can be potentially distorted, leading to inaccurate results from genome-wide testing for the important transcripts. Results We present an R package svapls that can be used to identify several types of unknown sample-specific sources of heterogeneity in a gene expression study and adjust for them in order to provide a more accurate inference on the original expression pattern of the genes over different varieties of samples. The proposed method implements Partial Least Squares regression to extract the hidden signals of sample-specific heterogeneity in the data and uses them to find the genes that are actually correlated with the phenotype of interest. We also compare our package with three other popular software packages for testing differential gene expression along with a detailed illustration on the widely popular Golub dataset. Results from the sensitivity analyses on simulated data with widely different hidden variation patterns reveal the improved detection power of our R package compared to the other software packages along with reasonably smaller error rates. Application on the real-life dataset exhibits the efficacy of the R package in detecting potential batch effects from the dataset. Conclusions Overall, our R package provides the user with a simplified framework for analyzing gene expression data with a wide range of hidden variation patterns and delivering a differential gene expression analysis with substantially improved power and accuracy. The R package svapls is freely available at http://cran.r-project.org/web/packages/svapls/index.html.
Affiliation(s)
- Sutirtha Chakraborty
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY-40202, USA.
422
Dobbin KK, Song X. Sample size requirements for training high-dimensional risk predictors. Biostatistics 2013; 14:639-52. [PMID: 23873895 DOI: 10.1093/biostatistics/kxt022] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022]
Abstract
A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the high-dimensional data and cannot handle a right-censored dependent variable. We present a new training sample size method that is non-parametric with respect to the high-dimensional vectors, and is developed for a right-censored response. The method can be applied to any prediction algorithm that satisfies a set of conditions. The sample size is chosen so that the expected performance of the predictor is within a user-defined tolerance of optimal. The central method is based on a pilot dataset. To quantify uncertainty, a method to construct a confidence interval for the tolerance is developed. Adequacy of the size of the pilot dataset is discussed. An alternative model-based version of our method for estimating the tolerance when no adequate pilot dataset is available is presented. The model-based method requires a covariance matrix be specified, but we show that the identity covariance matrix provides adequate sample size when the user specifies three key quantities. Application of the sample size method to two microarray datasets is discussed.
Affiliation(s)
- Kevin K Dobbin
- College of Public Health, University of Georgia, 101 Buck Road, Athens, GA 30602, USA
423
Exploring correlations in gene expression microarray data for maximum predictive-minimum redundancy biomarker selection and classification. Comput Biol Med 2013; 43:1437-43. [PMID: 24034735 DOI: 10.1016/j.compbiomed.2013.07.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 12/26/2012] [Revised: 07/02/2013] [Accepted: 07/04/2013] [Indexed: 12/27/2022]
Abstract
An important issue in the analysis of gene expression microarray data is concerned with the extraction of valuable genetic interactions from high dimensional data sets containing gene expression levels collected for a small sample of assays. Past and ongoing research efforts have been focused on biomarker selection for phenotype classification. Usually, many genes convey useless information for classifying the outcome and should be removed from the analysis; on the other hand, some of them may be highly correlated, which reveals the presence of redundant expressed information. In this paper we propose a method for the selection of highly predictive genes having a low redundancy in their expression levels. The predictive accuracy of the selection is assessed by means of Classification and Regression Trees (CART) models which enable assessment of the performance of the selected genes for classifying the outcome variable and will also uncover complex genetic interactions. The method is illustrated throughout the paper using a public domain colon cancer gene expression data set.
424
Shao L, Fan X, Cheng N, Wu L, Cheng Y. Determination of minimum training sample size for microarray-based cancer outcome prediction-an empirical assessment. PLoS One 2013; 8:e68579. [PMID: 23861920 PMCID: PMC3702597 DOI: 10.1371/journal.pone.0068579] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 09/24/2012] [Accepted: 05/31/2013] [Indexed: 11/19/2022]
Abstract
The promise of microarray technology in providing prediction classifiers for cancer outcome estimation has been confirmed by a number of demonstrable successes. However, the reliability of prediction results relies heavily on the accuracy of statistical parameters involved in classifiers. These parameters cannot be reliably estimated with only a small number of training samples. Therefore, it is of vital importance to determine the minimum number of training samples and to ensure the clinical value of microarrays in cancer outcome prediction. We evaluated the impact of training sample size on model performance extensively based on 3 large-scale cancer microarray datasets provided by the second phase of MicroArray Quality Control project (MAQC-II). An SSNR-based (scale of signal-to-noise ratio) protocol was proposed in this study for minimum training sample size determination. External validation results based on another 3 cancer datasets confirmed that the SSNR-based approach could not only determine the minimum number of training samples efficiently, but also provide a valuable strategy for estimating the underlying performance of classifiers in advance. Once translated into clinical routine applications, the SSNR-based protocol would provide great convenience in microarray-based cancer outcome prediction in improving classifier reliability.
Affiliation(s)
- Li Shao
- Pharmaceutical Informatics Institute, School of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Xiaohui Fan
- Pharmaceutical Informatics Institute, School of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- * E-mail: (XF); (YC)
- Ningtao Cheng
- The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, Georgia, United States of America
- Leihong Wu
- Pharmaceutical Informatics Institute, School of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Yiyu Cheng
- State Key Laboratory for Diagnosis and Treatment of Infectious Disease, First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
- * E-mail: (XF); (YC)
425
Wang Z, Deisboeck TS. Mathematical modeling in cancer drug discovery. Drug Discov Today 2013; 19:145-50. [PMID: 23831857 DOI: 10.1016/j.drudis.2013.06.015] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.1] [Received: 05/01/2013] [Revised: 06/25/2013] [Accepted: 06/27/2013] [Indexed: 12/20/2022]
Abstract
Mathematical models have the potential to help discover new therapeutic targets and treatment strategies. In this review, we discuss how the latest developments in mathematical modeling can provide useful context for the rational design, validation and prioritization of novel cancer drug targets and their combinations. We give special attention to two modeling approaches: network-based modeling and multiscale modeling, because they have begun to show promise in facilitating the process of effective cancer drug discovery. Both modeling approaches are integrated with a variety of experimental methods to ensure proper parameterization and to maximize their predictive value. We also discuss several challenges faced in modeling-based drug discovery.
Affiliation(s)
- Zhihui Wang
- Department of Pathology, University of New Mexico, Albuquerque, NM 87131, USA
426
Elad T, Belkin S. Broad spectrum detection and "barcoding" of water pollutants by a genome-wide bacterial sensor array. Water Research 2013; 47:3782-3790. [PMID: 23726715 DOI: 10.1016/j.watres.2013.04.011] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 12/31/2012] [Revised: 03/05/2013] [Accepted: 04/09/2013] [Indexed: 06/02/2023]
Abstract
An approach for the rapid detection and classification of a broad spectrum of water pollutants, based on a genome-wide reporter bacterial live cell array, is proposed and demonstrated. An array of ca. 2000 Escherichia coli fluorescent transcriptional reporters was exposed to 25 toxic compounds as well as to unpolluted water, and its responses were recorded after 3 h. The 25 toxic compounds represented 5 pollutant classes: genotoxicants, metals, detergents, alcohols, and monoaromatic hydrocarbons. Identifying unique gene expression patterns, a nearest neighbour-based model detected pollutant presence and predicted class attribution with an estimated accuracy of 87%. Sensitivity and positive predictive values varied among classes, being higher for pollutant classes that were defined by mode of action than for those defined by structure only. Sensitivity for unpolluted water was 0.90 and the positive predictive value was 0.79. All pollutant classes induced the transcription of a statistically significant proportion of membrane associated genes; in addition, the sets of genes responsive to genotoxicants, detergents and alcohols were enriched with genes involved in DNA repair, iron utilization and the translation machinery, respectively. Following further development, a methodology of the type described herein may be suitable for integration in water monitoring schemes in conjunction with existing analytical and biological detection techniques.
Affiliation(s)
- Tal Elad
- Department of Plant and Environmental Sciences, The Alexander Silberman Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem 91904, Israel
427
Fan J, Liu H. Statistical analysis of big data on pharmacogenomics. Adv Drug Deliv Rev 2013; 65:987-1000. [PMID: 23602905 PMCID: PMC3701723 DOI: 10.1016/j.addr.2013.04.008] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 03/21/2013] [Revised: 04/07/2013] [Accepted: 04/10/2013] [Indexed: 01/29/2023]
Abstract
This paper discusses statistical methods for estimating complex correlation structure from large pharmacogenomic datasets. We selectively review several prominent statistical methods for estimating large covariance matrix for understanding correlation structure, inverse covariance matrix for network modeling, large-scale simultaneous tests for selecting significantly differently expressed genes and proteins and genetic markers for complex diseases, and high dimensional variable selection for identifying important molecules for understanding molecule mechanisms in pharmacogenomics. Their applications to gene network estimation and biomarker selection are used to illustrate the methodological power. Several new challenges of Big data analysis, including complex data distribution, missing data, measurement error, spurious correlation, endogeneity, and the need for robust statistical methods, are also discussed.
Affiliation(s)
- Jianqing Fan
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA.
428
Subramanian J, Simon R. Overfitting in prediction models - is it a problem only in high dimensions? Contemp Clin Trials 2013; 36:636-41. [PMID: 23811117 DOI: 10.1016/j.cct.2013.06.011] [Citation(s) in RCA: 99] [Impact Index Per Article: 8.3] [Received: 02/25/2013] [Revised: 05/29/2013] [Accepted: 06/17/2013] [Indexed: 02/07/2023]
Abstract
The growing recognition that human diseases are molecularly heterogeneous has stimulated interest in the development of prognostic and predictive classifiers for patient selection and stratification. In the process of classifier development, it has been repeatedly emphasized that in situations where the number of candidate predictor variables is much larger than the number of observations, the apparent (training set, resubstitution) accuracy of the classifiers can be highly optimistically biased and hence, classification accuracy should be reported based on evaluation of the classifier on a separate test set or using complete cross-validation. Such evaluation methods have however not been the norm in the case of low-dimensional, p<n data that arise, for example, in clinical trials when a classifier is developed on a combination of clinico-pathological variables and a small number of genetic biomarkers selected from an understanding of the biology of the disease. We undertook simulation studies to investigate the existence and extent of the problem of overfitting with low-dimensional data. The results indicate that overfitting can be a serious problem even for low-dimensional data, especially if the relationship of outcome to the set of predictor variables is not strong. We hence encourage the adoption of either a separate test set or complete cross-validation to evaluate classifier accuracy, even when the number of candidate predictor variables is substantially smaller than the number of cases.
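The gap between apparent (resubstitution) accuracy and cross-validated accuracy in the p<n setting can be seen in a small simulation. The sketch below is not the authors' simulation design: it assumes Gaussian predictors, labels that are pure noise, a least-squares linear classifier, and leave-one-out cross-validation, and all names and default values are illustrative.

```python
import numpy as np


def fit_linear_classifier(X, y):
    """Least-squares fit of -1/+1 labels on [1, X]; returns the coefficients."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Classify by the sign of the fitted linear score."""
    design = np.column_stack([np.ones(len(X)), X])
    return np.where(design @ coef >= 0, 1.0, -1.0)


def apparent_and_loocv_accuracy(n=40, p=10, n_reps=50, seed=0):
    """Average resubstitution and leave-one-out accuracy on pure-noise data."""
    rng = np.random.default_rng(seed)
    apparent, loocv = [], []
    for _ in range(n_reps):
        X = rng.normal(size=(n, p))
        y = rng.choice([-1.0, 1.0], size=n)  # labels independent of X
        coef = fit_linear_classifier(X, y)
        apparent.append(np.mean(predict(coef, X) == y))
        hits = 0
        for i in range(n):
            keep = np.arange(n) != i  # refit without sample i
            coef_i = fit_linear_classifier(X[keep], y[keep])
            hits += predict(coef_i, X[i:i + 1])[0] == y[i]
        loocv.append(hits / n)
    return float(np.mean(apparent)), float(np.mean(loocv))
```

Because the labels carry no signal, any accuracy above chance on the training data reflects overfitting alone. The cross-validation here is complete in the abstract's sense only because no variable selection precedes the fit; with a selection step, that step would also have to be repeated inside each fold.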
429
Clelland CL, Read LL, Panek LJ, Nadrich RH, Bancroft C, Clelland JD. Utilization of never-medicated bipolar disorder patients towards development and validation of a peripheral biomarker profile. PLoS One 2013; 8:e69082. [PMID: 23826396 PMCID: PMC3691117 DOI: 10.1371/journal.pone.0069082] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Received: 04/25/2012] [Accepted: 06/11/2013] [Indexed: 12/21/2022]
Abstract
There are currently no biological tests that differentiate patients with bipolar disorder (BPD) from healthy controls. While there is evidence that peripheral gene expression differences between patients and controls can be utilized as biomarkers for psychiatric illness, it is unclear whether current use or residual effects of antipsychotic and mood stabilizer medication drives much of the differential transcription. We therefore tested whether expression changes in first-episode, never-medicated BPD patients, can contribute to a biological classifier that is less influenced by medication and could potentially form a practicable biomarker assay for BPD. We employed microarray technology to measure global leukocyte gene expression in first-episode (n=3) and currently medicated BPD patients (n=26), and matched healthy controls (n=25). Following an initial feature selection of the microarray data, we developed a cross-validated 10-gene model that was able to correctly predict the diagnostic group of the training sample (26 medicated patients and 12 controls), with 89% sensitivity and 75% specificity (p<0.001). The 10-gene predictor was further explored via testing on an independent cohort consisting of three pairs of monozygotic twins discordant for BPD, plus the original enrichment sample cohort (the three never-medicated BPD patients and 13 matched control subjects), and a sample of experimental replicates (n=34). 83% of the independent test sample was correctly predicted, with a sensitivity of 67% and specificity of 100% (although this result did not reach statistical significance). Additionally, 88% of sample diagnostic classes were classified correctly for both the enrichment (p=0.015) and the replicate samples (p<0.001). We have developed a peripheral gene expression biomarker profile, that can classify healthy controls from patients with BPD receiving antipsychotic or mood stabilizing medication, which has both high sensitivity and specificity. 
Moreover, assay of three first-episode patients who had never received such medications, to first enrich the expression dataset for disease-related genes independent of medication effects, and then to test the 10-gene predictor, validates the peripheral biomarker approach for BPD.
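As a worked check of the twin-cohort figures, the reported 83% accuracy, 67% sensitivity and 100% specificity are what one would obtain if two of the three affected twins and all three unaffected twins were classified correctly. The abstract does not report these counts, so the confusion-matrix entries below are an assumption chosen to be consistent with the percentages.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Assumed counts for three discordant twin pairs (not stated in the abstract):
# 2 of 3 affected twins and 3 of 3 unaffected twins predicted correctly.
sens, spec, acc = classification_metrics(tp=2, fn=1, tn=3, fp=0)
print(f"{sens:.0%} {spec:.0%} {acc:.0%}")  # prints "67% 100% 83%"
```

With only six samples, a single misclassification moves accuracy by about 17 percentage points, which is in line with the abstract's note that this result did not reach statistical significance.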
Affiliation(s)
- Catherine L Clelland
- Department of Pathology and Cell Biology, and Taub Institute for Research on Alzheimer's Disease and the Aging Brain, Columbia University Medical Center, New York, New York, United States of America.
430
Smeekens SP, Ng A, Kumar V, Johnson MD, Plantinga TS, van Diemen C, Arts P, Verwiel ETP, Gresnigt MS, Fransen K, van Sommeren S, Oosting M, Cheng SC, Joosten LAB, Hoischen A, Kullberg BJ, Scott WK, Perfect JR, van der Meer JWM, Wijmenga C, Netea MG, Xavier RJ. Functional genomics identifies type I interferon pathway as central for host defense against Candida albicans. Nat Commun 2013; 4:1342. [PMID: 23299892 PMCID: PMC3625375 DOI: 10.1038/ncomms2343] [Citation(s) in RCA: 145] [Impact Index Per Article: 12.1] [Received: 03/14/2012] [Accepted: 11/29/2012] [Indexed: 01/30/2023]
Abstract
Candida albicans is the most common human fungal pathogen causing mucosal and systemic infections. However, human antifungal immunity remains poorly defined. Here, by integrating transcriptional analysis and functional genomics, we identified Candida-specific host defense mechanisms in humans. Candida induced significant expression of genes from the type I interferon (IFN) pathway in human peripheral blood mononuclear cells. This unexpectedly prominent role of type I IFN pathway in anti-Candida host defense was supported by additional evidence. Polymorphisms in type I IFN genes modulated Candida-induced cytokine production and were correlated with susceptibility to systemic candidiasis. In in vitro experiments, type I IFNs skewed Candida-induced inflammation from a Th17-response toward a Th1-response. Patients with chronic mucocutaneous candidiasis displayed defective expression of genes in the type I IFN pathway. These findings indicate that the type I IFN pathway is a main signature of Candida-induced inflammation and plays a crucial role in anti-Candida host defense in humans.
Affiliation(s)
- Sanne P Smeekens
- Department of Medicine (463), Radboud University Nijmegen Medical Centre, Geert Grooteplein Zuid 8, 6525GA Nijmegen, The Netherlands
431
Liang Y, Liu C, Luan XZ, Leung KS, Chan TM, Xu ZB, Zhang H. Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinformatics 2013; 14:198. [PMID: 23777239 PMCID: PMC3718705 DOI: 10.1186/1471-2105-14-198] [Citation(s) in RCA: 78] [Impact Index Per Article: 6.5] [Received: 07/04/2012] [Accepted: 05/30/2013] [Indexed: 11/21/2022]
Abstract
Background Microarray technology is widely used in cancer diagnosis. Successfully identifying gene biomarkers will significantly help to classify different cancer types and improve the prediction accuracy. The regularization approach is one of the effective methods for gene selection in microarray data, which generally contain a large number of genes and have a small number of samples. In recent years, various approaches have been developed for gene selection of microarray data. Generally, they are divided into three categories: filter, wrapper and embedded methods. Regularization methods are an important embedded technique and perform both continuous shrinkage and automatic gene selection simultaneously. Recently, there has been growing interest in applying the regularization techniques in gene selection. A popular regularization technique is the Lasso (L1), and many L1 type regularization terms have been proposed in recent years. Theoretically, the Lq type regularization with the lower value of q would lead to better solutions with more sparsity. Moreover, the L1/2 regularization can be taken as a representative of Lq (0 < q < 1) regularizations and has been shown to have many attractive properties. Results In this work, we investigate a sparse logistic regression with the L1/2 penalty for gene selection in cancer classification problems, and propose a coordinate descent algorithm with a new univariate half thresholding operator to solve the L1/2 penalized logistic regression. Experimental results on artificial and microarray data demonstrate the effectiveness of our proposed approach compared with other regularization methods. In particular, for 4 publicly available gene expression datasets, the L1/2 regularization method achieved its success using only about 2 to 14 predictors (genes), compared to about 6 to 38 genes for ordinary L1 and elastic net regularization approaches.
Conclusions From our evaluations, it is clear that the sparse logistic regression with the L1/2 penalty achieves higher classification accuracy than ordinary L1 and elastic net regularization approaches, while fewer but informative genes are selected. This is an important consideration for screening and diagnostic applications, where the goal is often to develop an accurate test using as few features as possible in order to control cost. Therefore, the sparse logistic regression with the L1/2 penalty is an effective technique for gene selection in real classification problems.
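To make the penalized objective concrete, the sketch below evaluates a logistic loss with an Lq penalty, where q = 1/2 gives the L1/2 case and q = 1 the Lasso. It only evaluates the objective and does not implement the coordinate descent algorithm or the half thresholding operator proposed in the paper. The function name, the 0/1 label coding, the averaging of the loss, and the absence of an unpenalized intercept are simplifying assumptions.

```python
import numpy as np


def penalized_logistic_objective(beta, X, y, lam, q=0.5):
    """Mean logistic loss plus lam * sum(|beta_j| ** q); y is coded 0/1."""
    z = X @ beta
    # log(1 + exp(z)) computed stably via logaddexp.
    nll = np.mean(np.logaddexp(0.0, z) - y * z)
    return nll + lam * np.sum(np.abs(beta) ** q)
```

The preference for sparsity shows up in the penalty term alone: the vectors (4, 0) and (2, 2) have the same L1 norm of 4, but their L1/2 penalties are 2 and about 2.83, so the L1/2 penalty favors concentrating weight on fewer genes.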
Affiliation(s)
- Yong Liang
- Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Macau, China.
432
BreastPRS is a gene expression assay that stratifies intermediate-risk Oncotype DX patients into high- or low-risk for disease recurrence. Breast Cancer Res Treat 2013; 139:705-15. [PMID: 23774991 DOI: 10.1007/s10549-013-2604-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/02/2013] [Accepted: 06/08/2013] [Indexed: 10/26/2022]
Abstract
Molecular prognostic assays, such as Oncotype DX, are increasingly incorporated into the management of patients with invasive breast carcinoma. BreastPRS is a new molecular assay developed and validated from a meta-analysis of publicly available genomic datasets. We applied the assay to matched fresh-frozen (FF) and formalin-fixed paraffin-embedded (FFPE) tumor samples to translate the assay to FFPE. A linear relationship of the BreastPRS prognostic score was observed between tissue preservation formats. BreastPRS recurrence scores were compared with Oncotype DX recurrence scores from 246 patients with invasive breast carcinoma and known Oncotype DX results. Using this series, a 120-gene Oncotype DX approximation algorithm was trained to predict Oncotype DX risk groups and then applied to series of untreated, node-negative, estrogen receptor (ER)-positive patients from previously published studies with known clinical outcomes. Correlation of recurrence score and risk group between Oncotype DX and BreastPRS was statistically significant (P < 0.0001). 59 of 260 (23%) patients from four previously published studies were classified as intermediate-risk when the 120-gene Oncotype DX approximation algorithm was applied. BreastPRS reclassified the 59 patients into binary risk groups (high- vs. low-risk). 23 (39%) patients were classified as low-risk and 36 (61%) as high-risk (P = 0.029, HR: 3.64, 95% CI: 1.40-9.50). At 10 years from diagnosis, the low-risk group had a 90% recurrence-free survival (RFS) rate compared to 60% for the high-risk group. BreastPRS recurrence score is comparable with Oncotype DX and can reclassify Oncotype DX intermediate-risk patients into two groups with significant differences in RFS. Further studies are needed to validate these findings.
433
Wu MY, Dai DQ, Zhang XF, Zhu Y. Cancer subtype discovery and biomarker identification via a new robust network clustering algorithm. PLoS One 2013; 8:e66256. [PMID: 23799085 PMCID: PMC3684607 DOI: 10.1371/journal.pone.0066256] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 02/08/2013] [Accepted: 05/02/2013] [Indexed: 11/29/2022]
Abstract
In cancer biology, it is very important to understand the phenotypic changes of the patients and discover new cancer subtypes. Recently, microarray-based technologies have shed light on this problem based on gene expression profiles which may contain outliers due to either chemical or electrical reasons. These undiscovered subtypes may be heterogeneous with respect to underlying networks or pathways, and are related to only a few interdependent biomarkers. This motivates a need for robust gene expression-based methods capable of discovering such subtypes, elucidating the corresponding network structures and identifying cancer related biomarkers. This study proposes a penalized model-based Student’s t clustering with unconstrained covariance (PMT-UC) to discover cancer subtypes with cluster-specific networks, taking gene dependencies into account and having robustness against outliers. Meanwhile, biomarker identification and network reconstruction are achieved by imposing an adaptive penalty on the means and the inverse scale matrices. The model is fitted via the expectation maximization algorithm utilizing the graphical lasso. Here, a network-based gene selection criterion that identifies biomarkers not as individual genes but as subnetworks is applied. This allows us to implicate weakly discriminative biomarkers which play a central role in the subnetwork by interconnecting many differentially expressed genes, or have cluster-specific underlying network structures. Experimental results on simulated datasets and one available cancer dataset attest to the effectiveness and robustness of PMT-UC in cancer subtype discovery. Moreover, PMT-UC has the ability to select cancer related biomarkers which have been verified in biochemical or biomedical research and to learn biologically significant correlations among genes.
Affiliation(s)
- Meng-Yun Wu
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Dao-Qing Dai
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- * E-mail:
- Xiao-Fei Zhang
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Yuan Zhu
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Department of Mathematics, Guangdong University of Business Studies, Guangzhou, China
434
Abstract
Classification and prediction problems using spectral data lead to high-dimensional data sets. Spectral data are, however, different from most other high-dimensional data sets in that information usually varies smoothly with wavelength, suggesting that fitted models should also vary smoothly with wavelength. Functional data analysis, widely used in the analysis of spectral data, meets this objective by changing perspective from the raw spectra to approximations using smooth basis functions. This paper explores linear regression and linear discriminant analysis fitted directly to the spectral data, imposing penalties on the values and roughness of the fitted coefficients, and shows by example that this can lead to better fits than existing standard methodologies.
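One common way to penalize both the values and the roughness of regression coefficients is a ridge-type fit with an added second-difference penalty. The sketch below assumes squared penalties, which the abstract does not specify, so it should be read as one plausible instance of the idea and not as the paper's method. The function and parameter names are hypothetical.

```python
import numpy as np


def smooth_penalized_regression(X, y, lam_value, lam_rough):
    """Minimise ||y - Xb||^2 + lam_value * ||b||^2 + lam_rough * ||D b||^2.

    D is the second-difference operator, so ||D b||^2 measures how rough the
    coefficient curve is across adjacent wavelengths (columns of X).
    """
    p = X.shape[1]
    D = np.diff(np.eye(p), n=2, axis=0)  # shape (p - 2, p)
    A = X.T @ X + lam_value * np.eye(p) + lam_rough * (D.T @ D)
    return np.linalg.solve(A, X.T @ y)
```

Raising `lam_rough` pulls the coefficient curve toward a straight line in wavelength index, while `lam_value` shrinks the coefficients toward zero and keeps the system solvable when there are more wavelengths than samples.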
|
435
|
Fallon BP, Curnutte B, Maupin KA, Partyka K, Choi S, Brand RE, Langmead CJ, Tembe W, Haab BB. The Marker State Space (MSS) method for classifying clinical samples. PLoS One 2013; 8:e65905. [PMID: 23750276 PMCID: PMC3672150 DOI: 10.1371/journal.pone.0065905] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 03/05/2013] [Accepted: 04/30/2013] [Indexed: 01/30/2023] Open
Abstract
The development of accurate clinical biomarkers has been challenging in part due to the diversity between patients and diseases. One approach to account for the diversity is to use multiple markers to classify patients, based on the concept that each individual marker contributes information from its respective subclass of patients. Here we present a new strategy for developing biomarker panels that accounts for completely distinct patient subclasses. Marker State Space (MSS) defines “marker states” based on all possible patterns of high and low values among a panel of markers. Each marker state is defined as either a case state or a control state, and a sample is classified as case or control based on the state it occupies. MSS was used to define multi-marker panels that were robust in cross validation and training-set/test-set analyses and that yielded similar classification accuracy to several other classification algorithms. A three-marker panel for discriminating pancreatic cancer patients from control subjects revealed subclasses of patients based on distinct marker states. MSS provides a straightforward approach for modeling highly divergent subclasses of patients, which may be adaptable for diverse applications.
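The marker-state idea in this abstract can be sketched as follows. This is an illustration only: the per-marker thresholds, the majority-vote rule for labelling a state, and the fallback label for unseen states are assumptions, since the abstract does not say how states are assigned. All function names and data values are hypothetical.

```python
from collections import Counter, defaultdict

def marker_state(values, thresholds):
    """Map marker values to a tuple of 0/1 flags (1 = high, above the threshold)."""
    return tuple(int(v > t) for v, t in zip(values, thresholds))

def fit_state_labels(samples, labels, thresholds):
    """Label each observed marker state by majority vote of its training samples
    (an assumed rule, not taken from the paper)."""
    votes = defaultdict(Counter)
    for values, label in zip(samples, labels):
        votes[marker_state(values, thresholds)][label] += 1
    return {state: counts.most_common(1)[0][0] for state, counts in votes.items()}

def classify(values, thresholds, state_labels, default="control"):
    """Classify a sample by the state it occupies; unseen states get `default`."""
    return state_labels.get(marker_state(values, thresholds), default)

# Hypothetical three-marker panel with made-up thresholds and training data.
thresholds = [1.0, 2.0, 0.5]
train_x = [[1.5, 2.5, 0.1], [1.4, 2.2, 0.2], [0.2, 0.3, 0.1],
           [0.1, 2.9, 0.9], [0.3, 0.5, 0.2]]
train_y = ["case", "case", "control", "case", "control"]

state_labels = fit_state_labels(train_x, train_y, thresholds)
print(classify([1.2, 3.0, 0.0], thresholds, state_labels))  # state (1, 1, 0)
print(classify([0.0, 0.0, 0.0], thresholds, state_labels))  # state (0, 0, 0)
```

With three markers there are at most eight states, so the fitted table doubles as a readable description of which patient subclasses count as cases.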
Affiliation(s)
- Brian P. Fallon
- Laboratory of Cancer Immunodiagnostics, Van Andel Institute, Grand Rapids, Michigan, United States of America
| | - Bryan Curnutte
- Laboratory of Cancer Immunodiagnostics, Van Andel Institute, Grand Rapids, Michigan, United States of America
| | - Kevin A. Maupin
- Laboratory of Cancer Immunodiagnostics, Van Andel Institute, Grand Rapids, Michigan, United States of America
| | - Katie Partyka
- Laboratory of Cancer Immunodiagnostics, Van Andel Institute, Grand Rapids, Michigan, United States of America
| | - Sunguk Choi
- Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| | - Randall E. Brand
- University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania, United States of America
| | | | - Waibhav Tembe
- Translational Genomics Research Institute, Phoenix, Arizona, United States of America
| | - Brian B. Haab
- Laboratory of Cancer Immunodiagnostics, Van Andel Institute, Grand Rapids, Michigan, United States of America
|
436
|
Hosseinzadeh F, Kayvanjoo AH, Ebrahimi M, Goliaei B. Prediction of lung tumor types based on protein attributes by machine learning algorithms. SPRINGERPLUS 2013; 2:238. [PMID: 23888262 PMCID: PMC3710575 DOI: 10.1186/2193-1801-2-238] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 01/16/2013] [Accepted: 03/21/2013] [Indexed: 01/15/2023]
Abstract
Early diagnosis of lung cancer and distinction between the tumor types, Small Cell Lung Cancer (SCLC) and Non-Small Cell Lung Cancer (NSCLC), are very important for increasing the survival rate of patients. Herein, we propose a diagnostic system based on sequence-derived structural and physicochemical attributes of proteins that are involved in both types of tumors, built via feature extraction, feature selection and prediction models. In total, 1497 protein attributes were computed and important features were selected by 12 attribute weighting models. Finally, machine learning models consisting of seven SVM models, three ANN models and two NB models were applied to the original database and to the newly created ones from the attribute weighting models; model accuracies were calculated through 10-fold cross-validation and wrapper validation (the latter for SVM algorithms only). In line with our previous findings, dipeptide composition, autocorrelation and distribution descriptors were the most important protein features selected by bioinformatics tools. The algorithms' performance in lung cancer tumor type prediction increased when they were applied to datasets created by attribute weighting models rather than to the original dataset. Wrapper-Validation performed better than X-Validation; the best cancer type prediction resulted from the SVM and SVM Linear models (82%). The best ANN accuracy was obtained when the Neural Net model was applied to the SVM dataset (88%). This is the first report suggesting that the combination of protein features and attribute weighting models with machine learning algorithms can be effectively used to predict the type of lung cancer tumors (SCLC and NSCLC).
Affiliation(s)
- Faezeh Hosseinzadeh
- Laboratory of biophysics and molecular biology, Institute of Biophysics and Biochemistry (IBB), University of Tehran, Tehran, Iran
|
437
|
Fridlyand J, Yeh RF, Mackey H, Bengtsson T, Delmar P, Spaniolo G, Lieberman G. An industry statistician's perspective on PHC drug development. Contemp Clin Trials 2013; 36:624-35. [PMID: 23648396 DOI: 10.1016/j.cct.2013.04.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/03/2013] [Revised: 04/11/2013] [Accepted: 04/25/2013] [Indexed: 10/26/2022]
Abstract
In the past decade, the cost of drug development has increased significantly. The estimates vary widely but frequently quoted numbers are staggering: it takes 10-15 years and billions of dollars to bring a drug to patients. To a large extent this is due to many long, expensive and ultimately unsuccessful drug trials. While one approach to combat the low yield on investment could be to continue searching for new blockbusters, an alternative method would lead us to focus on testing new targeted treatments that have a strong underlying scientific rationale and are more likely to provide enhanced clinical benefit in population subsets defined by molecular diagnostics. Development of these new treatments, however, cannot follow the usual established path; new strategies and approaches are required for the co-development of novel therapeutics and the diagnostic. In this paper we will review, from the point of view of industry, the approaches to, and challenges of, drug development strategies incorporating predictive biomarkers into clinical programs. We will outline the basic concepts behind co-development with predictive biomarkers and summarize the current regulatory paradigm. We will present guiding principles of personalized health care (PHC) development and review the statistical, strategic, regulatory and operational challenges that statisticians regularly encounter on development programs with a PHC component. Some practical recommendations for team statisticians involved in PHC drug development are included. The majority of the examples and recommendations are drawn from oncology but broader concepts apply across all therapeutic areas.
|
438
|
Roy A, Mackin PD, Mukhopadhyay S. Methods for pattern selection, class-specific feature selection and classification for automated learning. Neural Netw 2013; 41:113-29. [DOI: 10.1016/j.neunet.2012.12.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 02/27/2012] [Revised: 10/15/2012] [Accepted: 12/17/2012] [Indexed: 11/28/2022]
|
439
|
Drake JI, Gomez-Arroyo J, Dumur CI, Kraskauskas D, Natarajan R, Bogaard HJ, Fawcett P, Voelkel NF. Chronic carvedilol treatment partially reverses the right ventricular failure transcriptional profile in experimental pulmonary hypertension. Physiol Genomics 2013; 45:449-61. [PMID: 23632417 DOI: 10.1152/physiolgenomics.00166.2012] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Indexed: 01/12/2023] Open
Abstract
Right ventricular failure (RVF) is the most frequent cause of death in patients with pulmonary arterial hypertension (PAH); however, specific therapies targeted to treat RVF have not been developed. Chronic treatment with carvedilol has been shown to reduce established maladaptive right ventricle (RV) hypertrophy and to improve RV function in experimental PAH. However, the mechanisms by which carvedilol improves RVF are unknown. We have previously demonstrated by microarray analysis that RVF is characterized by a distinct gene expression profile when compared with functional, compensatory hypertrophy. We next sought to identify the effects of carvedilol on gene expression on a genome-wide basis. PAH and RVF were induced in male Sprague-Dawley rats by the combination of VEGF-receptor blockade and chronic hypoxia. After RVF was established, rats were treated with carvedilol or vehicle for 4 wk. RNA was isolated from RV tissue and hybridized for microarray analysis. An initial prediction analysis of carvedilol-treated RVs showed that the gene expression profile resembled the RVF prediction set. However, a more extensive analysis revealed a small group of genes differentially expressed after carvedilol treatment. Further analysis categorized these genes in pathways involved in cardiac hypertrophy, mitochondrial dysfunction, and protein ubiquitination. Genes encoding proteins in the cardiac hypertrophy and protein ubiquitination pathways were downregulated in the RV by carvedilol, while genes encoding proteins in the mitochondrial dysfunction pathway were upregulated by carvedilol. These gene expression changes may explain some of the mechanisms that underlie the functional improvement of the RV after carvedilol treatment.
Affiliation(s)
- Jennifer I Drake
- Victoria Johnson Center for Lung Obstructive Disease Research, Virginia Commonwealth University, Richmond, Virginia, USA
|
440
|
Abstract
This article proposes a method for multiclass classification problems using ensembles of multinomial logistic regression models. A multinomial logit model is used as a base classifier in ensembles from random partitions of predictors. The multinomial logit model can be applied to each mutually exclusive subset of the feature space without variable selection. By combining multiple models the proposed method can handle a huge database without a constraint needed for analyzing high-dimensional data, and the random partition can improve the prediction accuracy by reducing the correlation among base classifiers. The proposed method is implemented using R, and the performance including overall prediction accuracy, sensitivity, and specificity for each category is evaluated on two real data sets and simulation data sets. To investigate the quality of prediction in terms of sensitivity and specificity, the area under the receiver operating characteristic (ROC) curve (AUC) is also examined. The performance of the proposed model is compared to a single multinomial logit model and it shows a substantial improvement in overall prediction accuracy. The proposed method is also compared with other classification methods such as the random forest, support vector machines, and random multinomial logit model.
Affiliation(s)
- Kyewon Lee
- Department of Applied Mathematics and Statistics , Stony Brook University , Stony Brook , NY, USA
|
441
|
Genomic biomarkers for personalized medicine: development and validation in clinical studies. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2013; 2013:865980. [PMID: 23690882 PMCID: PMC3652056 DOI: 10.1155/2013/865980] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 01/26/2013] [Accepted: 03/22/2013] [Indexed: 12/26/2022]
Abstract
The establishment of high-throughput technologies has brought substantial advances to our understanding of the biology of many diseases at the molecular level and increasing expectations on the development of innovative molecularly targeted treatments and molecular biomarkers or diagnostic tests in the context of clinical studies. In this review article, we position the two critical statistical analyses of high-dimensional genomic data, gene screening and prediction, in the framework of development and validation of genomic biomarkers or signatures, through taking into consideration the possible different strategies for developing genomic signatures. A wide variety of biomarker-based clinical trial designs to assess clinical utility of a biomarker or a new treatment with a companion biomarker are also discussed.
|
442
|
Winslow RL, Trayanova N, Geman D, Miller MI. Computational medicine: translating models to clinical care. Sci Transl Med 2013; 4:158rv11. [PMID: 23115356 DOI: 10.1126/scitranslmed.3003528] [Citation(s) in RCA: 119] [Impact Index Per Article: 9.9] [Indexed: 12/18/2022]
Abstract
Because of the inherent complexity of coupled nonlinear biological systems, the development of computational models is necessary for achieving a quantitative understanding of their structure and function in health and disease. Statistical learning is applied to high-dimensional biomolecular data to create models that describe relationships between molecules and networks. Multiscale modeling links networks to cells, organs, and organ systems. Computational approaches are used to characterize anatomic shape and its variations in health and disease. In each case, the purposes of modeling are to capture all that we know about disease and to develop improved therapies tailored to the needs of individuals. We discuss advances in computational medicine, with specific examples in the fields of cancer, diabetes, cardiology, and neurology. Advances in translating these computational methods to the clinic are described, as well as challenges in applying models for improving patient health.
Affiliation(s)
- Raimond L Winslow
- The Institute for Computational Medicine, Center for Cardiovascular Bioinformatics and Modeling, and Department of Biomedical Engineering, The Johns Hopkins University School of Medicine, Baltimore, MD 21218, USA.
|
443
|
de Ridder D, de Ridder J, Reinders MJT. Pattern recognition in bioinformatics. Brief Bioinform 2013; 14:633-47. [DOI: 10.1093/bib/bbt020] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Indexed: 11/14/2022] Open
|
444
|
Ding W, Qiu P, Liu YH, Feng W. Current Omics Technologies in Biomarker Discovery. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022] Open
Abstract
Biomarkers are playing an increasingly important role in drug discovery and development and can be applied for many purposes, including disease mechanism study, diagnosis, prognosis, staging, and treatment selection. Advances in high-throughput “omics” technologies, including genomics, transcriptomics, proteomics and metabolomics, significantly accelerate the pace of biomarker discovery. Comprehensive molecular profiling using these “omics” technologies has become a field of intensive research aiming at identifying biomarkers relevant for improved diagnostics and therapeutics. Although each “omics” technology plays an important role in biomarker research, different “omics” platforms have different strengths and limitations. This chapter aims to give an overview of these “omics” technologies and their current applications in biomarker discovery.
|
445
|
Lin TC, Liu RS, Chao YT, Chen SY. Classifying subtypes of acute lymphoblastic leukemia using silhouette statistics and genetic algorithms. Gene 2013; 518:159-63. [DOI: 10.1016/j.gene.2012.11.046] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 10/11/2012] [Accepted: 11/27/2012] [Indexed: 11/27/2022]
|
446
|
Genotype-phenotype matching analysis of 38 Lactococcus lactis strains using random forest methods. BMC Microbiol 2013; 13:68. [PMID: 23530958 PMCID: PMC3637802 DOI: 10.1186/1471-2180-13-68] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Received: 11/23/2012] [Accepted: 03/20/2013] [Indexed: 12/22/2022] Open
Abstract
Background Lactococcus lactis is used in dairy food fermentation and for the efficient production of industrially relevant enzymes. The genome content and different phenotypes have been determined for multiple L. lactis strains in order to understand intra-species genotype and phenotype diversity and annotate gene functions. In this study, we identified relations between gene presence and a collection of 207 phenotypes across 38 L. lactis strains of dairy and plant origin. Gene occurrence and phenotype data were used in an iterative gene selection procedure, based on the Random Forest algorithm, to identify genotype-phenotype relations. Results A total of 1388 gene-phenotype relations were found, some of which confirmed known gene-phenotype relations, such as the importance of arabinose utilization genes only for strains of plant origin. We also identified a gene cluster related to growth on melibiose, a plant disaccharide; this cluster is present only in melibiose-positive strains and can be used as a genetic marker in trait improvement. Additionally, several novel gene-phenotype relations were uncovered, for instance, genes related to arsenite resistance or arginine metabolism. Conclusions Our results indicate that genotype-phenotype matching by integrating large data sets makes it possible to identify gene-phenotype relations and possibly improve gene function annotation, and that the identified relations can be used for screening bacterial culture collections for desired phenotypes. In addition to all gene-phenotype relations, we also provide coherent phenotype data for 38 Lactococcus strains assessed in 207 different phenotyping experiments, which to our knowledge is the largest to date for the Lactococcus lactis species.
|
447
|
Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics 2013; 14:106. [PMID: 23522326 PMCID: PMC3648438 DOI: 10.1186/1471-2105-14-106] [Citation(s) in RCA: 342] [Impact Index Per Article: 28.5] [Received: 07/24/2012] [Accepted: 02/22/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data. RESULTS While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards the classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers for high-dimensional data if the number of variables is reduced by performing some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values while it decreases the data variability and it introduces correlation between samples. We explain how our findings impact the class-prediction for high-dimensional data. CONCLUSIONS In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
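For readers unfamiliar with SMOTE, the oversampling step can be sketched as below. This is a simplified illustration, not the authors' code: each synthetic sample is placed at a random position on the segment between a minority-class sample and one of its k nearest minority-class neighbours. The function name, parameters and toy data are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (Euclidean distance). Needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Nearest neighbours of x among the other minority samples.
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb, in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# Toy minority class in two dimensions (made-up values).
minority = [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0], [2.0, 1.0], [4.0, 2.0]]
synthetic = smote(minority, n_new=10, k=3)
print(len(synthetic))
```

Because every synthetic point lies between two existing minority points, the new samples never fall outside the range already spanned by the minority class, which is in line with the abstract's remark that SMOTE decreases data variability.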
Affiliation(s)
- Rok Blagus
- Institute for Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
|
448
|
Ullah S, Finch CF. Applications of functional data analysis: A systematic review. BMC Med Res Methodol 2013; 13:43. [PMID: 23510439 PMCID: PMC3626842 DOI: 10.1186/1471-2288-13-43] [Citation(s) in RCA: 95] [Impact Index Per Article: 7.9] [Received: 06/28/2012] [Accepted: 03/04/2013] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Functional data analysis (FDA) is increasingly being used to better analyze, model and predict time series data. Key aspects of FDA include the choice of smoothing technique, data reduction, adjustment for clustering, functional linear modeling and forecasting methods. METHODS A systematic review using 11 electronic databases was conducted to identify FDA application studies published in the peer-reviewed literature during 1995-2010. Papers reporting methodological considerations only were excluded, as were non-English articles. RESULTS In total, 84 FDA application articles were identified; 75.0% of the reviewed articles have been published since 2005. Application of FDA has appeared in a large number of publications across various fields of sciences; the majority is related to biomedicine applications (21.4%). Overall, 72 studies (85.7%) provided information about the type of smoothing techniques used, with B-spline smoothing (29.8%) being the most popular. Functional principal component analysis (FPCA) for extracting information from functional data was reported in 51 (60.7%) studies. One-quarter (25.0%) of the published studies used functional linear models to describe relationships between explanatory and outcome variables and only 8.3% used FDA for forecasting time series data. CONCLUSIONS Despite its clear benefits for analyzing time series data, full appreciation of the key features and value of FDA has been limited to date, though the applications show its relevance to many public health and biomedical problems. Wider application of FDA to all studies involving correlated measurements should allow better modeling of, and predictions from, such data in the future, especially as FDA makes no a priori age and time effects assumptions.
Affiliation(s)
- Shahid Ullah
- Flinders Centre for Epidemiology and Biostatistics, School of Medicine, Faculty of Health Sciences, Flinders University, Adelaide, SA, 5001, Australia
| | - Caroline F Finch
- Centre for Healthy and Safe Sports (CHASS), University of Ballarat, SMB Campus, Ballarat, VIC, 3353, Australia
|
449
|
Stem cell-like gene expression in ovarian cancer predicts type II subtype and prognosis. PLoS One 2013; 8:e57799. [PMID: 23536770 PMCID: PMC3594231 DOI: 10.1371/journal.pone.0057799] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Received: 08/17/2012] [Accepted: 01/29/2013] [Indexed: 01/04/2023] Open
Abstract
Although ovarian cancer is often initially chemotherapy-sensitive, the vast majority of tumors eventually relapse and patients die of increasingly aggressive disease. Cancer stem cells are believed to have properties that allow them to survive therapy and may drive recurrent tumor growth. Cancer stem cells or cancer-initiating cells are a rare cell population and difficult to isolate experimentally. Genes that are expressed by stem cells may characterize a subset of less differentiated tumors and aid in prognostic classification of ovarian cancer. The purpose of this study was the genomic identification and characterization of a subtype of ovarian cancer that has stem cell-like gene expression. Using human and mouse gene signatures of embryonic, adult, or cancer stem cells, we performed an unsupervised bipartition class discovery on expression profiles from 145 serous ovarian tumors to identify a stem-like subgroup and a more differentiated subgroup. Subtypes were reproducible and were further characterized in four independent, heterogeneous ovarian cancer datasets. We identified a stem-like subtype characterized by a 51-gene signature, which is significantly enriched in tumors with properties of Type II ovarian cancer: high grade, serous tumors, and poor survival. Conversely, the differentiated tumors share properties with Type I, including lower grade and mixed histological subtypes. The stem cell-like signature was prognostic within high-stage serous ovarian cancer, classifying a small subset of high-stage tumors with better prognosis, in the differentiated subtype. In multivariate models that adjusted for common clinical factors (including grade, stage, age), the subtype classification was still a significant predictor of relapse.
The prognostic stem-like gene signature yields new insights into prognostic differences in ovarian cancer, provides a genomic context for defining Type I/II subtypes, and identifies potential gene targets which, following further validation, may be valuable in the clinical management or treatment of ovarian cancer.
|
450
|
Ramey JA, Young PD. A comparison of regularization methods applied to the linear discriminant function with high-dimensional microarray data. J STAT COMPUT SIM 2013. [DOI: 10.1080/00949655.2011.625946] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/16/2022]
|