451. Improved shrunken centroid classifiers for high-dimensional class-imbalanced data. BMC Bioinformatics 2013; 14:64. [PMID: 23433084 PMCID: PMC3687811 DOI: 10.1186/1471-2105-14-64] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 08/16/2012] [Accepted: 01/31/2013] [Indexed: 11/21/2022] Open
Abstract
Background PAM, a nearest shrunken centroid method (NSC), is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate. Results We show that when data are class-imbalanced the three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables or the class imbalance is larger and/or the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means). Conclusions The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy based on the minimization of the overall error rate when NSC classifiers are biased towards the majority class. The number of variables included in the NSC classifiers when using our approach is much smaller than with the original approach. This result is supported by experiments on simulated and real high-dimensional class-imbalanced data.
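The g-means criterion named in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names and the toy labels are assumptions made for illustration, and the cross-validation loop over shrinkage values is omitted. It only shows why the geometric mean of class-specific accuracies penalizes a classifier that ignores the minority class, while the overall accuracy does not.

```python
from math import prod


def class_accuracies(y_true, y_pred):
    """Return {class label: fraction of that class predicted correctly}."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return {c: hits[c] / totals[c] for c in totals}


def g_means(y_true, y_pred):
    """Geometric mean of the class-specific accuracies."""
    accs = list(class_accuracies(y_true, y_pred).values())
    return prod(accs) ** (1 / len(accs))


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 9 majority samples, 1 minority sample.
y_true = ["maj"] * 9 + ["min"]
always_majority = ["maj"] * 10                 # ignores the minority class
balanced_guess = ["maj"] * 8 + ["min", "min"]  # one majority error, minority correct

print(overall_accuracy(y_true, always_majority), g_means(y_true, always_majority))
print(overall_accuracy(y_true, balanced_guess), g_means(y_true, balanced_guess))
```

In this toy case both predictions have the same overall accuracy, but only the second has a non-zero g-means, so a tuning procedure that maximizes g-means would prefer it.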

452. Hajiloo M, Sapkota Y, Mackey JR, Robson P, Greiner R, Damaraju S. ETHNOPRED: a novel machine learning method for accurate continental and sub-continental ancestry identification and population stratification correction. BMC Bioinformatics 2013; 14:61. [PMID: 23432980 PMCID: PMC3618021 DOI: 10.1186/1471-2105-14-61] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 01/28/2013] [Accepted: 02/14/2013] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case-control genome wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification but each has limitations. We provide an alternative technique to address population stratification. RESULTS We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees, capable of accurately predicting an individual's continental and sub-continental ancestry. To predict an individual's continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross validation accuracy of 100% using HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of ≥ 99.9%, even if some of those 149 SNP values were missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control's λ from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). 
In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively involving 31, 502, 526, 242 and 271 SNPs, with 10-fold cross validation accuracy of 86.5% ± 2.4%, 95.6% ± 3.9%, 95.6% ± 2.1%, 98.3% ± 2.0%, and 95.9% ± 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing vs. Chinese in Denver. CONCLUSIONS ETHNOPRED is a novel technique for producing classifiers that can identify an individual's continental and sub-continental heritage, based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large scale GWASs, and robust to missing values.
Affiliation(s)
- Mohsen Hajiloo
- Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
- Alberta Innovates Centre for Machine Learning, University of Alberta, Edmonton, Alberta, Canada
- Yadav Sapkota
- Department of Laboratory Medicine and Pathology, University of Alberta, Edmonton, Alberta, Canada
- Cancer Care, Alberta Health Services, Edmonton, Alberta, Canada
- John R Mackey
- Department of Oncology, University of Alberta, Edmonton, Alberta, Canada
- Cancer Care, Alberta Health Services, Edmonton, Alberta, Canada
- Paula Robson
- Cancer Care, Alberta Health Services, Edmonton, Alberta, Canada
- Department of Agricultural, Food and Nutritional Sciences, University of Alberta, Edmonton, Alberta, Canada
- Russell Greiner
- Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
- Alberta Innovates Centre for Machine Learning, University of Alberta, Edmonton, Alberta, Canada
- Sambasivarao Damaraju
- Department of Laboratory Medicine and Pathology, University of Alberta, Edmonton, Alberta, Canada
- Cancer Care, Alberta Health Services, Edmonton, Alberta, Canada

453. Ulfenborg B, Klinga-Levan K, Olsson B. Classification of tumor samples from expression data using decision trunks. Cancer Inform 2013; 12:53-66. [PMID: 23467331 PMCID: PMC3579425 DOI: 10.4137/cin.s10356] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022] Open
Abstract
We present a novel machine learning approach for the classification of cancer samples using expression data. We refer to the method as “decision trunks,” since it is loosely based on decision trees, but contains several modifications designed to achieve an algorithm that: (1) produces smaller and more easily interpretable classifiers than decision trees; (2) is more robust in varying application scenarios; and (3) achieves higher classification accuracy. The decision trunk algorithm has been implemented and tested on 26 classification tasks, covering a wide range of cancer forms, experimental methods, and classification scenarios. This comprehensive evaluation indicates that the proposed algorithm performs at least as well as the current state of the art algorithms in terms of accuracy, while producing classifiers that include on average only 2–3 markers. We suggest that the resulting decision trunks have clear advantages over other classifiers due to their transparency, interpretability, and their correspondence with human decision-making and clinical testing practices.
Affiliation(s)
- Benjamin Ulfenborg
- Systems Biology Research Centre, School of Life Sciences, University of Skövde, Skövde, Sweden

454. An extension of PPLS-DA for classification and comparison to ordinary PLS-DA. PLoS One 2013; 8:e55267. [PMID: 23408965 PMCID: PMC3569448 DOI: 10.1371/journal.pone.0055267] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 01/17/2012] [Accepted: 12/27/2012] [Indexed: 11/19/2022] Open
Abstract
Classification studies are widely applied, e.g. in biomedical research, to classify objects/patients into predefined groups. The goal is to find a classification function/rule which assigns each object/patient to a unique group with the greatest possible accuracy (i.e., the lowest classification error). Especially in gene expression experiments, many variables (genes) are often measured for only a few objects/patients. A suitable approach is the well-known method PLS-DA, which searches for a transformation to a lower dimensional space. The resulting new components are linear combinations of the original variables. An advancement of PLS-DA leads to PPLS-DA, which introduces a so-called ‘power parameter’ that is chosen to maximize the correlation between the components and the group membership. We introduce an extension of PPLS-DA that optimizes this power parameter towards the final aim, namely a minimal classification error. We compare this new extension with the original PPLS-DA and also with the ordinary PLS-DA using simulated and experimental datasets. For the investigated data sets with weak linear dependency between features/variables, no improvement is shown for PPLS-DA and for the extensions compared to PLS-DA. With a very weak linear dependency and a low proportion of differentially expressed genes in simulated data, PPLS-DA does not improve over PLS-DA, but our extension shows a lower prediction error. In contrast, for the data set with strong between-feature collinearity, a low proportion of differentially expressed genes, and a large total number of genes, the prediction error of PPLS-DA and the extensions is clearly lower than for PLS-DA. Moreover, we compare these prediction results with results of support vector machines with linear kernel and linear discriminant analysis.

455. Telaar A, Repsilber D, Nürnberg G. Biomarker discovery: classification using pooled samples. Comput Stat 2013. [DOI: 10.1007/s00180-011-0302-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/28/2022]

456. Papillon-Cavanagh S, De Jay N, Hachem N, Olsen C, Bontempi G, Aerts HJWL, Quackenbush J, Haibe-Kains B. Comparison and validation of genomic predictors for anticancer drug sensitivity. J Am Med Inform Assoc 2013; 20:597-602. [PMID: 23355484 DOI: 10.1136/amiajnl-2012-001442] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Indexed: 01/13/2023] Open
Abstract
BACKGROUND An enduring challenge in personalized medicine lies in selecting the right drug for each individual patient. While testing of drugs on patients in large trials is the only way to assess their clinical efficacy and toxicity, we dramatically lack resources to test the hundreds of drugs currently under development. Therefore the use of preclinical model systems has been intensively investigated as this approach enables response to hundreds of drugs to be tested in multiple cell lines in parallel. METHODS Two large-scale pharmacogenomic studies recently screened multiple anticancer drugs on over 1000 cell lines. We propose to combine these datasets to build and robustly validate genomic predictors of drug response. We compared five different approaches for building predictors of increasing complexity. We assessed their performance in cross-validation and in two large validation sets, one containing the same cell lines present in the training set and another dataset composed of cell lines that have never been used during the training phase. RESULTS Sixteen drugs were found in common between the datasets. We were able to validate multivariate predictors for three out of the 16 tested drugs, namely irinotecan, PD-0325901, and PLX4720. Moreover, we observed that response to 17-AAG, an inhibitor of Hsp90, could be efficiently predicted by the expression level of a single gene, NQO1. CONCLUSION These results suggest that genomic predictors could be robustly validated for specific drugs. If successfully validated in patients' tumor cells, and subsequently in clinical trials, they could act as companion tests for the corresponding drugs and play an important role in personalized medicine.
Affiliation(s)
- Simon Papillon-Cavanagh
- Bioinformatics and Computational Genomics Laboratory, Institut de recherches cliniques de Montréal, University of Montreal, Montreal, Quebec, Canada

457. Gadegbeku CA, Gipson DS, Holzman LB, Ojo AO, Song PXK, Barisoni L, Sampson MG, Kopp JB, Lemley KV, Nelson PJ, Lienczewski CC, Adler SG, Appel GB, Cattran DC, Choi MJ, Contreras G, Dell KM, Fervenza FC, Gibson KL, Greenbaum LA, Hernandez JD, Hewitt SM, Hingorani SR, Hladunewich M, Hogan MC, Hogan SL, Kaskel FJ, Lieske JC, Meyers KEC, Nachman PH, Nast CC, Neu AM, Reich HN, Sedor JR, Sethna CB, Trachtman H, Tuttle KR, Zhdanova O, Zilleruelo GE, Kretzler M. Design of the Nephrotic Syndrome Study Network (NEPTUNE) to evaluate primary glomerular nephropathy by a multidisciplinary approach. Kidney Int 2013; 83:749-56. [PMID: 23325076 PMCID: PMC3612359 DOI: 10.1038/ki.2012.428] [Citation(s) in RCA: 255] [Impact Index Per Article: 21.3] [Indexed: 01/14/2023]
Abstract
The Nephrotic Syndrome Study Network (NEPTUNE) is a North American multicenter collaborative consortium established to develop a translational research infrastructure for nephrotic syndrome. This includes a longitudinal observational cohort study, a pilot and ancillary study program, a training program, and a patient contact registry. NEPTUNE will enroll 450 adults and children with minimal change disease, focal segmental glomerulosclerosis, and membranous nephropathy for detailed clinical, histopathological, and molecular phenotyping at the time of clinically indicated renal biopsy. Initial visits will include an extensive clinical history, physical examination, collection of urine, blood and renal tissue samples, and assessments of quality of life and patient-reported outcomes. Follow-up history, physical measures, urine and blood samples, and questionnaires will be obtained every 4 months in the first year and biannually, thereafter. Molecular profiles and gene expression data will be linked to phenotypic, genetic, and digitalized histological data for comprehensive analyses using systems biology approaches. Analytical strategies were designed to transform descriptive information to mechanistic disease classification for nephrotic syndrome and to identify clinical, histological, and genomic disease predictors. Thus, understanding the complexity of the disease pathogenesis will guide further investigation for targeted therapeutic strategies.

458. Song L, Langfelder P, Horvath S. Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 2013; 14:5. [PMID: 23323760 PMCID: PMC3645958 DOI: 10.1186/1471-2105-14-5] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.5] [Received: 10/01/2012] [Accepted: 01/03/2013] [Indexed: 01/13/2023] Open
Abstract
BACKGROUND Ensemble predictors such as the random forest are known to have superior accuracy but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is very interpretable especially when forward feature selection is used to construct the model. However, forward feature selection tends to overfit the data and leads to low predictive accuracy. Therefore, it remains an important research goal to combine the advantages of ensemble predictors (high accuracy) with the advantages of forward regression modeling (interpretability). To address this goal several articles have explored GLM based ensemble predictors. Since limited evaluations suggested that these ensemble predictors were less accurate than alternative predictors, they have found little attention in the literature. RESULTS Comprehensive evaluations involving hundreds of genomic data sets, the UCI machine learning benchmark data, and simulations are used to give GLM based ensemble predictors a new and careful look. A novel bootstrap aggregated (bagged) GLM predictor that incorporates several elements of randomness and instability (random subspace method, optional interaction terms, forward variable selection) often outperforms a host of alternative prediction methods including random forests and penalized regression models (ridge regression, elastic net, lasso). This random generalized linear model (RGLM) predictor provides variable importance measures that can be used to define a "thinned" ensemble predictor (involving few features) that retains excellent predictive accuracy. CONCLUSION RGLM is a state of the art predictor that shares the advantages of a random forest (excellent predictive accuracy, feature importance measures, out-of-bag estimates of accuracy) with those of a forward selected generalized linear model (interpretability). These methods are implemented in the freely available R software package randomGLM.
Affiliation(s)
- Lin Song
- Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California, USA

459. Rajapakse JC, Mundra PA. Multiclass gene selection using Pareto-fronts. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:87-97. [PMID: 23702546 DOI: 10.1109/tcbb.2013.1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 06/02/2023]
Abstract
Filter methods are often used for selection of genes in multiclass sample classification by using microarray data. Such techniques usually tend to bias toward a few classes that are easily distinguishable from other classes due to imbalances of strong features and sample sizes of different classes. It could therefore lead to selection of redundant genes while missing the relevant genes, leading to poor classification of tissue samples. In this manuscript, we propose to decompose multiclass ranking statistics into class-specific statistics and then use Pareto-front analysis for selection of genes. This alleviates the bias induced by class intrinsic characteristics of dominating classes. The use of Pareto-front analysis is demonstrated on two filter criteria commonly used for gene selection: F-score and KW-score. A significant improvement in classification performance and reduction in redundancy among top-ranked genes were achieved in experiments with both synthetic and real-benchmark data sets.
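The idea of selecting genes on a Pareto front of class-specific statistics can be sketched as a plain non-dominated filter. The abstract does not say how the class-specific statistics are derived from the F-score or KW-score, nor how many fronts are kept, so the scores, gene names, and function names below are hypothetical and the code is only a generic illustration of Pareto dominance, not the authors' procedure.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b in every class
    and strictly better in at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the genes whose class-specific score vectors are not
    dominated by any other gene, sorted by name."""
    return sorted(
        gene
        for gene, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != gene)
    )


# Hypothetical class-specific statistics for a two-class-specific ranking.
scores = {
    "g1": (9.0, 1.0),  # strong for class A only
    "g2": (1.0, 8.0),  # strong for class B only
    "g3": (5.0, 5.0),  # moderate for both
    "g4": (4.0, 4.0),  # dominated by g3
    "g5": (8.0, 0.5),  # dominated by g1
}

print(pareto_front(scores))
```

A single combined ranking would tend to favor genes like g1 and g5 that separate one easy class; the non-dominated set keeps g2 and g3 as well, which is the kind of bias reduction the abstract describes.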

460. Wagaman A. Efficient k-NN graph construction for graphs on variables. Stat Anal Data Min 2013. [DOI: 10.1002/sam.11186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]

461. Discriminant and Class-Modelling Chemometric Techniques for Food PDO Verification. Food Protected Designation of Origin - Methodologies and Applications 2013. [DOI: 10.1016/b978-0-444-59562-1.00013-x] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 12/03/2022]

462

463

464. Alonso-Betanzos A, Bolón-Canedo V, Fernández-Francos D, Porto-Díaz I, Sánchez-Maroño N. Up-to-Date Feature Selection Methods for Scalable and Efficient Machine Learning. Efficiency and Scalability Methods for Computational Intellect 2013:1-26. [DOI: 10.4018/978-1-4666-3942-3.ch001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/04/2025]
Abstract
With the advent of high dimensionality, machine learning researchers are now interested not only in accuracy, but also in scalability of algorithms. When dealing with large databases, pre-processing techniques are required to reduce input dimensionality and machine learning can take advantage of feature selection, which consists of selecting the relevant features and discarding irrelevant ones with a minimum degradation in performance. In this chapter, we will review the most up-to-date feature selection methods, focusing on their scalability properties. Moreover, we will show how these learning methods are enhanced when applied to large scale datasets and, finally, some examples of the application of feature selection in real world databases will be shown.

465

466

467. Wang Y, Zhou Y, Li Y, Ling Z, Zhu Y, Guo X, Sun H. An improved dimensionality reduction method for meta-transcriptome indexing based diseases classification. BMC Systems Biology 2012; 6 Suppl 3:S12. [PMID: 23281712 PMCID: PMC3524076 DOI: 10.1186/1752-0509-6-s3-s12] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 12/11/2022]
Abstract
BACKGROUND Bacterial 16S ribosomal RNA profiling has been widely used in the classification of microbiota-associated diseases. Dimensionality reduction is among the keys in mining high-dimensional 16S rRNAs' expression data. High levels of sparsity and redundancy are common in 16S rRNA gene microbial surveys. Traditional feature selection methods are generally restricted to measuring correlated abundances, and are limited in discrimination when so few microbes are actually shared across communities. RESULTS Here we present a Feature Merging and Selection algorithm (FMS) to deal with 16S rRNAs' expression data. By integrating the Linear Discriminant Analysis method, FMS can reduce the feature dimension with higher accuracy and preserve the relationship between different features as well. Two 16S rRNAs' expression datasets of pneumonia and dental decay patients were used to test the validity of the algorithm. Combined with SVM, FMS discriminated different classes of both pneumonia and dental caries better than other popular feature selection methods. CONCLUSIONS FMS projects data into a lower dimension while preserving enough features, and thus improves the intelligibility of the result. The results showed that FMS is a more valid and reliable method for feature reduction.
Affiliation(s)
- Yin Wang
- College of Life Science and Biotechnology, Shanghai Jiaotong University, 800 Dongchuan Road, Shanghai 200240, China
- Yuhua Zhou
- Department of Medical Microbiology and Parasitology, Institutes of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai 200240, China
- Yixue Li
- College of Life Science and Biotechnology, Shanghai Jiaotong University, 800 Dongchuan Road, Shanghai 200240, China
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Shanghai Center for Bioinformation Technology, Shanghai 200235, China
- Zongxin Ling
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, the First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, Zhejiang 310003, China
- Yan Zhu
- Department of Cardiology, Gansu Provincial Hospital, Lanzhou 730000, China
- Xiaokui Guo
- Department of Medical Microbiology and Parasitology, Institutes of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai 200240, China
- Hong Sun
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Shanghai Center for Bioinformation Technology, Shanghai 200235, China

468. Fan Y, Tang CY. Tuning parameter selection in high dimensional penalized likelihood. J R Stat Soc Series B Stat Methodol 2012. [DOI: 10.1111/rssb.12001] [Citation(s) in RCA: 157] [Impact Index Per Article: 12.1] [Indexed: 11/29/2022]
Affiliation(s)
- Yingying Fan
- University of Southern California; Los Angeles; USA

469. Chen CK. The classification of cancer stage microarray data. Computer Methods and Programs in Biomedicine 2012; 108:1070-1077. [PMID: 22925656 DOI: 10.1016/j.cmpb.2012.07.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 03/30/2012] [Revised: 06/20/2012] [Accepted: 07/17/2012] [Indexed: 06/01/2023]
Abstract
Correctly diagnosing the cancer stage is most important for selecting an appropriate cancer treatment option for a patient. Recent advances in microarray technology allow the cancer stage to be predicted using gene expression patterns. The cancer stage is in ordinal scale. In this paper, we employ strict ordinal regressions including cumulative logit model in traditional statistics with data dimensionality reduction, and distribution free approaches of large margin rank boundaries implemented by the support vector machine, as well as an ensemble ranking scheme to model the cancer stage using gene expression microarray data. Predictive genes included in models are selected by univariate feature ranking, and recursive feature elimination. We perform cross-validation experiments to assess and compare classification accuracies of ordinal and non-ordinal algorithms on five cancer stage microarray datasets. We conclude that a strict ordinal classifier trained by a validated approach can predict the cancer stage more accurately than traditional non-ordinal classifiers without considering the order of cancer stages.
Affiliation(s)
- Chi-Kan Chen
- Department of Applied Mathematics, National Chung Hsing University, Taiwan.

470. A model selection criterion for discriminant analysis of high-dimensional data with fewer observations. J Stat Plan Inference 2012. [DOI: 10.1016/j.jspi.2012.06.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]

471. Hochrein J, Klein MS, Zacharias HU, Li J, Wijffels G, Schirra HJ, Spang R, Oefner PJ, Gronwald W. Performance Evaluation of Algorithms for the Classification of Metabolic 1H NMR Fingerprints. J Proteome Res 2012; 11:6242-51. [DOI: 10.1021/pr3009034] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Indexed: 01/16/2023]
Affiliation(s)
- Jochen Hochrein
- Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany
- Matthias S. Klein
- Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany
- Helena U. Zacharias
- Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany
- Juan Li
- CSIRO Livestock Industries, Queensland Bioscience Precinct, 306 Carmody Rd., St. Lucia, QLD 4067, Australia
- Gene Wijffels
- CSIRO Livestock Industries, Queensland Bioscience Precinct, 306 Carmody Rd., St. Lucia, QLD 4067, Australia
- Horst Joachim Schirra
- Centre for Advanced Imaging, The University of Queensland, Brisbane, QLD 4072, Australia
- Rainer Spang
- Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany
- Peter J. Oefner
- Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany
- Wolfram Gronwald
- Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany

472. Sanz-Pamplona R, Berenguer A, Cordero D, Riccadonna S, Solé X, Crous-Bou M, Guinó E, Sanjuan X, Biondo S, Soriano A, Jurman G, Capella G, Furlanello C, Moreno V. Clinical value of prognosis gene expression signatures in colorectal cancer: a systematic review. PLoS One 2012; 7:e48877. [PMID: 23145004 PMCID: PMC3492249 DOI: 10.1371/journal.pone.0048877] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.5] [Received: 05/30/2012] [Accepted: 10/02/2012] [Indexed: 12/01/2022] Open
Abstract
Introduction The traditional staging system is inadequate to identify those patients with stage II colorectal cancer (CRC) at high risk of recurrence or with stage III CRC at low risk. A number of gene expression signatures to predict CRC prognosis have been proposed, but none is routinely used in the clinic. The aim of this work was to assess the prediction ability and potential clinical usefulness of these signatures in a series of independent datasets. Methods A literature review identified 31 gene expression signatures that used gene expression data to predict prognosis in CRC tissue. The search was based on the PubMed database and was restricted to papers published from January 2004 to December 2011. Eleven CRC gene expression datasets with outcome information were identified and downloaded from public repositories. A Random Forest classifier was used to build predictors from the gene lists. The Matthews correlation coefficient was chosen as a measure of classification accuracy and its associated p-value was used to assess association with prognosis. For clinical usefulness evaluation, positive and negative post-test probabilities were computed in stage II and III samples. Results Five gene signatures showed significant association with prognosis and provided reasonable prediction accuracy in their own training datasets. Nevertheless, all signatures showed low reproducibility in independent data. Stratified analyses by stage or microsatellite instability status showed significant association but limited discrimination ability, especially in stage II tumors. From a clinical perspective, the most predictive signatures showed a minor but significant improvement over the classical staging system. Conclusions The published signatures show low prediction accuracy but moderate clinical usefulness. Although gene expression data may inform prognosis, better strategies for signature validation are needed to encourage their widespread use in the clinic.
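For readers unfamiliar with the Matthews correlation coefficient used in this review, a minimal sketch of the standard binary formula follows. The function name, the confusion-matrix counts, and the choice to return 0 when the denominator vanishes are illustrative assumptions; the abstract does not describe how the authors computed the coefficient or its p-value.

```python
from math import sqrt


def matthews_corrcoef(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention,
    assumed here), since the coefficient is undefined in that case.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: perfect, fully inverted, and "always predict negative".
print(matthews_corrcoef(tp=5, tn=5, fp=0, fn=0))
print(matthews_corrcoef(tp=0, tn=0, fp=5, fn=5))
print(matthews_corrcoef(tp=0, tn=90, fp=0, fn=10))
```

The last case has 90% raw accuracy but a coefficient of 0, which is why the measure is often preferred over accuracy when outcome classes are unbalanced.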
Affiliation(s)
- Rebeca Sanz-Pamplona
- Unit of Biomarkers and Susceptibility (UBS), Catalan Institute of Oncology (ICO), Bellvitge Biomedical Research Institute (IDIBELL), and CIBERESP, L'Hospitalet de Llobregat, Barcelona, Spain

473. Srivastava MS, Reid N. Testing the structure of the covariance matrix with fewer observations than the dimension. J Multivariate Anal 2012. [DOI: 10.1016/j.jmva.2012.06.004] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Indexed: 11/28/2022]

474. Kandaswamy KK, Pugalenthi G, Kalies KU, Hartmann E, Martinetz T. EcmPred: prediction of extracellular matrix proteins based on random forest with maximum relevance minimum redundancy feature selection. J Theor Biol 2012; 317:377-83. [PMID: 23123454 DOI: 10.1016/j.jtbi.2012.10.015] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 01/18/2012] [Revised: 10/08/2012] [Accepted: 10/09/2012] [Indexed: 12/11/2022]
Abstract
The extracellular matrix (ECM) is a major component of tissues of multicellular organisms. It consists of secreted macromolecules, mainly polysaccharides and glycoproteins. Malfunctions of ECM proteins lead to severe disorders such as Marfan syndrome, osteogenesis imperfecta, numerous chondrodysplasias, and skin diseases. In this work, we report a random forest approach, EcmPred, for the prediction of ECM proteins from protein sequences. EcmPred was trained on a dataset containing 300 ECM and 300 non-ECM proteins and tested on a dataset containing 145 ECM and 4187 non-ECM proteins. EcmPred achieved 83% accuracy on the training and 77% on the test dataset. EcmPred predicted 15 out of 20 experimentally verified ECM proteins. By scanning the entire human proteome, we predicted novel ECM proteins validated with gene ontology and InterPro. The dataset and standalone version of the EcmPred software are available at http://www.inb.uni-luebeck.de/tools-demos/Extracellular_matrix_proteins/EcmPred.
|
475
|
Wu MY, Dai DQ, Shi Y, Yan H, Zhang XF. Biomarker identification and cancer classification based on microarray data using Laplace naive Bayes model with mean shrinkage. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:1649-1662. [PMID: 22868679 DOI: 10.1109/tcbb.2012.105] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 06/01/2023]
Abstract
Biomarker identification and cancer classification are two closely related problems. In gene expression data sets, the correlation between genes can be high when they share the same biological pathway. Moreover, the gene expression data sets may contain outliers due to either chemical or electrical reasons. A good gene selection method should take group effects into account and be robust to outliers. In this paper, we propose a Laplace naive Bayes model with mean shrinkage (LNB-MS). The Laplace distribution, instead of the normal distribution, is used as the conditional distribution of the samples because it is less sensitive to outliers and has been applied in many fields. The key technique is the L1 penalty imposed on the mean of each class to achieve automatic feature selection. The objective function of the proposed model is a piecewise linear function with respect to the mean of each class, so its optimal value can be found simply by evaluating it at the breakpoints. An efficient algorithm is designed to estimate the parameters in the model. A new strategy that uses the number of selected features to control the regularization parameter is introduced. Experimental results on simulated data sets and 17 publicly available cancer data sets attest to the accuracy, sparsity, efficiency, and robustness of the proposed algorithm. Many biomarkers identified with our method have been verified in biochemical or biomedical research. The analysis of biological and functional correlation of the genes based on Gene Ontology (GO) terms shows that the proposed method guarantees the selection of highly correlated genes simultaneously.
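The remark that a piecewise linear objective can be optimized by checking its breakpoints can be sketched in one dimension. This is a simplified illustration, not the authors' LNB-MS algorithm: it assumes a single feature, a single class, unit Laplace scale, and a penalty that shrinks the location toward zero. The function name and the inputs in the usage note are hypothetical.

```python
def l1_shrunken_location(xs, lam):
    """Minimize sum_i |x_i - mu| + lam * |mu| over mu.

    The objective is convex and piecewise linear, so a minimizer
    lies at one of its breakpoints: a data value or zero.
    """
    def objective(mu):
        return sum(abs(x - mu) for x in xs) + lam * abs(mu)

    breakpoints = sorted(set(xs) | {0.0})
    # On ties, min() returns the smallest breakpoint attaining the optimum.
    return min(breakpoints, key=objective)
```

For `xs = [1.0, 2.0, 3.0]`, a zero penalty returns the median, while a large penalty such as `lam = 5` returns exactly `0.0`, which is how an L1 penalty of this kind drops a feature.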
Affiliation(s)
- Meng-Yun Wu
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China.
|
476
|
Ameling S, Herda LR, Hammer E, Steil L, Teumer A, Trimpert C, Dörr M, Kroemer HK, Klingel K, Kandolf R, Völker U, Felix SB. Myocardial gene expression profiles and cardiodepressant autoantibodies predict response of patients with dilated cardiomyopathy to immunoadsorption therapy. Eur Heart J 2012; 34:666-75. [PMID: 23100283 PMCID: PMC3584995 DOI: 10.1093/eurheartj/ehs330] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.7] [Indexed: 12/31/2022] Open
Abstract
Aims Immunoadsorption with subsequent immunoglobulin G substitution (IA/IgG) represents a novel therapeutic approach in the treatment of dilated cardiomyopathy (DCM) which leads to the improvement of left ventricular ejection fraction (LVEF). However, response to this therapeutic intervention shows wide inter-individual variability. In this pilot study, we tested the value of clinical, biochemical, and molecular parameters for the prediction of the response of patients with DCM to IA/IgG. Methods and results Forty DCM patients underwent endomyocardial biopsies (EMBs) before IA/IgG. In eight patients with normal LVEF (controls), EMBs were obtained for clinical reasons. Clinical parameters, negative inotropic activity (NIA) of antibodies on isolated rat cardiomyocytes, and gene expression profiles of EMBs were analysed. Dilated cardiomyopathy patients displaying improvement of LVEF (≥20% relative and ≥5% absolute) 6 months after IA/IgG were considered responders. Compared with non-responders (n = 16), responders (n = 24) displayed shorter disease duration (P = 0.006), smaller LV internal diameter in diastole (P = 0.019), and stronger NIA of antibodies. Antibodies obtained from controls were devoid of NIA. Myocardial gene expression patterns were different in responders and non-responders for genes of oxidative phosphorylation, mitochondrial dysfunction, hypertrophy, and ubiquitin–proteasome pathway. The integration of scores of NIA and expression levels of four genes allowed robust discrimination of responders from non-responders at baseline (BL) [sensitivity of 100% (95% CI 85.8–100%); specificity up to 100% (95% CI 79.4–100%); cut-off value: −0.28] and was superior to scores derived from antibodies, gene expression, or clinical parameters only.
Conclusion Combined assessment of NIA of antibodies and gene expression patterns of DCM patients at BL predicts response to IA/IgG therapy and may enable appropriate selection of patients who benefit from this therapeutic intervention.
Affiliation(s)
- Sabine Ameling
- Interfakultäres Institut für Genetik und Funktionelle Genomforschung, Universitätsmedizin Greifswald, Friedrich-Ludwig-Jahn-Strasse 15a, Greifswald D - 17487, Germany
|
477
|
Computational gene mapping to analyze continuous automated physiologic monitoring data in neuro-trauma intensive care. J Trauma Acute Care Surg 2012; 73:419-24; discussion 424-5. [PMID: 22846949 DOI: 10.1097/ta.0b013e31825ff59a] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
Abstract
BACKGROUND We asked whether the advanced machine learning applications used in microarray gene profiling could assess critical thresholds in the massive databases generated by continuous electronic physiologic vital signs (VS) monitoring in the neuro-trauma intensive care unit. METHODS We used Class Prediction Analysis to predict binary outcomes (life/death, good/bad Extended Glasgow Outcome Score, etc.) based on data accrued within 12, 24, 48, and 72 hours after admission to the neuro-trauma intensive care unit. Univariate analyses selected "features," discriminator VS segments or "genes," in each individual's data set. Prediction models using these selected features were then constructed using six different statistical modeling techniques to predict outcome for other individuals in the sample cohort based on the selected features of each individual then cross-validated with a leave-one-out method. RESULTS We gleaned complete sets of 588 VS monitoring segment features for each of four periods and outcomes from 52 of 60 patients with severe traumatic brain injury who met study inclusion criteria. Overall, intracranial pressures and blood pressures over time (e.g., intracranial pressure >20 mm Hg for 20 minutes) provided the best discrimination for outcomes. Modeling performed best in the first 12 hours of care and for mortality. The mean number of selected features included 76 predicting 14-day hospital stay in that period, 11 predicting mortality, and 4 predicting 3-month Extended Glasgow Outcome Score. Four of the six techniques constructed models that correctly identified mortality by 12 hours 75% of the time or higher. 
CONCLUSION Our results suggest that valid prediction models after severe traumatic brain injury can be constructed using gene mapping techniques to analyze large data sets from conventional electronic monitoring data, but that this methodology needs validation in larger data sets, and that additional unstructured learning techniques may also prove useful.
|
478
|
Combining multiple hypothesis testing and affinity propagation clustering leads to accurate, robust and sample size independent classification on gene expression data. BMC Bioinformatics 2012; 13:270. [PMID: 23075381 PMCID: PMC3542193 DOI: 10.1186/1471-2105-13-270] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 01/03/2012] [Accepted: 09/18/2012] [Indexed: 01/19/2023] Open
Abstract
Background A feature selection method in microarray gene expression data should be independent of platform, disease and dataset size. Our hypothesis is that among the statistically significant ranked genes in a gene list, there should be clusters of genes that share similar biological functions related to the investigated disease. Thus, instead of keeping N top ranked genes, it would be more appropriate to define and keep a number of gene cluster exemplars. Results We propose a hybrid FS method (mAP-KL), which combines multiple hypothesis testing and affinity propagation (AP)-clustering algorithm along with the Krzanowski & Lai cluster quality index, to select a small yet informative subset of genes. We applied mAP-KL on real microarray data, as well as on simulated data, and compared its performance against 13 other feature selection approaches. Across a variety of diseases and numbers of samples, mAP-KL presents competitive classification results, particularly in neuromuscular diseases, where its overall AUC score was 0.91. Furthermore, mAP-KL generates concise yet biologically relevant and informative N-gene expression signatures, which can serve as a valuable tool for diagnostic and prognostic purposes, as well as a source of potential disease biomarkers in a broad range of diseases. Conclusions mAP-KL is a data-driven and classifier-independent hybrid feature selection method, which applies to any disease classification problem based on microarray data, regardless of the available samples. Combining multiple hypothesis testing and AP leads to subsets of genes that classify unknown samples from both small and large patient cohorts with high accuracy.
|
479
|
Miguéis VL, Van den Poel D, Camanho AS, Falcão e Cunha J. Predicting partial customer churn using Markov for discrimination for modeling first purchase sequences. ADV DATA ANAL CLASSI 2012. [DOI: 10.1007/s11634-012-0121-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 12/26/2022]
|
480
|
Use of gene expression data for predicting continuous phenotypes for animal production and breeding. Animal 2012; 2:1413-20. [PMID: 22443898 DOI: 10.1017/s1751731108002632] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 12/12/2022] Open
Abstract
Traits such as disease resistance are costly to evaluate and slow to improve using current methods. Analysis of gene expression profiles (e.g. DNA microarrays) has potential for predicting such phenotypes and has been used in an analogous way to classify cancer types in human patients. However, doubts have been raised regarding the use of classification methods with microarray data for this purpose. Here we propose a method using random regression with cross validation, which accounts for the distribution of variation in the trait and utilises different subsets of patients or animals to perform a complete validation of predictive ability. Published breast tumour data were used to test the method. Despite the small dataset (n < 100), the new approach resulted in a moderate but significant correlation between the predicted and actual phenotypes (0.32). Binary classification of the predicted phenotypes yielded similar classification error rates to those found by other authors (35%). Unlike other methods, the new method gave a quantitative estimate of phenotype that could be used to rank animals and select those with extreme phenotypic performance. Use of the method in an optimal way using larger sample sizes, and combining DNA microarrays and other testing platforms, is recommended.
|
481
|
Nikolovski N, Rubtsov D, Segura MP, Miles GP, Stevens TJ, Dunkley TP, Munro S, Lilley KS, Dupree P. Putative glycosyltransferases and other plant Golgi apparatus proteins are revealed by LOPIT proteomics. PLANT PHYSIOLOGY 2012; 160:1037-51. [PMID: 22923678 PMCID: PMC3461528 DOI: 10.1104/pp.112.204263] [Citation(s) in RCA: 125] [Impact Index Per Article: 9.6] [Received: 07/27/2012] [Accepted: 08/22/2012] [Indexed: 05/18/2023]
Abstract
The Golgi apparatus is the central organelle in the secretory pathway and plays key roles in glycosylation, protein sorting, and secretion in plants. Enzymes involved in the biosynthesis of complex polysaccharides, glycoproteins, and glycolipids are located in this organelle, but the majority of them remain uncharacterized. Here, we studied the Arabidopsis (Arabidopsis thaliana) membrane proteome with a focus on the Golgi apparatus using localization of organelle proteins by isotope tagging. By applying multivariate data analysis to a combined data set of two new and two previously published localization of organelle proteins by isotope tagging experiments, we identified the subcellular localization of 1,110 proteins with high confidence. These include 197 Golgi apparatus proteins, 79 of which have not been localized previously by a high-confidence method, as well as the localization of 304 endoplasmic reticulum and 208 plasma membrane proteins. Comparison of the hydrophobic domains of the localized proteins showed that the single-span transmembrane domains have unique properties in each organelle. Many of the novel Golgi-localized proteins belong to uncharacterized protein families. Structure-based homology analysis identified 12 putative Golgi glycosyltransferase (GT) families that have no functionally characterized members and, therefore, are not yet assigned to a Carbohydrate-Active Enzymes database GT family. The substantial numbers of these putative GTs lead us to estimate that the true number of plant Golgi GTs might be one-third above those currently annotated. Other newly identified proteins are likely to be involved in the transport and interconversion of nucleotide sugar substrates as well as polysaccharide and protein modification.
|
482
|
Yang J, Miescke K, McCullagh P. Classification based on a permanental process with cyclic approximation. Biometrika 2012. [DOI: 10.1093/biomet/ass047] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022] Open
|
483
|
Modelling Forest α-Diversity and Floristic Composition — On the Added Value of LiDAR plus Hyperspectral Remote Sensing. REMOTE SENSING 2012. [DOI: 10.3390/rs4092818] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.0] [Indexed: 11/16/2022]
|
484
|
Jiang W, Chen BE. Estimating prediction error in microarray classification: Modifications of the 0.632+ bootstrap when ${\bf n} < {\bf p}$. CAN J STAT 2012. [DOI: 10.1002/cjs.11158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
|
485
|
Optimal gene subset selection using the modified SFFS algorithm for tumor classification. Neural Comput Appl 2012. [DOI: 10.1007/s00521-012-1148-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 10/27/2022]
|
486
|
Hanczar B, Bar-Hen A. A new measure of classifier performance for gene expression data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:1379-1386. [PMID: 22291161 DOI: 10.1109/tcbb.2012.21] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/31/2023]
Abstract
One of the major aims of many microarray experiments is to build discriminatory diagnosis and prognosis models. A large number of supervised methods have been proposed in the literature for microarray-based classification for this purpose. Model evaluation and comparison is a critical issue and, most of the time, is based on the classification cost. This classification cost is based on the costs of false positives and false negatives, which are generally unknown in diagnostic problems. This uncertainty may strongly affect the evaluation and comparison of the classifiers. We propose a new measure of classifier performance that takes account of the uncertainty of the error. We represent the available knowledge about the costs by a distribution function defined on the ratio of the costs. The performance of a classifier is therefore computed over the set of all possible costs weighted by their probability distribution. Our method is tested on both artificial and real microarray data sets. We show that the performance of classifiers depends strongly on the ratio of the classification costs. In many cases, the best classifier can be identified by our new measure whereas the classic error measures fail.
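The idea of averaging a classifier's cost over a distribution on the cost ratio can be sketched as follows. This is only an illustration of the general principle under stated assumptions, not the measure defined in the paper: the ratio is taken as cost(false negative) / cost(false positive), the distribution is discrete, and the function name and all numbers in the usage note are hypothetical.

```python
def expected_cost(fp, fn, n, ratio_dist):
    """Average per-sample cost over an uncertain cost ratio.

    fp, fn     -- counts of false positives and false negatives
    n          -- total number of classified samples
    ratio_dist -- list of (ratio, weight) pairs, where
                  ratio = cost(false negative) / cost(false positive)
    """
    total_weight = sum(w for _, w in ratio_dist)
    weighted = sum(w * (fp + r * fn) / n for r, w in ratio_dist)
    return weighted / total_weight
```

Two classifiers with the same error rate can rank differently under this average. With 100 samples, classifier A (`fp=10, fn=2`) and classifier B (`fp=4, fn=8`) both make 12 errors, yet under an assumed distribution `[(1, 0.2), (5, 0.8)]` A has the lower expected cost because it makes fewer of the errors that are probably more expensive.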
|
487
|
Liu S, Mundra PA, Rajapakse JC. Features for cells and nuclei classification. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2012; 2011:6601-4. [PMID: 22255852 DOI: 10.1109/iembs.2011.6091628] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
Abstract
The performance of automated analysis of cellular images is heavily influenced by the features that characterize cells or cell nuclei. In this paper, an exhaustive set of features, including morphological, topological, and texture features, is explored to determine the optimal features for classification of cells and cell nuclei. The optimal subset of features is obtained using popular feature selection methods. The results of feature selection indicate that Zernike moments, Daubechies wavelets, and Gabor wavelets give the most important features for the classification of cells or cell nuclei in fluorescent microscopy images.
Affiliation(s)
- Song Liu
- BioInformatics Research Centre, School of Computer Engineering, Nanyang Technological University, Singapore.
|
488
|
Zhou K, Ai C, Dong P, Fan X, Yang L. A novel model to predict O-glycosylation sites using a highly unbalanced dataset. Glycoconj J 2012; 29:551-64. [DOI: 10.1007/s10719-012-9434-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 03/26/2012] [Revised: 07/11/2012] [Accepted: 07/17/2012] [Indexed: 10/28/2022]
|
489
|
Jahandideh S, Srinivasasainagendra V, Zhi D. Comprehensive comparative analysis and identification of RNA-binding protein domains: multi-class classification and feature selection. J Theor Biol 2012; 312:65-75. [PMID: 22884576 DOI: 10.1016/j.jtbi.2012.07.013] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 02/29/2012] [Revised: 07/09/2012] [Accepted: 07/13/2012] [Indexed: 01/11/2023]
Abstract
RNA-protein interaction plays an important role in various cellular processes, such as protein synthesis, gene regulation, post-transcriptional gene regulation, alternative splicing, and infections by RNA viruses. In this study, using the Gene Ontology Annotated (GOA) and Structural Classification of Proteins (SCOP) databases, an automatic procedure was designed to capture structurally solved RNA-binding protein domains in different subclasses. Subsequently, we applied tuned multi-class SVM (TMCSVM), Random Forest (RF), and multi-class ℓ1/ℓq-regularized logistic regression (MCRLR) to analyze and classify RNA-binding protein domains based on a comprehensive set of sequence and structural features. In this study, we compared the prediction accuracy of three different state-of-the-art predictor methods. In our results, TMCSVM outperforms the other methods, which suggests its potential as a useful tool for facilitating the multi-class prediction of RNA-binding protein domains. MCRLR, on the other hand, elucidates how much each feature contributes to the predictive accuracy for RNA-binding protein domain subclasses, and thereby provides some biological insights into the roles of sequences and structures in protein-RNA interactions.
Affiliation(s)
- Samad Jahandideh
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA.
- Vinodh Srinivasasainagendra
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA.
- Degui Zhi
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA.
|
490
|
Wang T, Xu PR, Zhu LX. Non-convex penalized estimation in high-dimensional models with single-index structure. J MULTIVARIATE ANAL 2012. [DOI: 10.1016/j.jmva.2012.03.009] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Indexed: 11/27/2022]
|
491
|
Wit EC, Bakewell DJG. Borrowing strength: a likelihood ratio test for related sparse signals. Bioinformatics 2012; 28:1980-9. [PMID: 22668791 DOI: 10.1093/bioinformatics/bts316] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Cancer biology is a field where the complexity of the phenomena battles against the availability of data. Often only a few observations per signal source, i.e. genes, are available. Such scenarios are becoming increasingly relevant, as modern sensing technologies generally have no trouble measuring many channels while the number of subjects, such as patients or samples, remains limited. In statistics, this problem falls under the heading 'large p, small n'. Moreover, in such situations the use of asymptotic analytical results should generally be mistrusted. RESULTS We consider two cancer datasets, with the aim to mine the activity of functional groups of genes. We propose a hierarchical model with two layers in which the individual signals share a common variance component. A likelihood ratio test is defined for the difference between two collections of corresponding signals. The small number of observations requires a careful consideration of the bias of the statistic, which is corrected through an explicit Bartlett correction. The test is validated on Monte Carlo simulations, which show improved detection of differences compared with other methods. In a leukaemia study and a cancerous fibroblast cell line, we find that the method also works better in practice, i.e. it gives a richer picture of the underlying biology. AVAILABILITY The MATLAB code is available from the authors or on http://www.math.rug.nl/stat/Software. CONTACT e.c.wit@rug.nl d.bakewell@liv.ac.uk.
Affiliation(s)
- Ernst C Wit
- Johann Bernoulli Institute, University of Groningen, 9747 AG Groningen, The Netherlands.
|
492
|
Abstract
This paper is concerned with the selection and estimation of fixed and random effects in linear mixed effects models. We propose a class of nonconcave penalized profile likelihood methods for selecting and estimating important fixed effects. To overcome the difficulty of unknown covariance matrix of random effects, we propose to use a proxy matrix in the penalized profile likelihood. We establish conditions on the choice of the proxy matrix and show that the proposed procedure enjoys the model selection consistency where the number of fixed effects is allowed to grow exponentially with the sample size. We further propose a group variable selection strategy to simultaneously select and estimate important random effects, where the unknown covariance matrix of random effects is replaced with a proxy matrix. We prove that, with the proxy matrix appropriately chosen, the proposed procedure can identify all true random effects with asymptotic probability one, where the dimension of random effects vector is allowed to increase exponentially with the sample size. Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. We further illustrate the proposed procedures via a real data example.
Affiliation(s)
- Yingying Fan
- Information and Operations Management Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA
- Runze Li
- Department of Statistics and the Methodology Center, The Pennsylvania State University, University Park, PA 16802, USA
|
493
|
Wang SL, Li XL, Fang J. Finding minimum gene subsets with heuristic breadth-first search algorithm for robust tumor classification. BMC Bioinformatics 2012; 13:178. [PMID: 22830977 PMCID: PMC3465202 DOI: 10.1186/1471-2105-13-178] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 12/22/2011] [Accepted: 05/18/2012] [Indexed: 01/03/2023] Open
Abstract
Background Previous studies on tumor classification based on gene expression profiles suggest that gene selection plays a key role in improving the classification performance. Moreover, finding important tumor-related genes with the highest accuracy is a very important task because these genes might serve as tumor biomarkers, which is of great benefit to not only tumor molecular diagnosis but also drug development. Results This paper proposes a novel gene selection method with rich biomedical meaning based on the Heuristic Breadth-first Search Algorithm (HBSA) to find as many optimal gene subsets as possible. Due to the curse of dimensionality, this type of method could suffer from over-fitting and selection bias problems. To address these potential problems, an HBSA-based ensemble classifier is constructed by applying a majority voting strategy to the individual classifiers built from the selected gene subsets, and a novel HBSA-based gene ranking method is designed to find important tumor-related genes by measuring the significance of genes using their occurrence frequencies in the selected gene subsets. The experimental results on nine tumor datasets, including three pairs of cross-platform datasets, indicate that the proposed method can not only obtain better generalization performance but also find many important tumor-related genes. Conclusions It is found that the frequencies of the selected genes follow a power-law distribution, indicating that only a few top-ranked genes can be used as potential diagnosis biomarkers. Moreover, the top-ranked genes leading to very high prediction accuracy are closely related to specific tumor subtypes and are even hub genes. Compared with other related methods, the proposed method can achieve higher prediction accuracy with fewer genes. These findings are further supported by analyzing the top-ranked genes in the context of individual gene function, biological pathway, and protein-protein interaction network.
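The two post-processing steps described in the Results, ranking genes by how often they occur in the selected subsets and combining per-subset classifiers by majority vote, can be sketched as below. The search that produces the subsets (HBSA itself) is not shown, and the function names, gene identifiers, and labels in the usage note are hypothetical placeholders.

```python
from collections import Counter

def rank_genes_by_frequency(subsets):
    """Rank genes by the number of selected subsets they occur in.

    Returns (gene, count) pairs, most frequent first; ties are
    broken alphabetically so the ordering is deterministic.
    """
    counts = Counter(gene for subset in subsets for gene in set(subset))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def majority_vote(predictions):
    """Return the label predicted by the most individual classifiers.

    On a tie, the label that appears first in `predictions` wins.
    """
    return Counter(predictions).most_common(1)[0][0]
```

For example, with subsets `[["g1", "g2"], ["g1", "g3"], ["g1", "g2", "g4"]]` the ranking starts with `g1` (three occurrences) followed by `g2` (two), and `majority_vote(["tumor", "normal", "tumor"])` yields `"tumor"`.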
Affiliation(s)
- Shu-Lin Wang
- Applied Bioinformatics Laboratory, University of Kansas, 2034 Becker Drive, Lawrence, KS 66047, USA
|
494
|
Analysis of high dimensional data using pre-defined set and subset information, with applications to genomic data. BMC Bioinformatics 2012; 13:177. [PMID: 22827252 PMCID: PMC3443674 DOI: 10.1186/1471-2105-13-177] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/09/2011] [Accepted: 05/11/2012] [Indexed: 11/25/2022] Open
Abstract
Background Based on available biological information, genomic data can often be partitioned into pre-defined sets (e.g. pathways) and subsets within sets. Biologists are often interested in determining whether some pre-defined sets of variables (e.g. genes) are differentially expressed under varying experimental conditions. Several procedures are available in the literature for making such determinations; however, they do not take into account information regarding the subsets within each set. Secondly, variables (e.g. genes) belonging to a set or a subset are potentially correlated, yet such information is often ignored and univariate methods are used. This may result in loss of power and/or inflated false positive rate. Results We introduce a multiple testing-based methodology which makes use of available information regarding biologically relevant subsets within each pre-defined set of variables while exploiting the underlying dependence structure among the variables. Using this methodology, a biologist may not only determine whether a set of variables is differentially expressed between two experimental conditions, but may also test whether specific subsets within a significant set are also significant. Conclusions The proposed methodology (a) is easy to implement, (b) does not require inverting potentially singular covariance matrices, (c) controls the family-wise error rate (FWER) at the desired nominal level, and (d) is robust to the underlying distribution and covariance structures. Although the methodology is described for microarray gene expression data for simplicity of exposition, it is also applicable to any high dimensional data, such as mRNA-seq data and CpG methylation data.
|
495
|
van Vliet MH, Horlings HM, van de Vijver MJ, Reinders MJT, Wessels LFA. Integration of clinical and gene expression data has a synergetic effect on predicting breast cancer outcome. PLoS One 2012; 7:e40358. [PMID: 22808140 PMCID: PMC3394805 DOI: 10.1371/journal.pone.0040358] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Received: 01/08/2012] [Accepted: 06/06/2012] [Indexed: 12/12/2022] Open
Abstract
Breast cancer outcome can be predicted using models derived from gene expression data or clinical data. Only a few studies have created a single prediction model using both gene expression and clinical data. These studies often remain inconclusive regarding an obtained improvement in prediction performance. We rigorously compare three different integration strategies (early, intermediate, and late integration) as well as classifiers employing no integration (only one data type) using five classifiers of varying complexity. We perform our analysis on a set of 295 breast cancer samples, for which gene expression data and an extensive set of clinical parameters are available, as well as on four breast cancer datasets containing 521 samples that we used as independent validation. On the 295 samples, a nearest mean classifier employing a logical OR operation (late integration) on clinical and expression classifiers significantly outperforms all other classifiers. Moreover, regardless of the integration strategy, the nearest mean classifier achieves the best performance. All five classifiers achieve their best performance when integrating clinical and expression data. Repeating the experiments using the 521 samples from the four independent validation datasets also indicated a significant performance improvement when integrating clinical and gene expression data. Whether integration also improves performance on other datasets (e.g. other tumor types) has not been investigated, but seems worth pursuing. Our work suggests that future models for predicting breast cancer outcome should exploit both data types by employing a late OR or intermediate integration strategy based on nearest mean classifiers.
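Late integration with a logical OR can be sketched as two nearest mean classifiers, one per data type, whose binary outputs are combined afterwards. This is a minimal illustration under assumptions: squared Euclidean distance is used, a tie counts as a good-outcome prediction, and the function names, class means, and feature vectors are hypothetical rather than taken from the study.

```python
def nearest_mean_is_poor(x, mean_good, mean_poor):
    """Nearest mean classifier: True if x is closer to the
    poor-outcome class mean than to the good-outcome class mean."""
    d_good = sum((a - b) ** 2 for a, b in zip(x, mean_good))
    d_poor = sum((a - b) ** 2 for a, b in zip(x, mean_poor))
    return d_poor < d_good

def late_or_integration(clinical_is_poor, expression_is_poor):
    """Late integration: flag a poor outcome if either the clinical
    or the expression classifier predicts one."""
    return clinical_is_poor or expression_is_poor
```

A patient would be flagged as poor outcome when, for instance, the clinical classifier returns `False` but the expression classifier returns `True`; the OR rule trades some specificity for sensitivity compared with either classifier alone.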
Affiliation(s)
- Martin H van Vliet
- Delft Bioinformatics Laboratory, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg, Delft, The Netherlands.
|
496
|
Lazar C, Taminau J, Meganck S, Steenhoff D, Coletta A, Molter C, de Schaetzen V, Duque R, Bersini H, Nowé A. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1106-19. [PMID: 22350210 DOI: 10.1109/tcbb.2012.33] [Citation(s) in RCA: 219] [Impact Index Per Article: 16.8] [Indexed: 05/20/2023]
Abstract
A plenitude of feature selection (FS) methods is available in the literature, most of them arising from the need to analyze data of very high dimension, usually hundreds or thousands of variables. Such data sets are now available in various application areas like combinatorial chemistry, text mining, multivariate imaging, or bioinformatics. As a generally accepted rule, these methods are grouped into filters, wrappers, and embedded methods. More recently, a new group of methods has been added to the general framework of FS: ensemble techniques. The focus in this survey is on filter feature selection methods for informative feature discovery in gene expression microarray (GEM) analysis, which is also known as differentially expressed genes (DEGs) discovery, gene prioritization, or biomarker discovery. We present them in a unified framework, using standardized notations in order to reveal their technical details and to highlight their common characteristics as well as their particularities.
Affiliation(s)
- Cosmin Lazar
- Computational Modeling Group, Department of Computer Science, Vrije Universiteit Brussel, Pleinlaan 2, Brussels 1050, Belgium.
|
497
|
Silva-Fortes C, Amaral Turkman MA, Sousa L. Arrow plot: a new graphical tool for selecting up and down regulated genes and genes differentially expressed on sample subgroups. BMC Bioinformatics 2012; 13:147. [PMID: 22734592 PMCID: PMC3542259 DOI: 10.1186/1471-2105-13-147] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 12/07/2011] [Accepted: 06/14/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A common task in analyzing microarray data is to determine which genes are differentially expressed across two (or more) kinds of tissue samples or samples submitted under experimental conditions. Several statistical methods have been proposed to accomplish this goal, generally based on measures of distance between classes. It is well known that biological samples are heterogeneous because of factors such as molecular subtypes or genetic background that are often unknown to the experimenter. For instance, in experiments which involve molecular classification of tumors it is important to identify significant subtypes of cancer. Bimodal or multimodal distributions often reflect the presence of subsample mixtures. Consequently, there can be genes differentially expressed on sample subgroups which are missed if usual statistical approaches are used. In this paper we propose a new graphical tool which not only identifies genes with up and down regulations, but also genes with differential expression in different subclasses, which are usually missed if current statistical methods are used. This tool is based on two measures of distance between samples, namely the overlapping coefficient (OVL) between two densities and the area under the receiver operating characteristic (ROC) curve. The methodology proposed here was implemented in the open-source R software. RESULTS This method was applied to a publicly available dataset, as well as to a simulated dataset. We compared our results with the ones obtained using some of the standard methods for detecting differentially expressed genes, namely Welch t-statistic, fold change (FC), rank products (RP), average difference (AD), weighted average difference (WAD), moderated t-statistic (modT), intensity-based moderated t-statistic (ibmT), significance analysis of microarrays (samT) and area under the ROC curve (AUC). On both datasets all differentially expressed genes with bimodal or multimodal distributions were not selected by all standard selection procedures. We also compared our results with (i) area between ROC curve and rising area (ABCR) and (ii) the test for not proper ROC curves (TNRC). We found our methodology more comprehensive, because it detects both bimodal and multimodal distributions and different variances can be considered on both samples. Another advantage of our method is that we can analyze graphically the behavior of different kinds of differentially expressed genes. CONCLUSION Our results indicate that the arrow plot represents a new flexible and useful tool for the analysis of gene expression profiles from microarrays.
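To make the two measures named in this abstract concrete, here is a minimal Python sketch of an empirical AUC and a histogram-based estimate of the overlapping coefficient (OVL). This is not the authors' R implementation: the histogram estimator, the bin count, the function names (`auc`, `ovl`), and the toy expression values are assumptions chosen only to illustrate why a gene with a bimodal distribution in one group can look uninformative by AUC alone while its OVL is small.

```python
def auc(neg, pos):
    """Empirical AUC: share of (pos, neg) pairs with pos > neg; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ovl(a, b, bins=10):
    """Histogram estimate of OVL: sum over bins of min(proportion_a, proportion_b).

    A histogram is a simplifying assumption; a density estimate could be used instead.
    """
    lo, hi = min(a + b), max(a + b)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are identical

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
            counts[idx] += 1
        return [c / len(values) for c in counts]

    return sum(min(pa, pb) for pa, pb in zip(proportions(a), proportions(b)))

# Hypothetical expression values for one gene.
control = [5.0, 5.1, 4.9, 5.2, 4.8, 5.0]
tumor_bimodal = [3.0, 3.1, 2.9, 7.0, 7.1, 6.9]  # half down-regulated, half up-regulated

auc_bimodal = auc(control, tumor_bimodal)  # 0.5: looks like no difference
ovl_bimodal = ovl(control, tumor_bimodal)  # 0.0: the two groups do not overlap at all
print(auc_bimodal, ovl_bimodal)
```

A gene with AUC near 0.5 but low OVL is the kind of case the abstract describes as missed by standard methods, since the two subgroups cancel out in a rank-based measure while the distributions themselves barely overlap.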
Affiliation(s)
- Carina Silva-Fortes
- Natural and Exact Sciences Department, Higher School of Health Technology of Lisbon of Polytechnic Institute of Lisbon and Center of Statistics and Applications of University of Lisbon, Lisbon, Portugal.
|
498
|
Hu P, Bull SB, Jiang H. Gene network modular-based classification of microarray samples. BMC Bioinformatics 2012; 13 Suppl 10:S17. [PMID: 22759422 PMCID: PMC3314572 DOI: 10.1186/1471-2105-13-s10-s17] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022] Open
Abstract
Background A molecular predictor is a new tool for disease diagnosis, which uses gene expression to classify the diagnostic category of a patient. The statistical challenge in constructing such a predictor is that there are thousands of genes available to predict the disease categories, but only a small number of samples. Results We proposed a gene network modular-based linear discriminant analysis approach that integrates the 'essential' correlation structure among genes into the predictor, so that the modules or cluster structures of genes related to the diagnostic classes of interest can have a potential biological interpretation. We compared the performance of the new method with that of other established classification methods using three real data sets. Conclusions Our results show that the new approach has the advantage of computational simplicity and efficiency, with relatively lower classification error rates than the compared methods in many cases. The modular-based linear discriminant analysis approach introduced in the study has the potential to increase the power of discriminant analysis in microarray studies where sample sizes are small and the number of genes is large.
Affiliation(s)
- Pingzhao Hu
- Department of Computer Science and Engineering, York University, Toronto, M3J 1P3, Canada.
|
499
|
Sharma A, Imoto S, Miyano S. A between-class overlapping filter-based method for transcriptome data analysis. J Bioinform Comput Biol 2012; 10:1250010. [PMID: 22849365 DOI: 10.1142/s0219720012500102] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
Abstract
Feature selection algorithms play a crucial role in identifying and discovering important genes for cancer classification. Feature selection algorithms can be broadly categorized into two main groups: filter-based methods and wrapper-based methods. Filter-based methods have been quite popular in the literature due to their many advantages, including computational efficiency, simplistic architecture, and an intuitively simple means of discovering biological and clinical aspects. However, these methods have limitations, and the classification accuracy of the selected genes is lower. In this paper, we propose a set of univariate filter-based methods using a between-class overlapping criterion. The proposed techniques have been compared with many other univariate filter-based methods using an acute leukemia dataset. The following properties have been examined: classification accuracy of the selected individual genes and the gene subsets; redundancy check among selected genes using ridge regression and LASSO methods; similarity and sensitivity analyses; functional analysis; and stability analysis. A comprehensive experiment shows promising results for our proposed techniques. The univariate filter-based methods using a between-class overlapping criterion are accurate and robust, have biological significance, and are computationally efficient and easy to implement. Therefore, they are well suited for biological and clinical discoveries.
Affiliation(s)
- Alok Sharma
- Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan.
|
500
|
Shah RD, Samworth RJ. Variable selection with error control: another look at stability selection. J R Stat Soc Series B Stat Methodol 2012. [DOI: 10.1111/j.1467-9868.2011.01034.x] [Citation(s) in RCA: 158] [Impact Index Per Article: 12.2] [Indexed: 12/01/2022]
|