101
Ocampo-Vega R, Sanchez-Ante G, de Luna MA, Vega R, Falcón-Morales LE, Sossa H. Improving pattern classification of DNA microarray data by using PCA and logistic regression. Intell Data Anal 2016. [DOI: 10.3233/ida-160845] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/15/2022]
Affiliation(s)
- Ricardo Ocampo-Vega
  - Data Visualization and Pattern Recognition Lab, Tecnológico de Monterrey, Campus Guadalajara, Zapopan, México
- Gildardo Sanchez-Ante
  - Data Visualization and Pattern Recognition Lab, Tecnológico de Monterrey, Campus Guadalajara, Zapopan, México
- Marco A. de Luna
  - Data Visualization and Pattern Recognition Lab, Tecnológico de Monterrey, Campus Guadalajara, Zapopan, México
- Roberto Vega
  - Data Visualization and Pattern Recognition Lab, Tecnológico de Monterrey, Campus Guadalajara, Zapopan, México
- Luis E. Falcón-Morales
  - Data Visualization and Pattern Recognition Lab, Tecnológico de Monterrey, Campus Guadalajara, Zapopan, México
- Humberto Sossa
  - Instituto Politécnico Nacional-CIC, México, Distrito Federal, México
102
Pasolli E, Truong DT, Malik F, Waldron L, Segata N. Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights. PLoS Comput Biol 2016; 12:e1004977. [PMID: 27400279 PMCID: PMC4939962 DOI: 10.1371/journal.pcbi.1004977] [Citation(s) in RCA: 344] [Impact Index Per Article: 38.2] [Received: 11/25/2015] [Accepted: 05/11/2016] [Indexed: 12/12/2022]
Abstract
Shotgun metagenomic analysis of the human associated microbiome provides a rich set of microbial features for prediction and biomarker discovery in the context of human diseases and health conditions. However, the use of such high-resolution microbial features presents new challenges, and validated computational tools for learning tasks are lacking. Moreover, classification rules have scarcely been validated in independent studies, posing questions about the generality and generalization of disease-predictive models across cohorts. In this paper, we comprehensively assess approaches to metagenomics-based prediction tasks and for quantitative assessment of the strength of potential microbiome-phenotype associations. We develop a computational framework for prediction tasks using quantitative microbiome profiles, including species-level relative abundances and presence of strain-specific markers. A comprehensive meta-analysis, with particular emphasis on generalization across cohorts, was performed in a collection of 2424 publicly available metagenomic samples from eight large-scale studies. Cross-validation revealed good disease-prediction capabilities, which were in general improved by feature selection and use of strain-specific markers instead of species-level taxonomic abundance. In cross-study analysis, models transferred between studies were in some cases less accurate than models tested by within-study cross-validation. Interestingly, the addition of healthy (control) samples from other studies to training sets improved disease prediction capabilities. Some microbial species (most notably Streptococcus anginosus) seem to characterize general dysbiotic states of the microbiome rather than connections with a specific disease. Our results in modelling features of the “healthy” microbiome can be considered a first step toward defining general microbial dysbiosis. 
The software framework, microbiome profiles, and metadata for thousands of samples are publicly available at http://segatalab.cibio.unitn.it/tools/metaml.
The human microbiome, the entire set of microbial organisms associated with the human host, interacts closely with host immune and metabolic functions and is crucial for human health. Significant advances in the characterization of the microbiome associated with healthy and diseased individuals have been obtained through next-generation DNA sequencing technologies, which permit accurate estimation of microbial communities directly from uncultured human-associated samples (e.g., stool). In particular, shotgun metagenomics provides data at unprecedented species- and strain-level resolution. Several large-scale metagenomic disease-associated datasets are also becoming available, and disease-predictive models built on metagenomic signatures have been proposed. However, the generalization of the resulting prediction models to different cohorts and diseases has not been validated. In this paper, we comprehensively assess approaches to metagenomics-based prediction tasks and to the quantitative assessment of microbiome-phenotype associations. We consider 2424 samples from eight studies and six different diseases to assess the independent prediction accuracy of models built on shotgun metagenomic data and to compare strategies for practical use of the microbiome as a prediction tool.
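The cross-study analysis described in this abstract (train on some cohorts, test on a cohort the model has never seen) can be sketched as a leave-one-study-out loop. The sketch below is illustrative only: it uses a plain nearest-centroid classifier and invented toy abundances, not the authors' framework or data, and the function names are hypothetical.

```python
from collections import defaultdict


def fit_centroids(X, y):
    """Return one mean feature vector per class (a nearest-centroid model)."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(centroids[c], row))
    return min(centroids, key=dist2)


def leave_one_study_out(X, y, study):
    """Train on all studies but one and report accuracy on the held-out study."""
    results = {}
    for held_out in sorted(set(study)):
        train = [i for i, s in enumerate(study) if s != held_out]
        test = [i for i, s in enumerate(study) if s == held_out]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        results[held_out] = hits / len(test)
    return results


# Invented toy data: two relative-abundance features, three hypothetical studies.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8],
     [0.7, 0.3], [0.3, 0.7], [0.85, 0.15], [0.15, 0.85]]
y = ["control", "control", "disease", "disease",
     "control", "disease", "control", "disease"]
study = ["A", "A", "A", "A", "B", "B", "C", "C"]

accuracy_by_study = leave_one_study_out(X, y, study)
print(accuracy_by_study)
```

Grouping the split by study rather than by sample is the point of the design: a within-study cross-validation would let cohort-specific signal leak into the training folds.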
Affiliation(s)
- Edoardo Pasolli
  - Centre for Integrative Biology, University of Trento, Trento, Italy
- Duy Tin Truong
  - Centre for Integrative Biology, University of Trento, Trento, Italy
- Faizan Malik
  - Graduate School of Public Health and Health Policy, City University of New York, New York, New York, United States of America
- Levi Waldron
  - Graduate School of Public Health and Health Policy, City University of New York, New York, New York, United States of America
- Nicola Segata
  - Centre for Integrative Biology, University of Trento, Trento, Italy
103
Faria AWC, da Silva AM, de Souza Rodrigues T, Costa MA, Braga AP. A Ranking Approach for Probe Selection and Classification of Microarray Data with Artificial Neural Networks. J Comput Biol 2016; 22:953-61. [PMID: 26418055 DOI: 10.1089/cmb.2013.0125] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
Acute leukemia classification into its myeloid and lymphoblastic subtypes is usually accomplished according to the morphology of the tumor. Nevertheless, the subtypes may have similar histopathological appearance, making screening procedures difficult. In addition, approximately one-third of acute myeloid leukemias are characterized by aberrant cytoplasmic localization of nucleophosmin (NPMc(+)), the majority of which have a normal karyotype. This work uses two publicly available DNA microarray datasets to differentiate leukemia subtypes. The datasets were split into training and test sets, and feature selection methods were applied. Artificial neural network classifiers were developed to compare the feature selection methods. For the first dataset, 50 genes selected using the best classifier were able to classify all patients in the test set. For the second dataset, five genes yielded 97.5% accuracy in the test set.
Affiliation(s)
- Thiago de Souza Rodrigues
  - Computer Department, Federal Center of Technological Education of Minas Gerais, Belo Horizonte, MG, Brazil
- Marcelo Azevedo Costa
  - Graduate Program in Electrical Engineering, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
- Antonio Padua Braga
  - Graduate Program in Electrical Engineering, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
104
Nayyeri M, Sharifi Noghabi H. Cancer classification by correntropy-based sparse compact incremental learning machine. Gene Reports 2016. [DOI: 10.1016/j.genrep.2016.01.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/22/2022]
105
Duy J, Koehler JW, Honko AN, Schoepp RJ, Wauquier N, Gonzalez JP, Pitt ML, Mucker EM, Johnson JC, O’Hearn A, Bangura J, Coomber M, Minogue TD. Circulating microRNA profiles of Ebola virus infection. Sci Rep 2016; 6:24496. [PMID: 27098369 PMCID: PMC4838880 DOI: 10.1038/srep24496] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 01/12/2016] [Accepted: 03/30/2016] [Indexed: 01/08/2023]
Abstract
Early detection of Ebola virus (EBOV) infection is essential to halting transmission and adjudicating appropriate treatment. However, current methods rely on viral identification, and this approach can misdiagnose presymptomatic and asymptomatic individuals. In contrast, disease-driven alterations in the host transcriptome can be exploited for pathogen-specific diagnostic biomarkers. Here, we present for the first time EBOV-induced changes in circulating miRNA populations of nonhuman primates (NHPs) and humans. We retrospectively profiled longitudinally-collected plasma samples from rhesus macaques challenged via intramuscular and aerosol routes and found 36 miRNAs differentially present in both groups. Comparison of miRNA abundances to viral loads uncovered 15 highly correlated miRNAs common to EBOV-infected NHPs and humans. As proof of principle, we developed an eight-miRNA classifier that correctly categorized infection status in 64/74 (86%) human and NHP samples. The classifier identified acute infections in 27/29 (93.1%) samples and in 6/12 (50%) presymptomatic NHPs. These findings showed applicability of NHP-derived miRNAs to a human cohort, and with additional research the resulting classifiers could impact the current capability to diagnose presymptomatic and asymptomatic EBOV infections.
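The three proportions quoted in this abstract can be checked directly from the reported counts. The short sketch below only redoes that arithmetic; the dictionary keys are descriptive labels chosen here, not terms from the paper.

```python
def percent(correct, total):
    """Return correct/total as a percentage."""
    return 100.0 * correct / total


# Counts exactly as reported in the abstract.
reported = {
    "all human and NHP samples": (64, 74),
    "acute infections": (27, 29),
    "presymptomatic NHPs": (6, 12),
}

for name, (correct, total) in reported.items():
    print(f"{name}: {correct}/{total} = {percent(correct, total):.1f}%")
```

The overall figure comes out at about 86.5%, consistent with the 86% stated in the abstract.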
Affiliation(s)
- Janice Duy
  - Diagnostic Systems Division, U.S. Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD, USA
- Jeffrey W. Koehler
  - Diagnostic Systems Division, U.S. Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD, USA
- Anna N. Honko
  - Virology Division, U.S. Army Medical Institute of Infectious Diseases, Fort Detrick, Frederick, MD, USA
- Randal J. Schoepp
  - Diagnostic Systems Division, U.S. Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD, USA
- M. Louise Pitt
  - Virology Division, U.S. Army Medical Institute of Infectious Diseases, Fort Detrick, Frederick, MD, USA
- Eric M. Mucker
  - Virology Division, U.S. Army Medical Institute of Infectious Diseases, Fort Detrick, Frederick, MD, USA
- Joshua C. Johnson
  - Virology Division, U.S. Army Medical Institute of Infectious Diseases, Fort Detrick, Frederick, MD, USA
- Aileen O’Hearn
  - Diagnostic Systems Division, U.S. Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD, USA
- Timothy D. Minogue
  - Diagnostic Systems Division, U.S. Army Medical Research Institute of Infectious Diseases, Fort Detrick, Frederick, MD, USA
106
Cho SS, Kim Y, Yoon J, Seo M, Shin SK, Kwon EY, Kim SE, Bae YJ, Lee S, Sung MK, Choi MS, Park T. A Model-Based Joint Identification of Differentially Expressed Genes and Phenotype-Associated Genes. PLoS One 2016; 11:e0149086. [PMID: 26964035 PMCID: PMC4786130 DOI: 10.1371/journal.pone.0149086] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/19/2014] [Accepted: 01/27/2016] [Indexed: 12/11/2022]
Abstract
Over the last decade, many analytical methods and tools have been developed for microarray data. The detection of differentially expressed genes (DEGs) among different treatment groups is often a primary purpose of microarray data analysis. In addition, association studies investigating the relationship between genes and a phenotype of interest such as survival time are also popular in microarray data analysis. Phenotype association analysis provides a list of phenotype-associated genes (PAGs). However, it is sometimes necessary to identify genes that are both DEGs and PAGs. We consider the joint identification of DEGs and PAGs in microarray data analyses. The first approach we used was a naïve approach that detects DEGs and PAGs separately and then identifies the genes in the intersection of the lists of PAGs and DEGs. The second approach we considered was a hierarchical approach that detects DEGs first and then chooses PAGs from among the DEGs, or vice versa. In this study, we propose a new model-based approach for the joint identification of DEGs and PAGs. Unlike the previous two-step approaches, the proposed method simultaneously identifies genes that are both DEGs and PAGs. This method uses standard regression models but adopts a different null hypothesis from ordinary regression models, which allows us to perform joint identification in one step. The proposed model-based methods were evaluated using experimental data and simulation studies. The proposed methods were used to analyze a microarray experiment in which the main interest lies in detecting genes that are both DEGs and PAGs, where DEGs are identified between two diet groups and PAGs are associated with four phenotypes reflecting the expression of leptin, adiponectin, insulin-like growth factor 1, and insulin. Model-based approaches identified a larger number of genes that are both DEGs and PAGs than the other methods. Simulation studies showed that they have more power than the other methods.
Through analysis of data from experimental microarrays and simulation studies, the proposed model-based approach was shown to be more powerful than the naïve approach and the hierarchical approach. Since our approach is model-based, it is very flexible and can easily handle different types of covariates.
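The two baseline strategies named in this abstract can be sketched with plain set operations. This is a minimal illustration, not the authors' procedure: the Bonferroni threshold, the choice to correct the second stage over the DEG count only, the gene names, and the p-values are all assumptions made here. The proposed model-based method is not reproduced.

```python
def bonferroni_hits(pvalues, alpha=0.05):
    """Genes whose p-value passes a Bonferroni threshold over len(pvalues) tests."""
    m = len(pvalues)
    return {gene for gene, p in pvalues.items() if p <= alpha / m}


def naive_joint(p_deg, p_pag, alpha=0.05):
    """Naive approach: call DEGs and PAGs separately, then intersect the lists."""
    return bonferroni_hits(p_deg, alpha) & bonferroni_hits(p_pag, alpha)


def hierarchical_joint(p_deg, p_pag, alpha=0.05):
    """Hierarchical approach: call DEGs first, then test for PAGs among DEGs only."""
    degs = bonferroni_hits(p_deg, alpha)
    return bonferroni_hits({gene: p_pag[gene] for gene in degs}, alpha)


# Invented p-values for four hypothetical genes.
p_deg = {"g1": 0.001, "g2": 0.004, "g3": 0.200, "g4": 0.600}
p_pag = {"g1": 0.002, "g2": 0.020, "g3": 0.001, "g4": 0.500}

naive = naive_joint(p_deg, p_pag)
hierarchical = hierarchical_joint(p_deg, p_pag)
print(sorted(naive), sorted(hierarchical))
```

In this toy case the hierarchical variant keeps one more gene because its second stage corrects for two tests instead of four, which shows how the two-step approaches can disagree on the same data.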
Affiliation(s)
- Samuel Sunghwan Cho
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Kwan-ak St. 599, Kwan-ak Gu, Seoul, Korea
- Yongkang Kim
  - Department of Statistics, Seoul National University, Kwan-ak St. 599, Kwan-ak Gu, Seoul, Korea
- Joon Yoon
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Kwan-ak St. 599, Kwan-ak Gu, Seoul, Korea
- Minseok Seo
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Kwan-ak St. 599, Kwan-ak Gu, Seoul, Korea
- Su-kyung Shin
  - Center for Food and Nutritional Genomics Research, Department of Food Science and Nutrition, Kyungpook National University, Daegu, Korea
- Eun-Young Kwon
  - Center for Food and Nutritional Genomics Research, Department of Food Science and Nutrition, Kyungpook National University, Daegu, Korea
- Sung-Eun Kim
  - Department of Food and Nutrition, Sookmyung Women’s University, Seoul, Korea
- Yun-Jung Bae
  - Division of Food Science and Culinary Arts, Shinhan University, Gyeonggi, Korea
- Seungyeoun Lee
  - Department of Mathematics and Statistics, Sejong University, Seoul, Korea
- Mi-Kyung Sung
  - Department of Food and Nutrition, Sookmyung Women’s University, Seoul, Korea
- Myung-Sook Choi
  - Center for Food and Nutritional Genomics Research, Department of Food Science and Nutrition, Kyungpook National University, Daegu, Korea
- Taesung Park
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Kwan-ak St. 599, Kwan-ak Gu, Seoul, Korea
  - Department of Statistics, Seoul National University, Kwan-ak St. 599, Kwan-ak Gu, Seoul, Korea
107
Gong P, Nan X, Barker ND, Boyd RE, Chen Y, Wilkins DE, Johnson DR, Suedel BC, Perkins EJ. Predicting chemical bioavailability using microarray gene expression data and regression modeling: A tale of three explosive compounds. BMC Genomics 2016; 17:205. [PMID: 26956490 PMCID: PMC4784335 DOI: 10.1186/s12864-016-2541-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/23/2015] [Accepted: 02/25/2016] [Indexed: 11/10/2022]
Abstract
Background: Chemical bioavailability is an important dose metric in environmental risk assessment. Although many approaches have been used to evaluate bioavailability, none is free from limitations. Previously, we developed a new genomics-based approach that integrated microarray technology and regression modeling for predicting bioavailability (tissue residue) of explosive compounds in exposed earthworms. In the present study, we further compared 18 different regression models and performed variable selection simultaneously with parameter estimation.
Results: This refined approach was applied to both previously collected and newly acquired earthworm microarray gene expression datasets for three explosive compounds. Our results demonstrate that a prediction accuracy of R² = 0.71–0.82 was achievable at a relatively low model complexity, with as few as 3–10 predictor genes per model. These results are much more encouraging than our previous ones.
Conclusion: This study has demonstrated that our approach is promising for bioavailability measurement, which warrants further studies of mixed contamination scenarios in field settings.
Affiliation(s)
- Ping Gong
  - Environmental Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
- Xiaofei Nan
  - Department of Computer and Information Science, University of Mississippi, Oxford, Mississippi, 38677, USA. Present address: School of Information Engineering, Zhengzhou University, Zhengzhou, Henan, 450001, China.
- Robert E Boyd
  - Environmental Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
- Yixin Chen
  - Department of Computer and Information Science, University of Mississippi, Oxford, Mississippi, 38677, USA.
- Dawn E Wilkins
  - Department of Computer and Information Science, University of Mississippi, Oxford, Mississippi, 38677, USA.
- Burton C Suedel
  - Environmental Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
- Edward J Perkins
  - Environmental Laboratory, US Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
108
Armananzas R, Iglesias M, Morales DA, Alonso-Nanclares L. Voxel-Based Diagnosis of Alzheimer's Disease Using Classifier Ensembles. IEEE J Biomed Health Inform 2016; 21:778-784. [PMID: 28113481 DOI: 10.1109/jbhi.2016.2538559] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Indexed: 11/10/2022]
Abstract
Functional magnetic resonance imaging (fMRI) is one of the most promising noninvasive techniques for early Alzheimer's disease (AD) diagnosis. In this paper, we explore the application of different machine learning techniques to the classification of fMRI data for this purpose. The functional images were first preprocessed using the statistical parametric mapping toolbox to output individual maps of statistically activated voxels. A fast filter was applied afterwards to select voxels commonly activated across demented and nondemented groups. Four feature ranking selection techniques were embedded into a wrapper scheme using an inner-outer loop for the selection of relevant voxels. The wrapper approach was guided by the performance of six pattern recognition models, three of which were ensemble classifiers based on stochastic searches. Final classification performance was assessed from the nested internal and external cross-validation loops taking several voxel sets ordered by importance. Numerical performance was evaluated using statistical tests, and the best combination of voxel selection and classification reached a 97.14% average accuracy. Results repeatedly pointed out Brodmann regions with distinct activation patterns between demented and nondemented profiles, indicating that the machine learning analysis described is a powerful method to detect differences in several brain regions between both groups.
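The inner-outer loop mentioned in this abstract is a nested cross-validation: the inner loop chooses voxels (or other settings) and the outer loop estimates performance on data the inner loop never touched. A minimal sketch of the index bookkeeping follows; the fold counts and function names are assumptions, and the folds are contiguous and unshuffled for simplicity.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_cv_splits(n, outer_k, inner_k):
    """Yield (inner_train, inner_validation, outer_test) index lists."""
    for outer_test in k_fold_indices(n, outer_k):
        held_out = set(outer_test)
        outer_train = [i for i in range(n) if i not in held_out]
        for fold in k_fold_indices(len(outer_train), inner_k):
            positions = set(fold)
            inner_val = [outer_train[j] for j in fold]
            inner_train = [idx for j, idx in enumerate(outer_train)
                           if j not in positions]
            yield inner_train, inner_val, outer_test


splits = list(nested_cv_splits(n=10, outer_k=5, inner_k=4))
print(len(splits), splits[0])
```

Feature selection would be fitted on `inner_train`, compared on `inner_val`, and the winning configuration scored once per outer fold on `outer_test`.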
109
Choi SB, Park JS, Chung JW, Kim SW, Kim DW. Prediction of ATLS hypovolemic shock class in rats using the perfusion index and lactate concentration. Shock 2016; 43:361-8. [PMID: 25394246 DOI: 10.1097/shk.0000000000000296] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 11/25/2022]
Abstract
It is necessary to quickly and accurately determine Advanced Trauma Life Support (ATLS) hemorrhagic shock class for triage in cases of acute hemorrhage caused by trauma. However, the ATLS classification has limitations, namely, with regard to primary vital signs. This study identified the optimal variables for appropriate triage of hemorrhage severity, including the peripheral perfusion index and serum lactate concentration in addition to the conventional primary vital signs. To predict the four ATLS classes, three popular machine learning algorithms with four feature selection methods for multicategory classification were applied to a rat model of acute hemorrhage. A total of 78 anesthetized rats were divided into four groups for ATLS classification based on blood loss (in percent). The support vector machine one-versus-one model with the Kruskal-Wallis feature selection method performed best, with 80.8% accuracy, relative classifier information of 0.629, and a kappa index of 0.732. The new hemorrhage-induced severity index (lactate concentration/perfusion index), diastolic blood pressure, mean arterial pressure, and the perfusion index were selected as the optimal variables for predicting the four ATLS classes by support vector machine one-versus-one with the Kruskal-Wallis method. These four variables were also selected for binary classification to predict ATLS classes I and II versus III and IV for blood transfusion requirement. The suggested ATLS classification system would be helpful to first responders by indicating the severity of patients, allowing physicians to prepare suitable resuscitation before hospital arrival, which could hasten treatment initiation.
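Kruskal-Wallis feature selection, as named in this abstract, ranks each variable by how strongly its rank distribution differs between classes. The sketch below computes the H statistic without the tie-correction factor and ranks two invented variables over three toy classes; the variable names and values are hypothetical, and the study itself used four ATLS classes.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def kruskal_wallis_h(values, labels):
    """Kruskal-Wallis H statistic (no tie correction applied)."""
    n = len(values)
    rank_sums, sizes = {}, {}
    for rank, label in zip(average_ranks(values), labels):
        rank_sums[label] = rank_sums.get(label, 0.0) + rank
        sizes[label] = sizes.get(label, 0) + 1
    total = sum(rank_sums[g] ** 2 / sizes[g] for g in rank_sums)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Invented toy measurements for three classes, two samples each.
labels = ["I", "I", "II", "II", "III", "III"]
features = {
    "perfusion_index": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    "noise": [1.0, 6.0, 2.0, 5.0, 3.0, 4.0],
}

ranking = sorted(features,
                 key=lambda name: kruskal_wallis_h(features[name], labels),
                 reverse=True)
print(ranking)
```

Because the statistic depends only on ranks, it makes no normality assumption, which suits small samples such as the 78 animals described here.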
Affiliation(s)
- Soo Beom Choi
- *Department of Medical Engineering, Yonsei University College of Medicine; †Brain Korea 21 PLUS Project for Medical Science, Yonsei University; ‡Department of Medicine, Yonsei University College of Medicine; and §Graduate Program in Biomedical Engineering, Yonsei University, Seoul, Republic of Korea
110
Park JS, Choi SB, Kim HJ, Cho NH, Kim SW, Kim YT, Nam EJ, Chung JW, Kim DW. Intraoperative Diagnosis Support Tool for Serous Ovarian Tumors Based on Microarray Data Using Multicategory Machine Learning. Int J Gynecol Cancer 2016; 26:104-13. [PMID: 26512784 DOI: 10.1097/igc.0000000000000566] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 12/13/2022]
Abstract
OBJECTIVES: Serous borderline ovarian tumors (SBOTs) are a subtype of serous ovarian carcinoma with atypical proliferation. Frozen-section diagnosis has been used as an intraoperative diagnosis tool to support fertility-sparing surgery, diagnosing SBOTs with an accuracy of 48% to 79%. Using DNA microarray technology, we designed multicategory classification models to support frozen-section diagnosis within 30 minutes.
MATERIALS AND METHODS: We systematically evaluated 6 machine learning algorithms and 3 feature selection methods using 5-fold cross-validation and a grid search on microarray data obtained from the National Center for Biotechnology Information. To validate the models and selected biomarkers, expression profiles were analyzed in tissue samples obtained from the Yonsei University College of Medicine.
RESULTS: The best accuracy of the optimal machine learning model was 97.3%. In addition, 5 features, including the expression of the putative biomarkers SNTN and AOX1, were selected to differentiate between normal, SBOT, and serous ovarian carcinoma groups. Different expression levels of SNTN and AOX1 were validated by real-time quantitative reverse-transcription polymerase chain reaction, Western blotting, and immunohistochemistry. A multinomial logistic regression model using SNTN and AOX1 alone was used to construct a simple-to-use equation that gave a diagnostic test accuracy of 91.9%.
CONCLUSIONS: We identified 2 biomarkers, SNTN and AOX1, that are likely involved in the pathogenesis and progression of ovarian tumors. Applying the equation to SNTN and AOX1 expression levels, alongside frozen-section diagnosis, would offer a new and accurate tool for diagnosing ovarian tumor subclasses within 30 minutes.
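The abstract does not reproduce the equation or its coefficients. For orientation only, the general form of a three-class multinomial logistic regression on two expression values is sketched below; the symbols β and x are notation assumed here, with the normal class taken as the reference category.

```latex
\[
P(y = k \mid x) =
\frac{\exp\left(\beta_{k0} + \beta_{k1}\, x_{\mathrm{SNTN}} + \beta_{k2}\, x_{\mathrm{AOX1}}\right)}
     {\sum_{j \in \{\mathrm{normal},\, \mathrm{SBOT},\, \mathrm{carcinoma}\}}
      \exp\left(\beta_{j0} + \beta_{j1}\, x_{\mathrm{SNTN}} + \beta_{j2}\, x_{\mathrm{AOX1}}\right)},
\qquad \beta_{\mathrm{normal},0} = \beta_{\mathrm{normal},1} = \beta_{\mathrm{normal},2} = 0
\]
```

The predicted class is the one with the largest probability, so a fitted model of this form reduces to comparing three linear scores of the two expression values.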
MESH Headings
- Biomarkers, Tumor/genetics
- Blotting, Western
- Carcinoma, Ovarian Epithelial
- Cystadenocarcinoma, Serous/classification
- Cystadenocarcinoma, Serous/diagnosis
- Cystadenocarcinoma, Serous/genetics
- Cystadenocarcinoma, Serous/surgery
- Databases, Genetic
- Female
- Gene Expression Profiling
- Gene Expression Regulation, Neoplastic
- Humans
- Immunoenzyme Techniques
- Machine Learning
- Monitoring, Intraoperative/methods
- Neoplasm Staging
- Neoplasms, Glandular and Epithelial/classification
- Neoplasms, Glandular and Epithelial/diagnosis
- Neoplasms, Glandular and Epithelial/genetics
- Neoplasms, Glandular and Epithelial/surgery
- Ovarian Neoplasms/classification
- Ovarian Neoplasms/diagnosis
- Ovarian Neoplasms/genetics
- Ovarian Neoplasms/surgery
- Predictive Value of Tests
- Prognosis
- RNA, Messenger/genetics
- Real-Time Polymerase Chain Reaction
- Reverse Transcriptase Polymerase Chain Reaction
- Support Vector Machine
- Survival Rate
Affiliation(s)
- Jee Soo Park
- *Department of Medical Engineering, Yonsei University College of Medicine; †Department of Medicine, Yonsei University College of Medicine, Seoul, Korea; ‡Graduate Program in Biomedical Engineering, Yonsei University, Seoul, Korea; and Department of §Obstetrics and Gynecology and ∥Pathology, Yonsei University College of Medicine, Seoul, Korea
111
de Boer TE, Janssens TKS, Legler J, van Straalen NM, Roelofs D. Combined Transcriptomics Analysis for Classification of Adverse Effects As a Potential End Point in Effect Based Screening. Environmental Science & Technology 2015; 49:14274-14281. [PMID: 26523736 DOI: 10.1021/acs.est.5b03443] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 06/05/2023]
Abstract
Environmental risk assessment relies on the use of bioassays to assess the environmental impact of chemicals. Gene expression is gaining acceptance as a valuable mechanistic end point in bioassays and effect-based screening. Data analysis and its results, however, are complex and often not directly applicable in risk assessment. Classifier analysis is a promising method to turn complex gene expression analysis results into answers suitable for risk assessment. We have assembled a large gene expression data set from multiple studies and experiments in the springtail Folsomia candida, with the aim of selecting a set of genes on which a classifier can be trained to detect general toxic stress. By performing differential expression analysis prior to classifier training, we were able to select a set of 135 genes that was enriched in stress-related processes. Classifier models built from this set were used to classify two test sets composed of chemical-spiked, polluted, and clean soils and were compared with models built using another, more traditional classifier feature selection. The gene set presented here outperformed the more traditionally selected gene set. This gene set has the potential to be used as a biomarker to test for adverse effects caused by chemicals in springtails and to provide end points in environmental risk assessment.
Affiliation(s)
- Tjalf E de Boer
  - Amsterdam Global Change Institute, VU University Amsterdam, De Boelelaan 1085, 1081 HV Amsterdam, The Netherlands
  - Department of Ecological Science, Faculty of Earth and Life Sciences, VU University Amsterdam, De Boelelaan 1085, 1081 HV Amsterdam, The Netherlands
- Juliette Legler
  - Institute for Environmental Studies, Faculty of Earth and Life Sciences, VU University Amsterdam, De Boelelaan 1085, 1081 HV Amsterdam, The Netherlands
- Nico M van Straalen
  - Department of Ecological Science, Faculty of Earth and Life Sciences, VU University Amsterdam, De Boelelaan 1085, 1081 HV Amsterdam, The Netherlands
- Dick Roelofs
  - Department of Ecological Science, Faculty of Earth and Life Sciences, VU University Amsterdam, De Boelelaan 1085, 1081 HV Amsterdam, The Netherlands
112
High-Performance Multiclass Classification Framework Using Cloud Computing Architecture. J Med Biol Eng 2015. [DOI: 10.1007/s40846-015-0100-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
113
Jung S, Bi Y, Davuluri RV. Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping. BMC Genomics 2015; 16 Suppl 11:S3. [PMID: 26576613 PMCID: PMC4652565 DOI: 10.1186/1471-2164-16-s11-s3] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 12/20/2022]
Abstract
BACKGROUND Many supervised learning algorithms have been applied in deriving gene signatures for patient stratification from gene expression data. However, transferring multi-gene signatures from one analytical platform to another without loss of classification accuracy is a major challenge. Here, we compared three unsupervised data discretization methods (Equal-width binning, Equal-frequency binning, and k-means clustering) in accurately classifying the four known subtypes of glioblastoma multiforme (GBM) when the classification algorithms were trained on the isoform-level gene expression profiles from the exon-array platform and tested on the corresponding profiles from RNA-seq data. RESULTS We applied an integrated machine learning framework that involves three sequential steps: feature selection, data discretization, and classification. For models trained and tested on exon-array data, the addition of a data discretization step led to robust and accurate predictive models with fewer variables in the final models. For models trained on exon-array data and tested on RNA-seq data, the addition of a data discretization step dramatically improved the classification accuracies, with Equal-frequency binning showing the highest improvement and reaching accuracies above 90% for all the models with features chosen by Random Forest based feature selection. Overall, the SVM classifier coupled with Equal-frequency binning achieved the best accuracy (> 95%). Without data discretization, however, an accuracy of at most 73.6% was achieved. CONCLUSIONS The classification algorithms, trained and tested on data from the same platform, yielded similar accuracies in predicting the four GBM subgroups. However, when dealing with cross-platform data, from exon-array to RNA-seq, the classifiers yielded stable models with the highest classification accuracies on data transformed by Equal-frequency binning. The approach presented here is generally applicable to other cancer types for classification and identification of molecular subgroups by integrating data across different gene expression platforms.
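As a rough illustration of why the abstract above favours Equal-frequency binning for cross-platform data, the following minimal Python sketch contrasts the two binning schemes. The function names and the toy values are hypothetical and are not taken from the paper; the sketch only assumes that equal-width bins split the value range evenly while equal-frequency bins split the ranked samples evenly.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum lies on the right edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding roughly equal counts.

    Labels depend only on the rank of each value (ties are broken by
    position), so any order-preserving rescaling gives the same labels.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(values)
    return labels


# Hypothetical expression values for one feature on two platforms.
array_like = [1.0, 2.0, 3.0, 100.0]
seq_like = [10, 20, 30, 1000]
print(equal_width_bins(array_like, 2))      # one outlier dominates the range
print(equal_frequency_bins(array_like, 2))
print(equal_frequency_bins(seq_like, 2))    # same labels despite a new scale
```

Because the equal-frequency labels are rank-based, a feature measured on a different scale keeps its discretized value as long as the ordering of samples is preserved, which is one plausible reading of the cross-platform result reported above.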
Affiliation(s)
- Segun Jung
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA
| | - Yingtao Bi
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA
| | - Ramana V Davuluri
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA
| |
|
114
|
Choi SB, Park JS, Chung JW, Yoo TK, Kim DW. Multicategory classification of 11 neuromuscular diseases based on microarray data using support vector machine. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2015; 2014:3460-3. [PMID: 25570735 DOI: 10.1109/embc.2014.6944367] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/05/2022]
Abstract
We applied multicategory machine learning methods to classify 11 neuromuscular disease groups and one control group based on microarray data. To develop multicategory classification models with optimal parameters and features, we performed a systematic evaluation of three machine learning algorithms and four feature selection methods using three-fold cross validation and a grid search. This study included 114 subjects with 11 neuromuscular diseases and 31 control subjects, using microarray data with 22,283 probe sets from the National Center for Biotechnology Information (NCBI). We obtained an accuracy of 100%, relative classifier information (RCI) of 1.0, and a kappa index of 1.0 by applying the models of support vector machines one-versus-one (SVM-OVO), SVM one-versus-rest (OVR), and directed acyclic graph SVM (DAGSVM), using the ratio of between-category to within-category sums of squares (BW) as the feature selection method. Each of these three models selected only four features to categorize the 12 groups, resulting in a time-saving and cost-effective strategy for diagnosing neuromuscular diseases. In addition, the gene SPP1 was selected as the top-ranked gene by the BW method. We confirmed from a previous study that SPP1 is related to Duchenne muscular dystrophy (DMD). With our models as clinically helpful tools, neuromuscular diseases could be classified quickly using a computer, thereby giving a time-saving, cost-effective, and accurate diagnosis.
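The BW feature selection score mentioned in the abstract above can be sketched as follows. This is a minimal illustration that assumes the common definition of the score for a single gene (between-category sum of squares divided by within-category sum of squares); the function name and the toy expression values are hypothetical, and the paper's exact implementation is not given in the abstract.

```python
def bw_ratio(values, labels):
    """Between-category to within-category sum-of-squares ratio for one gene."""
    overall_mean = sum(values) / len(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between = 0.0
    within = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        # Each sample contributes the squared gap between its category mean
        # and the overall mean to the between-category term.
        between += len(members) * (group_mean - overall_mean) ** 2
        within += sum((v - group_mean) ** 2 for v in members)
    return between / within if within > 0 else float("inf")


labels = [0, 0, 0, 1, 1, 1]
informative = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]    # categories well separated
uninformative = [1.0, 9.0, 5.0, 5.0, 1.0, 9.0]  # identical category means
print(bw_ratio(informative, labels), bw_ratio(uninformative, labels))
```

Genes would then be ranked by this score and the top-ranked ones passed to the classifier.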
|
115
|
Tsamardinos I, Rakhshani A, Lagani V. Performance-Estimation Properties of Cross-Validation-Based Protocols with Simultaneous Hyper-Parameter Optimization. INT J ARTIF INTELL T 2015. [DOI: 10.1142/s0218213015400230] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Indexed: 11/18/2022]
Abstract
In a typical supervised data analysis task, one needs to perform the following two tasks: (a) select an optimal combination of learning methods (e.g., for variable selection and classifier) and tune their hyper-parameters (e.g., K in K-NN), also called model selection, and (b) provide an estimate of the performance of the final, reported model. Combining the two tasks is not trivial because when one selects the set of hyper-parameters that seem to provide the best estimated performance, this estimation is optimistic (biased/overfitted) due to performing multiple statistical comparisons. In this paper, we discuss the theoretical properties of performance estimation when model selection is present and we confirm that the simple Cross-Validation with model selection is indeed optimistic (overestimates performance) in small sample scenarios and should be avoided. We present in detail and investigate the theoretical properties of the Nested Cross Validation and a method by Tibshirani and Tibshirani for removing the estimation bias. In computational experiments with real datasets both protocols provide conservative estimation of performance and should be preferred. These statements hold true even if feature selection is performed as preprocessing.
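The nested cross-validation protocol discussed in the abstract above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the 1-D k-NN classifier, the deterministic fold assignment, and all function names are assumptions chosen to keep the example self-contained. The key point is that K is tuned only on the outer training portion, so the outer test fold never influences model selection.

```python
def knn_predict(train, x, k):
    """Majority vote (labels 0/1) among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def folds(data, n_folds):
    """Yield (train, test) splits; sample i is assigned to fold i % n_folds."""
    for f in range(n_folds):
        test = [p for i, p in enumerate(data) if i % n_folds == f]
        train = [p for i, p in enumerate(data) if i % n_folds != f]
        yield train, test


def select_k(data, candidates, n_folds):
    """Inner loop: choose the K with the best cross-validated accuracy."""
    def cv_score(k):
        scores = [accuracy(tr, te, k) for tr, te in folds(data, n_folds)]
        return sum(scores) / len(scores)
    return max(candidates, key=cv_score)


def nested_cv(data, candidates, outer_folds=3, inner_folds=3):
    """Outer loop: score models whose K was tuned without the test fold."""
    scores = []
    for train, test in folds(data, outer_folds):
        best_k = select_k(train, candidates, inner_folds)
        scores.append(accuracy(train, test, best_k))
    return sum(scores) / len(scores)


# Hypothetical, well-separated toy data: (feature, label) pairs.
data = [(float(i), 0) for i in range(6)] + [(100.0 + i, 1) for i in range(6)]
print(nested_cv(data, candidates=[1, 3]))
```

By contrast, reporting the best inner cross-validation score directly as the performance estimate is the simple protocol that the abstract describes as optimistic.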
Affiliation(s)
- Ioannis Tsamardinos
- Department of Computer Science, University of Crete, Crete, Greece
- Institute of Computer Science, Foundation for Research and Technology — Hellas (FORTH), Heraklion Campus, Voutes, Heraklion, GR-700 13, Greece
| | - Amin Rakhshani
- Department of Computer Science, University of Crete, Crete, Greece
- Institute of Computer Science, Foundation for Research and Technology — Hellas (FORTH), Heraklion Campus, Voutes, Heraklion, GR-700 13, Greece
| | - Vincenzo Lagani
- Institute of Computer Science, Foundation for Research and Technology — Hellas (FORTH), Vassilika Vouton, Heraklion, GR-700 13, Greece
| |
|
116
|
Cancer classification using a novel gene selection approach by means of shuffling based on data clustering with optimization. Appl Soft Comput 2015. [DOI: 10.1016/j.asoc.2015.06.015] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Indexed: 01/24/2023]
|
117
|
Boulesteix AL, Hable R, Lauer S, Eugster MJA. A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies. AM STAT 2015. [DOI: 10.1080/00031305.2015.1005128] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 01/25/2023]
|
118
|
Li H, Zhao C, Shao F, Li GZ, Wang X. A hybrid imputation approach for microarray missing value estimation. BMC Genomics 2015; 16 Suppl 9:S1. [PMID: 26330180 PMCID: PMC4547405 DOI: 10.1186/1471-2164-16-s9-s1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Missing data is an inevitable phenomenon in gene expression microarray experiments due to instrument failure or human error. It has a negative impact on the performance of downstream analysis, and most existing approaches suffer from this prevalent problem. Imputation is one of the frequently used methods for processing missing data, and many advances have been made in research on estimating missing values. The challenging task is how to improve imputation accuracy for data with a large missing rate. METHODS In this paper, inspired by the idea of collaborative training, we propose a novel hybrid imputation method, called Recursive Mutual Imputation (RMI). Specifically, RMI exploits global correlation information and local structure in the data, captured by two popular methods, Bayesian Principal Component Analysis (BPCA) and Local Least Squares (LLS), respectively. The mutual strategy is implemented by sharing the estimated data sequences at each recursive step. Meanwhile, we determine the imputation sequence based on the number of missing entries in the target gene. Furthermore, a weight-based integration method is used in the final assembling step. RESULTS We evaluate RMI against three state-of-the-art algorithms (BPCA, LLS, and Iterated Local Least Squares imputation (ItrLLS)) on four publicly available microarray datasets. Experimental results clearly demonstrate that RMI significantly outperforms the comparison methods in terms of Normalized Root Mean Square Error (NRMSE), especially for datasets with large missing rates and fewer complete genes. CONCLUSIONS Our proposed hybrid imputation approach incorporates both global and local information of microarray genes and achieves lower NRMSE values than either single approach alone. In addition, this study highlights the need for imputation methods to consider the imputing sequence of missing entries.
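For readers unfamiliar with the NRMSE metric used in the abstract above, here is a minimal Python sketch. It assumes one widely used definition in the imputation literature (root mean squared error over the artificially removed entries, divided by the standard deviation of their true values); the abstract does not state which normalization the authors used, and the function name is hypothetical.

```python
import math


def nrmse(imputed, truth):
    """Root mean squared imputation error divided by the std of the true values.

    Assumed definition: sqrt(mean((imputed - truth)^2) / var(truth)),
    computed over the entries that were masked and then imputed.
    """
    n = len(truth)
    mse = sum((a - b) ** 2 for a, b in zip(imputed, truth)) / n
    mean_truth = sum(truth) / n
    var_truth = sum((t - mean_truth) ** 2 for t in truth) / n
    return math.sqrt(mse / var_truth)


truth = [1.0, 2.0, 3.0, 4.0]
print(nrmse(truth, truth))       # perfect imputation
print(nrmse([2.5] * 4, truth))   # imputing the mean everywhere
```

Under this definition, a value near 0 means near-perfect recovery, and a value near 1 means the method does no better than filling in the mean of the true entries.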
|
119
|
Lagani V, Chiarugi F, Manousos D, Verma V, Fursse J, Marias K, Tsamardinos I. Realization of a service for the long-term risk assessment of diabetes-related complications. J Diabetes Complications 2015; 29:691-8. [PMID: 25953402 DOI: 10.1016/j.jdiacomp.2015.03.011] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 11/28/2014] [Revised: 03/06/2015] [Accepted: 03/17/2015] [Indexed: 11/21/2022]
Abstract
AIM We present a computerized system for the assessment of the long-term risk of developing diabetes-related complications. METHODS The core of the system consists of a set of predictive models, developed through a data-mining/machine-learning approach, which are able to evaluate individual patient profiles and provide personalized risk assessments. Missing data is a common issue in (electronic) patient records, thus the models are paired with a module for the intelligent management of missing information. RESULTS The system has been deployed and made publicly available as a Web service, and it has been fully integrated within the diabetes-management platform developed by the European project REACTION. Preliminary usability tests showed that the clinicians judged the models useful for risk assessment and for communicating the risk to the patient. Furthermore, the system performs as well as the United Kingdom Prospective Diabetes Study (UKPDS) Risk Engine when both systems are tested on an independent cohort of UK diabetes patients. CONCLUSIONS Our work provides a working example of a risk-stratification tool that (a) is specific for diabetes patients, (b) is able to handle several different diabetes-related complications, and (c) performs as well as the widely known UKPDS Risk Engine on an external validation cohort.
Affiliation(s)
- Vincenzo Lagani
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece
| | - Franco Chiarugi
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece
| | - Dimitris Manousos
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece
| | - Vivek Verma
- Department of Information Systems, Computing and Mathematics, Brunel University, Uxbridge, United Kingdom
| | - Joanna Fursse
- Chorleywood Health Center, Chorleywood, United Kingdom
| | - Kostas Marias
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece
| | - Ioannis Tsamardinos
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece; Department of Computer Science, University of Crete, Heraklion, Greece
| |
|
120
|
Factors affecting the accuracy of a class prediction model in gene expression data. BMC Bioinformatics 2015; 16:199. [PMID: 26093633 PMCID: PMC4475623 DOI: 10.1186/s12859-015-0610-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 01/08/2015] [Accepted: 04/30/2015] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND Class prediction models have been shown to have varying performances in clinical gene expression datasets. Previous evaluation studies, mostly done in the field of cancer, showed that the accuracy of class prediction models differs from dataset to dataset and depends on the type of classification function. While a substantial amount of information is known about the characteristics of classification functions, little has been done to determine which characteristics of gene expression data have an impact on the performance of a classifier. This study aims to empirically identify data characteristics that affect the predictive accuracy of classification models, outside of the field of cancer. RESULTS Datasets from twenty-five studies meeting predefined inclusion and exclusion criteria were downloaded. Nine classification functions were chosen, falling within the categories of discriminant analyses or Bayes classifiers, tree-based methods, regularization and shrinkage methods, and nearest neighbors methods. Consequently, nine class prediction models were built for each dataset using the same procedure, and their performances were evaluated by calculating their accuracies. The characteristics of each experiment were recorded (i.e., observed disease, medical question, tissue/cell types and sample size) together with characteristics of the gene expression data, namely the number of differentially expressed genes, the fold changes and the within-class correlations. Their effects on the accuracy of a class prediction model were statistically assessed by random effects logistic regression. The number of differentially expressed genes and the average fold change had a significant impact on the accuracy of a classification model and individually explained up to 72% and 57% of the variation in prediction accuracy, respectively. Multivariable random effects logistic regression with forward selection yielded the two aforementioned study factors and the within-class correlation as factors affecting the accuracy of classification functions, explaining 91.5% of the between-study variation. CONCLUSIONS We evaluated study- and data-related factors that might explain the varying performances of classification functions in non-cancerous datasets. Our results showed that the number of differentially expressed genes, the fold change, and the correlation in gene expression data significantly affect the accuracy of class prediction models.
|
121
|
Hybrid Classification Techniques for Microarray Data. NATIONAL ACADEMY SCIENCE LETTERS-INDIA 2015. [DOI: 10.1007/s40009-015-0390-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/14/2023]
|
122
|
Aibar S, Fontanillo C, Droste C, Roson-Burgo B, Campos-Laborie FJ, Hernandez-Rivas JM, De Las Rivas J. Analyse multiple disease subtypes and build associated gene networks using genome-wide expression profiles. BMC Genomics 2015; 16 Suppl 5:S3. [PMID: 26040557 PMCID: PMC4460584 DOI: 10.1186/1471-2164-16-s5-s3] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND Despite the large increase in transcriptomic studies that look for gene signatures of diseases, there is still a need for integrative approaches that separate multiple pathological states, providing robust selection of gene markers for each disease subtype and information about the possible links or relations between those genes. RESULTS We present a network-oriented and data-driven bioinformatic approach that searches for associations between genes and diseases based on the analysis of genome-wide expression data derived from microarray or RNA-Seq studies. The approach aims to (i) identify gene sets associated with different pathological states analysed together; (ii) identify a minimum subset within these genes that unequivocally differentiates and classifies the compared disease subtypes; (iii) provide a measurement of the discriminant power of these genes and (iv) identify links between the genes that characterise each of the disease subtypes. This bioinformatic approach is implemented in an R package, named geNetClassifier, available as an open access tool in Bioconductor. To illustrate the performance of the tool, we applied it to two independent datasets: 250 samples from patients with four major leukemia subtypes analysed using expression arrays, and another leukemia dataset analysed with RNA-Seq that includes a subtype also present in the previous set. The results show the selection of key deregulated genes recently reported in the literature and assigned to the leukemia subtypes studied. We also show, using these independent datasets, the selection of similar genes in a network built for the same disease subtype. CONCLUSIONS The construction of gene networks related to specific disease subtypes that include parameters such as gene-to-gene association, gene disease specificity and gene discriminant power can be very useful to draw gene-disease maps and to unravel the molecular features that characterize specific pathological states. The application of the bioinformatic tool presented here shows a neat way to achieve such molecular characterization of diseases using genome-wide expression data.
|
123
|
Ramyachitra D, Sofia M, Manikandan P. Interval-value Based Particle Swarm Optimization algorithm for cancer-type specific gene selection and sample classification. GENOMICS DATA 2015; 5:46-50. [PMID: 26484222 PMCID: PMC4583628 DOI: 10.1016/j.gdata.2015.04.027] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 04/08/2015] [Revised: 04/27/2015] [Accepted: 04/29/2015] [Indexed: 11/26/2022]
Abstract
Microarray technology allows simultaneous measurement of the expression levels of thousands of genes within a biological tissue sample. The fundamental power of microarrays lies in the ability to conduct parallel surveys of gene expression. The classification of tissue samples based on gene expression data is an important problem in the medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high compared to the number of data samples. The difficulty is therefore that the data are high-dimensional while the sample size is small. This research work addresses the problem by classifying the resulting dataset using existing algorithms such as Support Vector Machine (SVM), K-nearest neighbor (KNN), and Interval Valued Classification (IVC), and the improved Interval Value based Particle Swarm Optimization (IVPSO) algorithm. The results show that the IVPSO algorithm outperformed the other algorithms under several performance evaluation functions.
Affiliation(s)
- D Ramyachitra
- Department of Computer Science, Bharathiar University, Coimbatore 641046, India
| | - M Sofia
- Department of Computer Science, Bharathiar University, Coimbatore 641046, India
| | - P Manikandan
- Department of Computer Science, Bharathiar University, Coimbatore 641046, India
| |
|
124
|
Bermingham ML, Pong-Wong R, Spiliopoulou A, Hayward C, Rudan I, Campbell H, Wright AF, Wilson JF, Agakov F, Navarro P, Haley CS. Application of high-dimensional feature selection: evaluation for genomic prediction in man. Sci Rep 2015; 5:10312. [PMID: 25988841 PMCID: PMC4437376 DOI: 10.1038/srep10312] [Citation(s) in RCA: 126] [Impact Index Per Article: 12.6] [Received: 11/27/2014] [Accepted: 04/08/2015] [Indexed: 01/20/2023] Open
Abstract
In this study, we investigated the effect of five feature selection approaches on the performance of a mixed model (G-BLUP) and a Bayesian (Bayes C) prediction method. We predicted height, high density lipoprotein cholesterol (HDL) and body mass index (BMI) within 2,186 Croatian and into 810 UK individuals using genome-wide SNP data. Using all SNP information Bayes C and G-BLUP had similar predictive performance across all traits within the Croatian data, and for the highly polygenic traits height and BMI when predicting into the UK data. Bayes C outperformed G-BLUP in the prediction of HDL, which is influenced by loci of moderate size, in the UK data. Supervised feature selection of a SNP subset in the G-BLUP framework provided a flexible, generalisable and computationally efficient alternative to Bayes C; but careful evaluation of predictive performance is required when supervised feature selection has been used.
Affiliation(s)
- M. L. Bermingham
- MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh
| | - R. Pong-Wong
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh
| | - A. Spiliopoulou
- MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh
| | - C. Hayward
- MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh
| | - I. Rudan
- Centre for Population Health Sciences, University of Edinburgh
| | - H. Campbell
- Centre for Population Health Sciences, University of Edinburgh
| | - A. F. Wright
- MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh
| | - J. F. Wilson
- Centre for Population Health Sciences, University of Edinburgh
| | | | - P. Navarro
- MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh
| | - C. S. Haley
- MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh
| |
|
125
|
Yang S, Naiman DQ. Multiclass cancer classification based on gene expression comparison. Stat Appl Genet Mol Biol 2015; 13:477-96. [PMID: 24918456 DOI: 10.1515/sagmb-2013-0053] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 02/07/2023]
Abstract
As the complexity and heterogeneity of cancer is being increasingly appreciated through genomic analyses, microarray-based cancer classification comprising multiple discriminatory molecular markers is an emerging trend. Such multiclass classification problems pose new methodological and computational challenges for developing novel and effective statistical approaches. In this paper, we introduce a new approach for classifying multiple disease states associated with cancer based on gene expression profiles. Our method focuses on detecting small sets of genes in which the relative comparison of their expression values leads to class discrimination. For an m-class problem, the classification rule typically depends on a small number of m-gene sets, which provide transparent decision boundaries and allow for potential biological interpretations. We first test our approach on seven common gene expression datasets and compare it with popular classification methods including support vector machines and random forests. We then consider an extremely large cohort of leukemia cancer patients to further assess its effectiveness. In both experiments, our method yields results comparable to or even better than those of benchmark classifiers. In addition, we demonstrate that our approach can integrate pathway analysis of gene expression to provide accurate and biologically meaningful classification.
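To make the idea of classification by relative expression comparison in the abstract above more concrete, here is a minimal Python sketch of one simple rule of this kind. It is an assumption made for illustration and not necessarily the authors' rule: each class is tied to one gene of an m-gene set, and a sample is assigned to the class whose gene has the highest expression within that sample. All names and values are hypothetical.

```python
import math


def predict_by_comparison(sample, gene_set):
    """Return the class k whose gene, gene_set[k], is highest in this sample."""
    values = [sample[g] for g in gene_set]
    return max(range(len(gene_set)), key=lambda k: values[k])


def gene_set_accuracy(samples, labels, gene_set):
    """Fraction of samples that the comparison rule classifies correctly."""
    hits = sum(predict_by_comparison(s, gene_set) == y
               for s, y in zip(samples, labels))
    return hits / len(samples)


# Hypothetical expression profiles over three genes, two classes.
samples = [[5.0, 1.0, 3.0], [6.0, 2.0, 2.5], [1.0, 4.0, 7.0], [2.0, 3.0, 9.0]]
labels = [0, 0, 1, 1]
gene_set = (0, 2)  # gene 0 is tied to class 0, gene 2 to class 1

print(gene_set_accuracy(samples, labels, gene_set))

# The rule depends only on within-sample ordering, so a monotone
# transformation such as a log leaves every prediction unchanged.
logged = [[math.log(v) for v in s] for s in samples]
print(gene_set_accuracy(logged, labels, gene_set))
```

A rule of this form gives the transparent decision boundaries the abstract mentions, since each prediction can be read off as "gene A is expressed above gene B in this sample".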
|
126
|
Yu DJ, Li Y, Hu J, Yang X, Yang JY, Shen HB. Disulfide Connectivity Prediction Based on Modelled Protein 3D Structural Information and Random Forest Regression. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:611-621. [PMID: 26357272 DOI: 10.1109/tcbb.2014.2359451] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 06/05/2023]
Abstract
Disulfide connectivity is an important protein structural characteristic. Accurately predicting disulfide connectivity solely from protein sequence helps to improve the intrinsic understanding of protein structure and function, especially in the post-genome era, where a large volume of sequenced proteins without functional annotation is quickly accumulating. In this study, a new feature extracted from the predicted protein 3D structural information is proposed and integrated with traditional features to form discriminative features. Based on the extracted features, a random forest regression model is used to predict protein disulfide connectivity. We compare the proposed method with popular existing predictors by performing both cross-validation and independent validation tests on benchmark datasets. The experimental results demonstrate the superiority of the proposed method over existing predictors. We believe the superiority of the proposed method benefits from both the good discriminative capability of the newly developed features and the powerful modelling capability of the random forest. The web server implementation, called TargetDisulfide, and the benchmark datasets are freely available at: http://csbio.njust.edu.cn/bioinf/TargetDisulfide for academic use.
|
127
|
Lagani V, Chiarugi F, Thomson S, Fursse J, Lakasing E, Jones RW, Tsamardinos I. Development and validation of risk assessment models for diabetes-related complications based on the DCCT/EDIC data. J Diabetes Complications 2015; 29:479-87. [PMID: 25772254 DOI: 10.1016/j.jdiacomp.2015.03.001] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Received: 11/28/2014] [Revised: 02/10/2015] [Accepted: 03/01/2015] [Indexed: 01/28/2023]
Abstract
AIM To derive and validate a set of computational models able to assess the risk of developing complications and experiencing adverse events for patients with diabetes. The models are developed on data from the Diabetes Control and Complications Trial (DCCT) and the Epidemiology of Diabetes Interventions and Complications (EDIC) studies, and are validated on an external, retrospectively collected cohort. METHODS We selected fifty-one clinical parameters measured at baseline during the DCCT as potential risk factors for the following adverse outcomes: Cardiovascular Diseases (CVD), Hypoglycemia, Ketoacidosis, Microalbuminuria, Proteinuria, Neuropathy and Retinopathy. For each outcome we applied a data-mining analysis protocol in order to identify the best-performing signature, i.e., the smallest set of clinical parameters that, considered jointly, are maximally predictive for the selected outcome. The predictive models built on the selected signatures underwent both an internal validation on the DCCT/EDIC data and an external validation on a retrospective cohort of 393 diabetes patients (49 Type I and 344 Type II) from the Chorleywood Medical Center, UK. RESULTS The selected predictive signatures contain five to fifteen risk factors, depending on the specific outcome. Internal validation performances, as measured by the Concordance Index (CI), range from 0.62 to 0.83, indicating good predictive power. The models achieved comparable performances for the Type I and, quite surprisingly, Type II external cohort. CONCLUSIONS Data-mining analyses of the DCCT/EDIC data allow the identification of accurate predictive models for diabetes-related complications. We also present initial evidence that these models can be applied to a more recent, European population.
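The Concordance Index reported in the abstract above can be illustrated with a short Python sketch. It assumes the usual Harrell-style definition for right-censored data (a pair is comparable when the patient with the shorter follow-up actually experienced the event, and ties in predicted risk count as half); the function name and toy data are hypothetical, and the abstract does not specify the exact variant the authors computed.

```python
def concordance_index(times, events, risks):
    """Share of comparable patient pairs ranked correctly by predicted risk.

    times  : follow-up time for each patient
    events : 1 if the outcome was observed, 0 if the patient was censored
    risks  : predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Patient i must have had the event strictly before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


times = [1, 2, 3, 4]
events = [1, 1, 1, 0]   # the last patient is censored
print(concordance_index(times, events, [4, 3, 2, 1]))  # perfect ranking
print(concordance_index(times, events, [1, 1, 1, 1]))  # uninformative scores
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported range of 0.62 to 0.83 in context.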
Affiliation(s)
- Vincenzo Lagani
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece
| | - Franco Chiarugi
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece
| | - Shona Thomson
- Herts Valley Clinical Commission Group, Hertfordshire, United Kingdom
| | - Jo Fursse
- Chorleywood Health Center, Chorleywood, United Kingdom
| | - Edin Lakasing
- Chorleywood Health Center, Chorleywood, United Kingdom
| | | | - Ioannis Tsamardinos
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece; Computer Science Department, University of Crete, Heraklion, Greece
| |
|
128
|
Gong H, Klinger J, Damazyn K, Li X, Huang S. A novel procedure for statistical inference and verification of gene regulatory subnetwork. BMC Bioinformatics 2015; 16 Suppl 7:S7. [PMID: 25952938 PMCID: PMC4423581 DOI: 10.1186/1471-2105-16-s7-s7] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 12/13/2022] Open
Abstract
Background The reconstruction of gene regulatory networks from time course microarray data can help us comprehensively understand the biological system and discover the pathogenesis of cancer and other diseases. However, correctly and efficiently deciphering the gene regulatory network from high-throughput gene expression data is a big challenge due to the relatively small number of observations and the curse of dimensionality. Computational biologists have developed many statistical inference and machine learning algorithms to analyze microarray data. In previous studies, the correctness of an inferred regulatory network was checked manually by comparing it with a public database or an existing model. Results In this work, we present a novel procedure to automatically infer and verify gene regulatory networks from time series expression data. The dynamic Bayesian network, a statistical inference algorithm, is first used to infer an optimal network from time series microarray data of S. cerevisiae; then, a weighted symbolic model checker is applied to automatically verify or falsify the inferred network by checking some desired temporal logic formulas abstracted from experiments or a public database. Conclusions Our studies show that the marriage of a statistical inference algorithm with a model checking technique provides a more efficient way to automatically infer and verify the gene regulatory network from time series expression data than previous studies.
|
129
|
|
130
|
New feature selection for gene expression classification based on degree of class overlap in principal dimensions. Comput Biol Med 2015; 64:292-8. [PMID: 25712072 DOI: 10.1016/j.compbiomed.2015.01.022] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 05/19/2014] [Revised: 01/29/2015] [Accepted: 01/30/2015] [Indexed: 11/21/2022]
Abstract
Micro-array data are typically characterized by high dimensional features with a small number of samples. Several problems in identifying genes causing diseases from micro-array data can be transformed into the problem of classifying the features extracted from gene expression in micro-array data. However, too many features can cause low prediction accuracy as well as high computational complexity. Dimensional reduction is a method to eliminate irrelevant features to improve the prediction accuracy. Typically, the eigenvalues or dimensional data variance from principal component analysis are used as criteria to select relevant features. This approach is simple but not efficient, since it does not consider the degree of data overlap in each dimension of the feature space. A new method to select relevant features based on the degree of dimensional data overlap, with proper feature selection, was introduced. Furthermore, our study concentrated on small data sets, which commonly occur in practice. The experimental results indicated that this new approach can achieve substantially higher prediction accuracy when compared with other methods.
|
131
|
Motai Y. Kernel association for classification and prediction: a survey. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2015; 26:208-223. [PMID: 25029489 DOI: 10.1109/tnnls.2014.2333664] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Kernel association (KA) in statistical pattern recognition, used for classification and prediction, has recently emerged in a machine learning and signal processing context. This survey outlines the latest trends and innovations of a kernel framework for big data analysis. KA topics include offline learning, distributed databases, online learning, and its prediction. The structural presentation and the comprehensive list of references are geared to provide a useful overview of this evolving field for both specialists and relevant scholars.
|
132
|
Park JS, Choi SB, Chung JW, Kim SW, Kim DW. Classification of serous ovarian tumors based on microarray data using multicategory support vector machines. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2015; 2014:3430-3. [PMID: 25570728 DOI: 10.1109/embc.2014.6944360] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Ovarian cancer, the most fatal of reproductive cancers, is the fifth leading cause of death in women in the United States. Serous borderline ovarian tumors (SBOTs) are considered to be earlier or less malignant forms of serous ovarian carcinomas (SOCs). SBOTs are asymptomatic and progression to advanced stages is common. Using DNA microarray technology, we designed multicategory classification models to discriminate ovarian cancer subclasses. To develop multicategory classification models with optimal parameters and features, we systematically evaluated three machine learning algorithms and three feature selection methods using five-fold cross validation and a grid search. The study included 22 subjects with normal ovarian surface epithelial cells, 12 with SBOTs, and 79 with SOCs according to microarray data with 54,675 probe sets obtained from the National Center for Biotechnology Information gene expression omnibus repository. Application of the optimal model of support vector machines one-versus-rest with signal-to-noise as a feature selection method gave an accuracy of 97.3%, relative classifier information of 0.916, and a kappa index of 0.941. In addition, 5 features, including the expression of putative biomarkers SNTN and AOX1, were selected to differentiate between normal, SBOT, and SOC groups. An accurate diagnosis of ovarian tumor subclasses by application of multicategory machine learning would be cost-effective and simple to perform, and would ensure more effective subclass-targeted therapy.
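The abstract above names signal-to-noise as the feature selection method used with one-versus-rest classification, but gives no formula. The following is a minimal sketch of one common definition of the signal-to-noise score, (mean of the class minus mean of the rest) divided by the sum of the two standard deviations; whether the cited study used exactly this variant is an assumption. The function names `signal_to_noise` and `rank_features` and the toy data are hypothetical.

```python
import math
from statistics import mean, stdev


def signal_to_noise(values, labels, positive_class):
    """One-versus-rest signal-to-noise score for a single feature.

    Assumed definition: (mean_pos - mean_rest) / (sd_pos + sd_rest).
    Each group needs at least two samples so that stdev() is defined.
    """
    pos = [v for v, c in zip(values, labels) if c == positive_class]
    rest = [v for v, c in zip(values, labels) if c != positive_class]
    diff = mean(pos) - mean(rest)
    spread = stdev(pos) + stdev(rest)
    if spread == 0:
        # No within-group variation: the feature separates perfectly or not at all.
        return math.copysign(math.inf, diff) if diff != 0 else 0.0
    return diff / spread


def rank_features(rows, labels, positive_class):
    """Return feature indices sorted by decreasing absolute score."""
    n_features = len(rows[0])
    scores = [
        signal_to_noise([row[j] for row in rows], labels, positive_class)
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: abs(scores[j]), reverse=True)


# Hypothetical toy data: two probe sets measured on four samples.
rows = [[0.0, 5.0], [1.0, 5.1], [10.0, 4.9], [11.0, 5.2]]
labels = ["normal", "normal", "SOC", "SOC"]
ranking = rank_features(rows, labels, "SOC")
```

In a multicategory setting the ranking would be repeated once per class, with each class in turn treated as the positive class.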
|
133
|
Fernandes JA, Irigoien X, Lozano JA, Inza I, Goikoetxea N, Pérez A. Evaluating machine-learning techniques for recruitment forecasting of seven North East Atlantic fish species. ECOL INFORM 2015. [DOI: 10.1016/j.ecoinf.2014.11.004] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
134
|
Tiwari AK, Srivastava R. A survey of computational intelligence techniques in protein function prediction. INTERNATIONAL JOURNAL OF PROTEOMICS 2014; 2014:845479. [PMID: 25574395 PMCID: PMC4276698 DOI: 10.1155/2014/845479] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 09/10/2014] [Revised: 10/31/2014] [Accepted: 11/07/2014] [Indexed: 02/08/2023]
Abstract
In the past, knowledge of unknown proteins grew massively with the advancement of high-throughput microarray technologies. Protein function prediction is one of the most challenging problems in bioinformatics. Homology-based approaches were formerly used to predict protein function, but they failed when a new protein was different from previously characterized ones. Therefore, to alleviate the problems associated with traditional homology-based approaches, numerous computational intelligence techniques have been proposed in the recent past. This paper presents a state-of-the-art comprehensive review of various computational intelligence techniques for protein function prediction using sequence, structure, protein-protein interaction network, and gene expression data, with applications such as prediction of DNA and RNA binding sites, subcellular localization, enzyme functions, signal peptides, catalytic residues, nuclear/G-protein coupled receptors, membrane proteins, and pathway analysis from gene expression datasets. This paper also summarizes the results obtained by many researchers who addressed these problems using computational intelligence techniques with appropriate datasets to improve prediction performance. The summary shows that ensemble classifiers and integration of multiple heterogeneous data are useful for protein function prediction.
Affiliation(s)
- Arvind Kumar Tiwari
- Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
| | - Rajeev Srivastava
- Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
|
135
|
Lotfi E, Keshavarz A. Gene expression microarray classification using PCA–BEL. Comput Biol Med 2014; 54:180-7. [DOI: 10.1016/j.compbiomed.2014.09.008] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2014] [Revised: 09/13/2014] [Accepted: 09/16/2014] [Indexed: 01/15/2023]
|
136
|
Kerkentzes K, Lagani V, Tsamardinos I, Vyberg M, Røe OD. Hidden treasures in "ancient" microarrays: gene-expression portrays biology and potential resistance pathways of major lung cancer subtypes and normal tissue. Front Oncol 2014; 4:251. [PMID: 25325012 PMCID: PMC4178426 DOI: 10.3389/fonc.2014.00251] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2014] [Accepted: 09/02/2014] [Indexed: 11/22/2022] Open
Abstract
Objective: Novel statistical methods and increasingly more accurate gene annotations can transform “old” biological data into a renewed source of knowledge with potential clinical relevance. Here, we provide an in silico proof-of-concept by extracting novel information from a high-quality mRNA expression dataset, originally published in 2001, using state-of-the-art bioinformatics approaches. Methods: The dataset consists of histologically defined cases of lung adenocarcinoma (AD), squamous (SQ) cell carcinoma, small-cell lung cancer, carcinoid, metastasis (breast and colon AD), and normal lung specimens (203 samples in total). A battery of statistical tests was used for identifying differential gene expressions, diagnostic and prognostic genes, enriched gene ontologies, and signaling pathways. Results: Our results showed that gene expressions faithfully recapitulate immunohistochemical subtype markers, such as chromogranin A in carcinoids, cytokeratin 5 and p63 in SQ, and TTF1 in non-squamous types. Moreover, biological information with putative clinical relevance was revealed, such as potentially novel diagnostic genes for each subtype with specificity 93–100% (AUC = 0.93–1.00). Cancer subtypes were characterized by (a) differential expression of treatment target genes such as TYMS, HER2, and HER3 and (b) overrepresentation of treatment-related pathways like cell cycle, DNA repair, and ERBB pathways. The vascular smooth muscle contraction, leukocyte trans-endothelial migration, and actin cytoskeleton pathways were overexpressed in normal tissue. Conclusion: Reanalysis of this public dataset displayed the known biological features of lung cancer subtypes and revealed novel pathways of potentially clinical importance. The findings also support our hypothesis that even old omics data of high quality can be a source of significant biological information when appropriate bioinformatics methods are used.
Affiliation(s)
- Konstantinos Kerkentzes
- Department of Computer Science, University of Crete, Heraklion, Greece; Institute of Computer Science, Foundation of Research and Technology - Hellas, Heraklion, Greece
| | - Vincenzo Lagani
- Institute of Computer Science, Foundation of Research and Technology - Hellas, Heraklion, Greece
| | - Ioannis Tsamardinos
- Department of Computer Science, University of Crete, Heraklion, Greece; Institute of Computer Science, Foundation of Research and Technology - Hellas, Heraklion, Greece
| | - Mogens Vyberg
- Institute of Pathology, Aalborg University Hospital, Aalborg, Denmark
| | - Oluf Dimitri Røe
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway; Department of Oncology, Clinical Cancer Research Center, Aalborg University Hospital, Aalborg, Denmark; Cancer Clinic, Levanger Hospital, Nord-Trøndelag Health Trust, Levanger, Norway
|
137
|
Oliveira ON, Iost RM, Siqueira JR, Crespilho FN, Caseli L. Nanomaterials for diagnosis: challenges and applications in smart devices based on molecular recognition. ACS APPLIED MATERIALS & INTERFACES 2014; 6:14745-66. [PMID: 24968359 DOI: 10.1021/am5015056] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/25/2023]
Abstract
Clinical diagnosis has always been dependent on the efficient immobilization of biomolecules in solid matrices with preserved activity, but significant developments have taken place in recent years with the increasing control of molecular architecture in organized films. Of particular importance is the synergy achieved with distinct materials such as nanoparticles, antibodies, enzymes, and other nanostructures, forming structures organized on the nanoscale. In this review, emphasis will be placed on nanomaterials for biosensing based on molecular recognition, where the recognition element may be an enzyme, DNA, RNA, catalytic antibody, aptamer, and labeled biomolecule. All of these elements may be assembled in nanostructured films, whose layer-by-layer nature is essential for combining different properties in the same device. Sensing can be done with a number of optical, electrical, and electrochemical methods, which may also rely on nanostructures for enhanced performance, as is the case of reporting nanoparticles in bioelectronics devices. The successful design of such devices requires investigation of interface properties of functionalized surfaces, for which a variety of experimental and theoretical methods have been used. Because diagnosis involves the acquisition of large amounts of data, statistical and computational methods are now in widespread use, and one may envisage an integrated expert system where information from different sources may be mined to generate the diagnostics.
Affiliation(s)
- Osvaldo N Oliveira
- São Carlos Institute of Physics, University of São Paulo, CP 369, 13560-970 São Carlos, São Paulo, Brazil
|
138
|
Yu DJ, Hu J, Yan H, Yang XB, Yang JY, Shen HB. Enhancing protein-vitamin binding residues prediction by multiple heterogeneous subspace SVMs ensemble. BMC Bioinformatics 2014; 15:297. [PMID: 25189131 PMCID: PMC4261549 DOI: 10.1186/1471-2105-15-297] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2014] [Accepted: 08/18/2014] [Indexed: 11/10/2022] Open
Abstract
Background: Vitamins are typical ligands that play critical roles in various metabolic processes. The accurate identification of the vitamin-binding residues solely based on a protein sequence is of significant importance for the functional annotation of proteins, especially in the post-genomic era, when large volumes of protein sequences are accumulating quickly without being functionally annotated. Results: In this paper, a new predictor called TargetVita is designed and implemented for predicting protein-vitamin binding residues using protein sequences. In TargetVita, features derived from the position-specific scoring matrix (PSSM), predicted protein secondary structure, and vitamin binding propensity are combined to form the original feature space; then, several feature subspaces are selected by performing different feature selection methods. Finally, based on the selected feature subspaces, heterogeneous SVMs are trained and then ensembled for performing prediction. Conclusions: The experimental results obtained with four separate vitamin-binding benchmark datasets demonstrate that the proposed TargetVita is superior to the state-of-the-art vitamin-specific predictor, and an average improvement of 10% in terms of the Matthews correlation coefficient (MCC) was achieved over independent validation tests. The TargetVita web server and the datasets used are freely available for academic use at http://csbio.njust.edu.cn/bioinf/TargetVita or http://www.csbio.sjtu.edu.cn/bioinf/TargetVita. Electronic supplementary material: The online version of this article (doi:10.1186/1471-2105-15-297) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, China.
|
139
|
Learning a weighted meta-sample based parameter free sparse representation classification for microarray data. PLoS One 2014; 9:e104314. [PMID: 25115965 PMCID: PMC4130588 DOI: 10.1371/journal.pone.0104314] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2014] [Accepted: 07/07/2014] [Indexed: 11/19/2022] Open
Abstract
Sparse representation classification (SRC) is one of the most promising classification methods for supervised learning. This method can effectively exploit discriminating information by introducing a regularization term to the data. With the desirable property of sparsity, SRC is robust to both noise and outliers. In this study, we propose a weighted meta-sample based non-parametric sparse representation classification method for the accurate identification of tumor subtypes. The proposed method includes three steps. First, we extract the weighted meta-samples for each subclass from raw data, and the rationality of the weighting strategy is proven mathematically. Second, sparse representation coefficients are obtained by regularization of underdetermined linear equations, so that data-dependent sparsity can be adaptively tuned. Third, a simple characteristic function is used to achieve classification. Asymptotic time complexity analysis is applied to our method. Compared with some state-of-the-art classifiers, the proposed method has lower time complexity and more flexibility. Experiments on eight samples of publicly available gene expression profile data show the effectiveness of the proposed method.
|
140
|
Mao R, Raj Kumar PK, Guo C, Zhang Y, Liang C. Comparative analyses between retained introns and constitutively spliced introns in Arabidopsis thaliana using random forest and support vector machine. PLoS One 2014; 9:e104049. [PMID: 25110928 PMCID: PMC4128822 DOI: 10.1371/journal.pone.0104049] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2014] [Accepted: 07/06/2014] [Indexed: 01/04/2023] Open
Abstract
One of the important modes of pre-mRNA post-transcriptional modification is alternative splicing. Alternative splicing allows creation of many distinct mature mRNA transcripts from a single gene by utilizing different splice sites. In plants like Arabidopsis thaliana, the most common type of alternative splicing is intron retention. Many past studies have focused on the positional distribution of retained introns (RIs) among different genic regions and their expression regulation, while little systematic classification of RIs from constitutively spliced introns (CSIs) has been conducted using machine learning approaches. We used random forest and support vector machine (SVM) with radial basis kernel function (RBF) to differentiate these two types of introns in Arabidopsis. By comparing coordinates of introns of all annotated mRNAs from TAIR10, we obtained our high-quality experimental data. To distinguish RIs from CSIs, we investigated the unique characteristics of RIs in comparison with CSIs and finally extracted 37 quantitative features: local and global nucleotide sequence features of introns, frequent motifs, the signal strength of splice sites, and the similarity between sequences of introns and their flanking regions. We demonstrated that our proposed feature extraction approach was more accurate in classifying RIs from CSIs than four other approaches. The optimal penalty parameter C and the RBF kernel parameter in SVM were set based on a particle swarm optimization algorithm (PSOSVM). Our classification performance showed an F-Measure of 80.8% (random forest) and 77.4% (PSOSVM). Not only were the basic sequence features and positional distribution characteristics of RIs obtained, but putative regulatory motifs in intron splicing were also predicted based on our feature extraction approach. Our study should facilitate a better understanding of the underlying mechanisms involved in intron retention.
Affiliation(s)
- Rui Mao
- College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling, Shaanxi, China
- College of Information Engineering, Northwest A&F University, Yangling, Shaanxi, China
- Department of Biology, Miami University, Oxford, Ohio, United States of America
| | | | - Cheng Guo
- Department of Biology, Miami University, Oxford, Ohio, United States of America
| | - Yang Zhang
- College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling, Shaanxi, China
- College of Information Engineering, Northwest A&F University, Yangling, Shaanxi, China
- * E-mail: (YZ); (CL)
| | - Chun Liang
- Department of Biology, Miami University, Oxford, Ohio, United States of America
- Department of Computer Sciences and Software Engineering, Miami University, Oxford, Ohio, United States of America
- * E-mail: (YZ); (CL)
|
141
|
Mahmoud O, Harrison A, Perperoglou A, Gul A, Khan Z, Metodiev MV, Lausen B. A feature selection method for classification within functional genomics experiments based on the proportional overlapping score. BMC Bioinformatics 2014; 15:274. [PMID: 25113817 PMCID: PMC4141116 DOI: 10.1186/1471-2105-15-274] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2014] [Accepted: 08/01/2014] [Indexed: 11/16/2022] Open
Abstract
Background: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurements of thousands of genes within each sample. Both the prediction accuracy and interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called proportional overlapping score (POS), of a feature’s relevance to a classification task. Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The experimental results of classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves a better performance. Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the expression overlap across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes. Electronic supplementary material: The online version of this article (doi:10.1186/1471-2105-15-274) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Osama Mahmoud
- Department of Mathematical Sciences, University of Essex, Wivenhoe Park, CO4 3SQ Colchester, UK.
|
142
|
A discrete wavelet based feature extraction and hybrid classification technique for microarray data analysis. ScientificWorldJournal 2014; 2014:195470. [PMID: 25162043 PMCID: PMC4138760 DOI: 10.1155/2014/195470] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2014] [Revised: 06/20/2014] [Accepted: 07/02/2014] [Indexed: 11/18/2022] Open
Abstract
In the past, cancer classification by doctors and radiologists was based on morphological and clinical features and had limited diagnostic ability. The recent arrival of DNA microarray technology has enabled the concurrent monitoring of thousands of gene expressions on a single chip, which has stimulated progress in cancer classification. In this paper, we propose a hybrid approach for microarray data classification based on k-nearest neighbor (KNN), naive Bayes, and support vector machine (SVM). Feature selection prior to classification plays a vital role, and a feature selection technique that combines discrete wavelet transform (DWT) and moving window technique (MWT) is used. The performance of the proposed method is compared with conventional classifiers such as support vector machine, nearest neighbor, and naive Bayes. Experiments have been conducted on both real and benchmark datasets, and the results indicate that the ensemble approach produces higher classification accuracy than conventional classifiers. The proposed method can serve as an automated system for cancer classification that doctors can apply in real cases, to the benefit of the medical community. This work further reduces the misclassification of cancers, which is unacceptable in cancer detection.
|
143
|
Lin X, Gao J, Zhou L, Yin P, Xu G. A modified k-TSP algorithm and its application in LC-MS-based metabolomics study of hepatocellular carcinoma and chronic liver diseases. J Chromatogr B Analyt Technol Biomed Life Sci 2014; 966:100-8. [PMID: 24939728 DOI: 10.1016/j.jchromb.2014.05.044] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2013] [Revised: 05/19/2014] [Accepted: 05/21/2014] [Indexed: 01/10/2023]
Abstract
In systems biology, the ability to discern meaningful information that reflects the nature of related problems from large amounts of data has become a key issue. The classification method using top scoring pairs (TSP), which measures the features of a data set in pairs and selects the top ranked feature pairs to construct the classifier, has been a powerful tool in genomics data analysis because of its simplicity and interpretability. This study examined the relationship between two features, modified the ranking criteria of the k-TSP method to measure the discriminative ability of each feature pair more accurately, and correspondingly, provided an improved classification procedure. Tests on eight public data sets showed the validity of the modified method. This modified k-TSP method was applied to our serum metabolomics data derived from liquid chromatography-mass spectrometry analysis of hepatocellular carcinoma and chronic liver diseases. Based on the 27 selected feature pairs, HCC and chronic liver diseases were accurately distinguished using the principal component analysis, and certain profound metabolic disturbances related to liver disease development were revealed by the feature pairs.
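The abstract above describes scoring features in pairs and keeping the top-ranked pairs, without stating the ranking criterion. As a hedged illustration, the sketch below uses the classic top scoring pair score, the absolute difference between the two classes in the fraction of samples where feature i is smaller than feature j. This is the original criterion, not the modified one proposed in the cited study, and the sketch simply takes the k highest-scoring pairs without any further constraints that published variants may apply. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def tsp_score(rows, labels, i, j, class_a, class_b):
    """Classic TSP score: |P(x_i < x_j | class_a) - P(x_i < x_j | class_b)|."""

    def fraction_less(cls):
        members = [row for row, c in zip(rows, labels) if c == cls]
        return sum(row[i] < row[j] for row in members) / len(members)

    return abs(fraction_less(class_a) - fraction_less(class_b))


def top_k_pairs(rows, labels, k, class_a, class_b):
    """Return the k feature pairs with the highest TSP score."""
    n_features = len(rows[0])
    scored = [
        (tsp_score(rows, labels, i, j, class_a, class_b), (i, j))
        for i, j in combinations(range(n_features), 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:k]]


# Hypothetical toy data: three features measured on four samples.
rows = [[1, 2, 5], [1, 3, 5], [2, 1, 5], [3, 1, 5]]
labels = ["HCC", "HCC", "CLD", "CLD"]
best = top_k_pairs(rows, labels, 1, "HCC", "CLD")
```

Because the score depends only on the ordering of two features within each sample, it is unaffected by any monotone per-sample normalization, which is one reason the method is valued for simplicity and interpretability.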
Affiliation(s)
- Xiaohui Lin
- School of Computer Science & Technology, Dalian University of Technology, 116024 Dalian, China
| | - Jiuchong Gao
- School of Computer Science & Technology, Dalian University of Technology, 116024 Dalian, China
| | - Lina Zhou
- Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China
| | - Peiyuan Yin
- Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China
| | - Guowang Xu
- Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China.
|
144
|
Zhou W, Dickerson JA. A novel class dependent feature selection method for cancer biomarker discovery. Comput Biol Med 2014; 47:66-75. [DOI: 10.1016/j.compbiomed.2014.01.014] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2013] [Revised: 01/23/2014] [Accepted: 01/28/2014] [Indexed: 10/25/2022]
|
145
|
Wei X, Ai J, Deng Y, Guan X, Johnson DR, Ang CY, Zhang C, Perkins EJ. Identification of biomarkers that distinguish chemical contaminants based on gene expression profiles. BMC Genomics 2014; 15:248. [PMID: 24678894 PMCID: PMC4051169 DOI: 10.1186/1471-2164-15-248] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2013] [Accepted: 03/11/2014] [Indexed: 11/29/2022] Open
Abstract
Background: High-throughput transcriptomics profiles such as those generated using microarrays have been useful in identifying biomarkers for different classification and toxicity prediction purposes. Here, we investigated the use of microarrays to predict chemical toxicants and their possible mechanisms of action. Results: In this study, in vitro cultures of primary rat hepatocytes were exposed to 105 chemicals and vehicle controls, representing 14 compound classes. We comprehensively compared various gene expression normalization, feature selection, and classification algorithms for the classification of these 105 chemicals into 14 compound classes. We found that normalization had little effect on the averaged classification accuracy. Two support vector machine (SVM) methods, LibSVM and sequential minimal optimization, had better classification performance than other methods. SVM recursive feature selection (SVM-RFE) had the highest overfitting rate when an independent dataset was used for prediction. Therefore, we developed a new feature selection algorithm, called the gradient method, that had relatively high training classification and prediction accuracy with the lowest overfitting rate of the methods tested. Analysis of biomarkers that distinguished the 14 classes of compounds identified a group of genes principally involved in cell cycle function that were significantly downregulated by metal and inflammatory compounds, but were induced by anti-microbial, cancer related drugs, pesticides, and PXR mediators. Conclusions: Our results indicate that using microarrays and a supervised machine learning approach to predict chemical toxicants, their potential toxicity, and their mechanisms of action is practical and efficient. Choosing the right feature selection and classification algorithms for this multiple category classification and prediction is critical.
Affiliation(s)
| | | | - Youping Deng
- Department of Internal Medicine, Rush University Cancer Center, Rush University Medical Center, Kidston House, 630 S, Hermitage Ave, Room 408, Chicago, IL 60612, USA.
|
146
|
Buturovic L, Wong M, Tang GW, Altman RB, Petkovic D. High precision prediction of functional sites in protein structures. PLoS One 2014; 9:e91240. [PMID: 24632601 PMCID: PMC3954699 DOI: 10.1371/journal.pone.0091240] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2013] [Accepted: 02/11/2014] [Indexed: 11/29/2022] Open
Abstract
We address the problem of assigning biological function to solved protein structures. Computational tools play a critical role in identifying potential active sites and informing screening decisions for further lab analysis. A critical parameter in the practical application of computational methods is the precision, or positive predictive value. Precision measures the level of confidence the user should have in a particular computed functional assignment. Low precision annotations lead to futile laboratory investigations and waste scarce research resources. In this paper we describe an advanced version of the protein function annotation system FEATURE, which achieved 99% precision and average recall of 95% across 20 representative functional sites. The system uses a Support Vector Machine classifier operating on the microenvironment of physicochemical features around an amino acid. We also compared performance of our method with state-of-the-art sequence-level annotator Pfam in terms of precision, recall and localization. To our knowledge, no other functional site annotator has been rigorously evaluated against these key criteria. The software and predictive models are incorporated into the WebFEATURE service at http://feature.stanford.edu/wf4.0-beta.
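The abstract above defines precision as the positive predictive value and reports it alongside recall. As a small worked illustration of those two standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), the sketch below computes both from binary labels. The function name, the convention of returning 0.0 when a denominator is zero, and the toy labels are assumptions made here, not details from the cited study.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 marks a functional site.

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    When a denominator is zero the value is reported as 0.0 (a convention
    chosen for this sketch).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels: 3 true sites, 3 predicted sites, 2 in common.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
```

With these labels, one of the three predicted sites is a false positive, so a third of the follow-up laboratory work would be spent on a site that is not functional; this is the cost that low precision imposes.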
Affiliation(s)
- Ljubomir Buturovic
- Department of Computer Science, San Francisco State University, San Francisco, California, United States of America
- * E-mail:
| | - Mike Wong
- Center for Computing for Life Sciences, San Francisco State University, San Francisco, California, United States of America
| | - Grace W. Tang
- Department of Bioengineering, Stanford University, Stanford, California, United States of America
| | - Russ B. Altman
- Department of Bioengineering, Stanford University, Stanford, California, United States of America
| | - Dragutin Petkovic
- Department of Computer Science, San Francisco State University, San Francisco, California, United States of America
- Center for Computing for Life Sciences, San Francisco State University, San Francisco, California, United States of America
|
147
|
Aphinyanaphongs Y, Fu LD, Li Z, Peskin ER, Efstathiadis E, Aliferis CF, Statnikov A. A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization. J Assoc Inf Sci Technol 2014. [DOI: 10.1002/asi.23110] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Affiliation(s)
- Yindalon Aphinyanaphongs
- Center for Health Informatics and Bioinformatics; New York University Langone Medical Center; 227 East 30th Street New York NY 10016
- Department of Medicine; New York University School of Medicine; 550 First Avenue New York NY 10016
| | - Lawrence D. Fu
- Center for Health Informatics and Bioinformatics; New York University Langone Medical Center; 227 East 30th Street New York NY 10016
- Department of Medicine; New York University School of Medicine; 550 First Avenue New York NY 10016
| | - Zhiguo Li
- Center for Health Informatics and Bioinformatics; New York University Langone Medical Center; 227 East 30th Street New York NY 10016
| | - Eric R. Peskin
- Center for Health Informatics and Bioinformatics; New York University Langone Medical Center; 227 East 30th Street New York NY 10016
| | - Efstratios Efstathiadis
- Center for Health Informatics and Bioinformatics; New York University Langone Medical Center; 227 East 30th Street New York NY 10016
| | - Constantin F. Aliferis
- Center for Health Informatics and Bioinformatics; New York University Langone Medical Center; 227 East 30th Street New York NY 10016
- Department of Pathology; New York University School of Medicine; 550 First Avenue New York NY 10016
- Department of Biostatistics; Vanderbilt University; 1211 Medical Center Drive Nashville TN 37232
| | - Alexander Statnikov
- Center for Health Informatics and Bioinformatics; New York University Langone Medical Center; 227 East 30th Street New York NY 10016
- Department of Medicine; New York University School of Medicine; 550 First Avenue New York NY 10016
| |
|
148
|
Sparse representation for tumor classification based on feature extraction using latent low-rank representation. BIOMED RESEARCH INTERNATIONAL 2014; 2014:420856. [PMID: 24678505 PMCID: PMC3942202 DOI: 10.1155/2014/420856] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 11/13/2013] [Revised: 12/27/2013] [Accepted: 12/27/2013] [Indexed: 11/17/2022]
Abstract
Accurate tumor classification is crucial to the proper treatment of cancer. To date, sparse representation (SR) has shown strong performance for tumor classification. This paper proposes a new SR-based method for tumor classification using gene expression data. In the proposed method, we first use latent low-rank representation to extract salient features and remove noise from the original sample data. We then use a sparse representation classifier (SRC) to build the tumor classification model. Experimental results on several real-world data sets show that our method is more efficient and more effective than previous classification methods, including SVM, SRC, and LASSO.
|
149
|
Abstract
This study presents a machine learning approach to compressing microscopic medical images with Support Vector Machines. The visual cortex is the largest system in the human brain and is responsible for image processing tasks such as compression, because the eye does not necessarily perceive all the details of an image. Medical images are a valuable means of decision support. However, each examination produces a large number of images, which may be transmitted over a network or stored for several years as required by national law. To apply the reasoning of biological intelligence, this study uses Support Vector Machines to reduce the number of pixels in medical images, so that data can be transmitted in less time and stored in less space. The results obtained with this method are satisfactory in terms of compression, although the time required must be improved.
|
150
|
Anguita D, Ghio A, Oneto L, Ridella S. Unlabeled patterns to tighten Rademacher complexity error bounds for kernel classifiers. Pattern Recognit Lett 2014. [DOI: 10.1016/j.patrec.2013.04.027] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/16/2022]
|