1
Hawkins NT, Maldaver M, Yannakopoulos A, Guare LA, Krishnan A. Systematic tissue annotations of genomics samples by modeling unstructured metadata. Nat Commun 2022; 13:6736. [PMID: 36347858 PMCID: PMC9643451 DOI: 10.1038/s41467-022-34435-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/09/2021] [Accepted: 10/25/2022] [Indexed: 11/10/2022]
Abstract
There are currently >1.3 million human -omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto .
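As a rough illustration of the idea described in this abstract (numerical representations of free-text sample descriptions used as features for a supervised classifier), the sketch below uses bag-of-words counts and a nearest-centroid rule. These are stand-ins chosen for brevity, not the text representations or classifier used by NLP-ML or txt2onto; the training descriptions, tissue labels, and function names are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def featurize(text):
    """Turn a free-text description into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_descriptions):
    """Sum the word counts of all descriptions that share a tissue label."""
    centroids = defaultdict(Counter)
    for text, tissue in labeled_descriptions:
        centroids[tissue].update(featurize(text))
    return dict(centroids)


def predict(centroids, text):
    """Return the tissue whose centroid is most similar to the description."""
    features = featurize(text)
    return max(centroids, key=lambda tissue: cosine(features, centroids[tissue]))


# Hypothetical training metadata, made up for illustration only.
TRAINING = [
    ("hepatocytes isolated from adult liver biopsy", "liver"),
    ("liver tissue, hepatocellular carcinoma patient", "liver"),
    ("whole blood, peripheral blood mononuclear cells", "blood"),
    ("PBMC from peripheral blood of healthy donor", "blood"),
]

model = train(TRAINING)
prediction = predict(model, "primary hepatocytes from donor liver")
```

A real pipeline would swap in richer text representations and a properly validated classifier; the point here is only the shape of the workflow: text in, feature vector, predicted tissue term out.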
Affiliation(s)
- Nathaniel T Hawkins
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
- Marc Maldaver
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
- Anna Yannakopoulos
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
- Lindsay A Guare
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, 48824, USA
- Arjun Krishnan
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
2
Mancuso CA, Canfield JL, Singla D, Krishnan A. A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes. Nucleic Acids Res 2020; 48:e125. [PMID: 33074331 PMCID: PMC7708069 DOI: 10.1093/nar/gkaa881] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/09/2020] [Revised: 08/24/2020] [Accepted: 09/28/2020] [Indexed: 12/15/2022]
Abstract
While there are >2 million publicly-available human microarray gene-expression profiles, these profiles were measured using a variety of platforms that each cover a pre-defined, limited set of genes. Therefore, key to reanalyzing and integrating this massive data collection are methods that can computationally reconstitute the complete transcriptome in partially-measured microarray samples by imputing the expression of unmeasured genes. Current state-of-the-art imputation methods are tailored to samples from a specific platform and rely on gene-gene relationships regardless of the biological context of the target sample. We show that sparse regression models that capture sample-sample relationships (termed SampleLASSO), built on-the-fly for each new target sample to be imputed, outperform models based on fixed gene relationships. Extensive evaluation involving three machine learning algorithms (LASSO, k-nearest-neighbors, and deep-neural-networks), two gene subsets (GPL96–570 and LINCS), and multiple imputation tasks (within and across microarray/RNA-seq datasets) establishes that SampleLASSO is the most accurate model. Additionally, we demonstrate the biological interpretability of this method by showing that, for imputing a target sample from a certain tissue, SampleLASSO automatically leverages training samples from the same tissue. Thus, SampleLASSO is a simple, yet powerful and flexible approach for harmonizing large-scale gene-expression data.
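The sample-sample idea in this abstract can be sketched as follows: fit a sparse (L1-penalized) regression that expresses the target sample's measured genes as a weighted combination of training samples, then reuse those weights on the training samples' values for an unmeasured gene. This is a minimal pure-Python coordinate-descent sketch under stated assumptions, not the authors' SampleLASSO implementation; the toy expression values, the penalty `lam`, and all function names are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam (the L1 proximal step)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_weights(samples, target, lam=0.01, sweeps=100):
    """Coordinate descent for 0.5*||target - sum_j w_j*samples[j]||^2 + lam*||w||_1.

    samples: list of training samples, each a list of measured-gene values.
    target:  measured-gene values of the sample to impute.
    """
    weights = [0.0] * len(samples)
    for _ in range(sweeps):
        for j, sample_j in enumerate(samples):
            # Residual of the target with sample j's own contribution left out.
            partial = [
                t - sum(w * s[g] for k, (w, s) in enumerate(zip(weights, samples)) if k != j)
                for g, t in enumerate(target)
            ]
            rho = sum(x * r for x, r in zip(sample_j, partial))
            scale = sum(x * x for x in sample_j)
            weights[j] = soft_threshold(rho, lam) / scale if scale else 0.0
    return weights


def impute(weights, unmeasured_values):
    """Apply the per-sample weights to the training values of an unmeasured gene."""
    return sum(w * v for w, v in zip(weights, unmeasured_values))


# Toy data (made up): three training samples over four measured genes.
train_measured = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 3.0, 1.0, 0.0],
    [2.0, 1.0, 0.0, 3.0],
]
# Each training sample's value for one gene the target platform did not measure.
train_unmeasured = [5.0, 1.0, 9.0]
# The target matches the first training sample on the measured genes.
target_measured = [1.0, 0.0, 2.0, 1.0]

weights = lasso_weights(train_measured, target_measured)
estimate = impute(weights, train_unmeasured)
```

In this toy case all of the weight falls on the first training sample, which loosely mirrors the abstract's observation that the model leans on training samples from the same tissue as the target. Because the model is refit for each target sample, the selected training samples can differ from one target to the next.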
Affiliation(s)
- Christopher A Mancuso
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Jacob L Canfield
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Deepak Singla
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Indian Institute of Technology, Delhi, India
- Arjun Krishnan
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
3
Bernstein MN, Ma Z, Gleicher M, Dewey CN. CellO: comprehensive and hierarchical cell type classification of human cells with the Cell Ontology. iScience 2020; 24:101913. [PMID: 33364592 PMCID: PMC7753962 DOI: 10.1016/j.isci.2020.101913] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/30/2020] [Revised: 10/28/2020] [Accepted: 12/02/2020] [Indexed: 12/15/2022]
Abstract
Cell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing data. In this work, we present CellO, a machine learning-based tool for annotating human RNA-seq data with the Cell Ontology. CellO enables accurate and standardized cell type classification of cell clusters by considering the rich hierarchical structure of known cell types. Furthermore, CellO comes pre-trained on a comprehensive data set of human, healthy, untreated primary samples in the Sequence Read Archive. CellO's comprehensive training set enables it to run out of the box on diverse cell types and achieves competitive or even superior performance when compared to existing state-of-the-art methods. Lastly, CellO's linear models are easily interpreted, thereby enabling exploration of cell-type-specific expression signatures across the ontology. To this end, we also present the CellO Viewer: a web application for exploring CellO's models across the ontology.
Affiliation(s)
- Zhongjie Ma
  - Department of Computer Sciences, University of Wisconsin - Madison, Madison, WI 53706, USA
- Michael Gleicher
  - Department of Computer Sciences, University of Wisconsin - Madison, Madison, WI 53706, USA
- Colin N Dewey
  - Department of Computer Sciences, University of Wisconsin - Madison, Madison, WI 53706, USA
  - Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison, Madison, WI 53792, USA
4
Liu R, Mancuso CA, Yannakopoulos A, Johnson KA, Krishnan A. Supervised learning is an accurate method for network-based gene classification. Bioinformatics 2020; 36:3457-3465. [PMID: 32129827 PMCID: PMC7267831 DOI: 10.1093/bioinformatics/btaa150] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/26/2019] [Revised: 12/01/2019] [Accepted: 02/27/2020] [Indexed: 12/22/2022]
Abstract
Background Assigning every human gene to specific functions, diseases and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods, such as supervised learning and label propagation, that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine-learning technique across fields, supervised learning has been applied only in a few network-based studies for predicting pathway-, phenotype- or disease-associated genes. It is unknown how supervised learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label propagation, the widely benchmarked canonical approach for this problem. Results In this study, we present a comprehensive benchmarking of supervised learning for network-based gene classification, evaluating this approach and a classic label propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised learning on a gene’s full network connectivity outperforms label propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label propagation’s appeal for naturally using network topology. We further show that supervised learning on the full network is also superior to learning on node embeddings (derived using node2vec), an increasingly popular approach for concisely representing network connectivity. These results show that supervised learning is an accurate approach for prioritizing genes associated with diverse functions, diseases and traits and should be considered a staple of network-based gene classification workflows. Availability and implementation The datasets and the code used to reproduce the results and add new gene classification methods have been made freely available. Contact arjun@msu.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Renming Liu
  - Department of Computational Mathematics, Science and Engineering
- Kayla A Johnson
  - Department of Computational Mathematics, Science and Engineering
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Arjun Krishnan
  - Department of Computational Mathematics, Science and Engineering
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
  - To whom correspondence should be addressed.
5
Liu R, Mancuso CA, Yannakopoulos A, Johnson KA, Krishnan A. Supervised learning is an accurate method for network-based gene classification. Bioinformatics (Oxford, England) 2020; 36:3457-3465. [PMID: 32129827 DOI: 10.1101/721423] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/26/2019] [Revised: 12/01/2019] [Accepted: 02/27/2020] [Indexed: 05/26/2023]
Abstract
BACKGROUND Assigning every human gene to specific functions, diseases and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods, such as supervised learning and label propagation, that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine-learning technique across fields, supervised learning has been applied only in a few network-based studies for predicting pathway-, phenotype- or disease-associated genes. It is unknown how supervised learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label propagation, the widely benchmarked canonical approach for this problem. RESULTS In this study, we present a comprehensive benchmarking of supervised learning for network-based gene classification, evaluating this approach and a classic label propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised learning on a gene's full network connectivity outperforms label propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label propagation's appeal for naturally using network topology. We further show that supervised learning on the full network is also superior to learning on node embeddings (derived using node2vec), an increasingly popular approach for concisely representing network connectivity. These results show that supervised learning is an accurate approach for prioritizing genes associated with diverse functions, diseases and traits and should be considered a staple of network-based gene classification workflows. AVAILABILITY AND IMPLEMENTATION The datasets and the code used to reproduce the results and add new gene classification methods have been made freely available. CONTACT arjun@msu.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Renming Liu
  - Department of Computational Mathematics, Science and Engineering
- Kayla A Johnson
  - Department of Computational Mathematics, Science and Engineering
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Arjun Krishnan
  - Department of Computational Mathematics, Science and Engineering
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
6
Wang Z, Lachmann A, Ma'ayan A. Mining data and metadata from the gene expression omnibus. Biophys Rev 2019; 11:103-110. [PMID: 30594974 PMCID: PMC6381352 DOI: 10.1007/s12551-018-0490-8] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 10/24/2018] [Accepted: 12/04/2018] [Indexed: 12/16/2022]
Abstract
Publicly available gene expression datasets deposited in the Gene Expression Omnibus (GEO) are growing at an accelerating rate. Such datasets hold great value for knowledge discovery, particularly when integrated. Although numerous software platforms and tools have been developed to enable reanalysis and integration of individual, or groups, of GEO datasets, large-scale reuse of those datasets is impeded by minimal requirements for standardized metadata both at the study and sample levels as well as uniform processing of the data across studies. Here, we review methodologies developed to facilitate the systematic curation and processing of publicly available gene expression datasets from GEO. We identify trends for advanced metadata curation and summarize approaches for reprocessing the data within the entire GEO repository.
Affiliation(s)
- Zichen Wang
  - BD2K-LINCS Data Coordination and Integration Center; Knowledge Management Center for the Illuminating the Druggable Genome; Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, Box 1603, One Gustave L. Levy Place, New York, NY, 10029, USA
- Alexander Lachmann
  - BD2K-LINCS Data Coordination and Integration Center; Knowledge Management Center for the Illuminating the Druggable Genome; Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, Box 1603, One Gustave L. Levy Place, New York, NY, 10029, USA
- Avi Ma'ayan
  - BD2K-LINCS Data Coordination and Integration Center; Knowledge Management Center for the Illuminating the Druggable Genome; Mount Sinai Center for Bioinformatics, Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, Box 1603, One Gustave L. Levy Place, New York, NY, 10029, USA
7
Lee YS, Krishnan A, Oughtred R, Rust J, Chang CS, Ryu J, Kristensen VN, Dolinski K, Theesfeld CL, Troyanskaya OG. A Computational Framework for Genome-wide Characterization of the Human Disease Landscape. Cell Syst 2019; 8:152-162.e6. [PMID: 30685436 PMCID: PMC7374759 DOI: 10.1016/j.cels.2018.12.010] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 03/08/2018] [Revised: 10/16/2018] [Accepted: 12/20/2018] [Indexed: 01/21/2023]
Abstract
A key challenge for the diagnosis and treatment of complex human diseases is identifying their molecular basis. Here, we developed a unified computational framework, URSAHD (Unveiling RNA Sample Annotation for Human Diseases), that leverages machine learning and the hierarchy of anatomical relationships present among diseases to integrate thousands of clinical gene expression profiles and identify molecular characteristics specific to each of the hundreds of complex diseases. URSAHD can distinguish between closely related diseases more accurately than literature-validated genes or traditional differential-expression-based computational approaches and is applicable to any disease, including rare and understudied ones. We demonstrate the utility of URSAHD in classifying related nervous system cancers and experimentally verifying novel neuroblastoma-associated genes identified by URSAHD. We highlight the applications for potential targeted drug-repurposing and for quantitatively assessing the molecular response to clinical therapies. URSAHD is freely available for public use, including the use of underlying models, at ursahd.princeton.edu.
Affiliation(s)
- Young-Suk Lee
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
- Arjun Krishnan
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
  - Departments of Computational Mathematics, Science, and Engineering and Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
- Rose Oughtred
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Jennifer Rust
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Christie S Chang
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Joseph Ryu
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Vessela N Kristensen
  - Department of Genetics, Institute of Cancer Research, Oslo University Hospital, Radiumhospitalet, Oslo, Norway
  - Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
  - Department of Clinical Molecular Biology (EpiGen), Division of Medicine, Akershus University Hospital, Lørenskog, Norway
- Kara Dolinski
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Chandra L Theesfeld
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Olga G Troyanskaya
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
  - Flatiron Institute, Simons Foundation, New York, NY, USA
8
Giles CB, Brown CA, Ripperger M, Dennis Z, Roopnarinesingh X, Porter H, Perz A, Wren JD. ALE: automated label extraction from GEO metadata. BMC Bioinformatics 2017; 18:509. [PMID: 29297276 PMCID: PMC5751806 DOI: 10.1186/s12859-017-1888-1] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Indexed: 11/10/2022]
Abstract
Background NCBI’s Gene Expression Omnibus (GEO) is a rich community resource containing millions of gene expression experiments from human, mouse, rat, and other model organisms. However, information about each experiment (metadata) is in the format of an open-ended, non-standardized textual description provided by the depositor. Thus, classification of experiments for meta-analysis by factors such as gender, age of the sample donor, and tissue of origin is not feasible without assigning labels to the experiments. Automated approaches are preferable for this, primarily because of the size and volume of the data to be processed, but also because it ensures standardization and consistency. While some of these labels can be extracted directly from the textual metadata, many of the data available do not contain explicit text informing the researcher about the age and gender of the subjects with the study. To bridge this gap, machine-learning methods can be trained to use the gene expression patterns associated with the text-derived labels to refine label-prediction confidence. Results Our analysis shows only 26% of metadata text contains information about gender and 21% about age. In order to ameliorate the lack of available labels for these data sets, we first extract labels from the textual metadata for each GEO RNA dataset and evaluate the performance against a gold standard of manually curated labels. We then use machine-learning methods to predict labels, based upon gene expression of the samples and compare this to the text-based method. Conclusion Here we present an automated method to extract labels for age, gender, and tissue from textual metadata and GEO data using both a heuristic approach as well as machine learning. We show the two methods together improve accuracy of label assignment to GEO samples.
Affiliation(s)
- Cory B Giles
  - Arthritis & Clinical Immunology Program, Oklahoma Medical Research Foundation, 825 N.E. 13th Street, Oklahoma City, OK, 73104, USA
  - Department of Biochemistry and Molecular Biology, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
- Chase A Brown
  - Arthritis & Clinical Immunology Program, Oklahoma Medical Research Foundation, 825 N.E. 13th Street, Oklahoma City, OK, 73104, USA
- Zane Dennis
  - Department of Computer Science, Baylor University, Hankamer Academic Building, 105 Baylor Ave, Waco, TX, 76706, USA
- Xiavan Roopnarinesingh
  - Arthritis & Clinical Immunology Program, Oklahoma Medical Research Foundation, 825 N.E. 13th Street, Oklahoma City, OK, 73104, USA
- Hunter Porter
  - Arthritis & Clinical Immunology Program, Oklahoma Medical Research Foundation, 825 N.E. 13th Street, Oklahoma City, OK, 73104, USA
- Aleksandra Perz
  - Arthritis & Clinical Immunology Program, Oklahoma Medical Research Foundation, 825 N.E. 13th Street, Oklahoma City, OK, 73104, USA
- Jonathan D Wren
  - Arthritis & Clinical Immunology Program, Oklahoma Medical Research Foundation, 825 N.E. 13th Street, Oklahoma City, OK, 73104, USA
  - Department of Biochemistry and Molecular Biology, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
9
Schomburg I, Jeske L, Ulbrich M, Placzek S, Chang A, Schomburg D. The BRENDA enzyme information system–From a database to an expert system. J Biotechnol 2017; 261:194-206. [DOI: 10.1016/j.jbiotec.2017.04.020] [Citation(s) in RCA: 116] [Impact Index Per Article: 14.5] [Received: 02/16/2017] [Revised: 04/11/2017] [Accepted: 04/18/2017] [Indexed: 02/06/2023]
10
Abstract
Cross-species comparisons of genomes, transcriptomes and gene regulation are now feasible at unprecedented resolution and throughput, enabling the comparison of human and mouse biology at the molecular level. Insights have been gained into the degree of conservation between human and mouse at the level of not only gene expression but also epigenetics and inter-individual variation. However, a number of limitations exist, including incomplete transcriptome characterization and difficulties in identifying orthologous phenotypes and cell types, which are beginning to be addressed by emerging technologies. Ultimately, these comparisons will help to identify the conditions under which the mouse is a suitable model of human physiology and disease, and optimize the use of animal models.
Affiliation(s)
- Alessandra Breschi
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), 08002 Barcelona, Spain
- Thomas R Gingeras
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11742, USA
- Roderic Guigó
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), 08002 Barcelona, Spain
11
Amar D, Izraeli S, Shamir R. Utilizing somatic mutation data from numerous studies for cancer research: proof of concept and applications. Oncogene 2017; 36:3375-3383. [PMID: 28092680 PMCID: PMC5485176 DOI: 10.1038/onc.2016.489] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 08/12/2016] [Revised: 11/20/2016] [Accepted: 11/22/2016] [Indexed: 02/07/2023]
Abstract
Large cancer projects measure somatic mutations in thousands of samples, gradually assembling a catalog of recurring mutations in cancer. Many methods analyze these data jointly with auxiliary information with the aim of identifying subtype-specific results. Here, we show that somatic gene mutations alone can reliably and specifically predict cancer subtypes. Interpretation of the classifiers provides useful insights for several biomedical applications. We analyze the COSMIC database, which collects somatic mutations from The Cancer Genome Atlas (TCGA) as well as from many smaller scale studies. We use multi-label classification techniques and the Disease Ontology hierarchy in order to identify cancer subtype-specific biomarkers. Cancer subtype classifiers based on TCGA and the smaller studies have comparable performance, and the smaller studies add a substantial value in terms of validation, coverage of additional subtypes, and improved classification. The gene sets of the classifiers are used for threefold contribution. First, we refine the associations of genes to cancer subtypes and identify novel compelling candidate driver genes. Second, using our classifiers we successfully predict the primary site of metastatic samples. Third, we provide novel hypotheses regarding detection of subtype-specific synthetic lethality interactions. From the cancer research community perspective, our results suggest that curation efforts, such as COSMIC, have great added and complementary value even in the era of large international cancer projects.
Affiliation(s)
- D Amar
  - The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- S Izraeli
  - Department of Pediatric Hematology-Oncology, Safra Children’s Hospital, Sheba Medical Center, Tel Hashomer, Ramat Gan, Israel
  - Sackler School of Medicine, Tel Aviv University, Tel-Aviv, Israel
- R Shamir
  - The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
12
Yang G, Hu Z. Gene Feature Extraction Based on Nonnegative Dual Graph Regularized Latent Low-Rank Representation. BioMed Research International 2017; 2017:1096028. [PMID: 28466003 PMCID: PMC5390636 DOI: 10.1155/2017/1096028] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/21/2017] [Accepted: 03/13/2017] [Indexed: 01/16/2023]
Abstract
Aiming at the problem of gene expression profile's high redundancy and heavy noise, a new feature extraction model based on nonnegative dual graph regularized latent low-rank representation (NNDGLLRR) is presented on the basis of latent low-rank representation (Lat-LRR). By introducing dual graph manifold regularized constraint, the NNDGLLRR can keep the internal spatial structure of the original data effectively and improve the final clustering accuracy while segmenting the subspace. The introduction of nonnegative constraints makes the computation with some sparsity, which enhances the robustness of the algorithm. Different from Lat-LRR, a new solution model is adopted to simplify the computational complexity. The experimental results show that the proposed algorithm has good feature extraction performance for the heavy redundancy and noise gene expression profile, which, compared with LRR and Lat-LRR, can achieve better clustering accuracy.
Affiliation(s)
- Guoliang Yang
  - School of Electrical Engineering and Automation, Jiangxi University of Science and Technology, Ganzhou 341000, China
- Zhengwei Hu
  - School of Electrical Engineering and Automation, Jiangxi University of Science and Technology, Ganzhou 341000, China
13
Angeles-Albores D, Lee RYN, Chan J, Sternberg PW. Tissue enrichment analysis for C. elegans genomics. BMC Bioinformatics 2016; 17:366. [PMID: 27618863 PMCID: PMC5020436 DOI: 10.1186/s12859-016-1229-9] [Citation(s) in RCA: 110] [Impact Index Per Article: 12.2] [Received: 04/30/2016] [Accepted: 08/26/2016] [Indexed: 01/04/2023]
Abstract
Background Over the last ten years, there has been explosive development in methods for measuring gene expression. These methods can identify thousands of genes altered between conditions, but understanding these datasets and forming hypotheses based on them remains challenging. One way to analyze these datasets is to associate ontologies (hierarchical, descriptive vocabularies with controlled relations between terms) with genes and to look for enrichment of specific terms. Although Gene Ontology (GO) is available for Caenorhabditis elegans, it does not include anatomical information. Results We have developed a tool for identifying enrichment of C. elegans tissues among gene sets and generated a website GUI where users can access this tool. Since a common drawback to ontology enrichment analyses is its verbosity, we developed a very simple filtering algorithm to reduce the ontology size by an order of magnitude. We adjusted these filters and validated our tool using a set of 30 gold standards from Expression Cluster data in WormBase. We show our tool can even discriminate between embryonic and larval tissues and can even identify tissues down to the single-cell level. We used our tool to identify multiple neuronal tissues that are down-regulated due to pathogen infection in C. elegans. Conclusions Our Tissue Enrichment Analysis (TEA) can be found within WormBase, and can be downloaded using Python’s standard pip installer. It tests a slimmed-down C. elegans tissue ontology for enrichment of specific terms and provides users with a text and graphic representation of the results. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1229-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- David Angeles-Albores
  - HHMI and California Institute of Technology, Division of Biology and Biological Engineering, 1200 E California Blvd, Pasadena, 91125, USA
- Raymond Y N Lee
  - HHMI and California Institute of Technology, Division of Biology and Biological Engineering, 1200 E California Blvd, Pasadena, 91125, USA
- Juancarlos Chan
  - HHMI and California Institute of Technology, Division of Biology and Biological Engineering, 1200 E California Blvd, Pasadena, 91125, USA
- Paul W Sternberg
  - HHMI and California Institute of Technology, Division of Biology and Biological Engineering, 1200 E California Blvd, Pasadena, 91125, USA

14
He F, Yoo S, Wang D, Kumari S, Gerstein M, Ware D, Maslov S. Large-scale atlas of microarray data reveals the distinct expression landscape of different tissues in Arabidopsis. The Plant Journal: For Cell and Molecular Biology 2016; 86:472-480. [PMID: 27015116 DOI: 10.1111/tpj.13175] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 10/16/2015] [Revised: 02/24/2016] [Accepted: 03/21/2016] [Indexed: 06/05/2023]
Abstract
Transcriptome data sets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples has become routine in the plant research community, it is often hampered by a lack of metadata or differences in annotation styles of different labs. In this study, we carefully selected and integrated 6057 Arabidopsis microarray expression samples from 304 experiments deposited to the Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI). Metadata such as tissue type, growth conditions and developmental stage were manually curated for each sample. We then studied the global expression landscape of the integrated data set and found that samples of the same tissue tend to be more similar to each other than to samples of other tissues, even in different growth conditions or developmental stages. Root has the most distinct transcriptome, compared with aerial tissues, but the transcriptome of cultured root is more similar to the transcriptome of aerial tissues, as the cultured root samples lost their cellular identity. Using a simple computational classification method, we showed that the tissue type of a sample can be successfully predicted based on its expression profile, opening the door for automatic metadata extraction and facilitating the re-use of plant transcriptome data. As a proof of principle, we applied our automated annotation pipeline to 708 RNA-seq samples from public repositories and verified the accuracy of our predictions with sample metadata provided by the authors.
Affiliation(s)
- Fei He
  - Biology Department, Brookhaven National Laboratory, Upton, NY, 11973, USA
- Shinjae Yoo
  - Computational Science Center, Brookhaven National Laboratory, Upton, NY, 11973, USA
  - Institute of Advanced Computational Science at Stony Brook University, Stony Brook, NY, 11794, USA
- Daifeng Wang
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA
- Sunita Kumari
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 17724, USA
- Mark Gerstein
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA
- Doreen Ware
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 17724, USA
  - USDA ARS NEA Plant, Soil & Nutrition Laboratory Research Unit, USDA-ARS, Ithaca, NY, 14853, USA
- Sergei Maslov
  - Biology Department, Brookhaven National Laboratory, Upton, NY, 11973, USA
  - Department of Bioengineering, Carl R. Woese Institute for Genomic Biology, Urbana, IL, 61801, USA
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA

15
Li WV, Razaee ZS, Li JJ. Epigenome overlap measure (EPOM) for comparing tissue/cell types based on chromatin states. BMC Genomics 2016; 17 Suppl 1:10. [PMID: 26817822 PMCID: PMC4895267 DOI: 10.1186/s12864-015-2303-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 12/21/2022] Open
Abstract
Background: The dynamics of epigenomic marks in their relevant chromatin states regulate distinct gene expression patterns, biological functions and phenotypic variations in biological processes. The availability of high-throughput epigenomic data generated by next-generation sequencing technologies allows a data-driven approach to evaluate the similarities and differences of diverse tissue and cell types in terms of epigenomic features. While ChromImpute has allowed for the imputation of large-scale epigenomic information to yield more robust data to capture meaningful relationships between biological samples, widely used methods such as hierarchical clustering and correlation analysis cannot adequately utilize epigenomic data to accurately reveal the distinction and grouping of different tissue and cell types.
Methods: We utilize a three-step testing procedure (ANOVA, t test and overlap test) to identify tissue/cell-type-associated enhancers and promoters and to calculate a newly defined Epigenomic Overlap Measure (EPOM). EPOM results in a clear correspondence map of biological samples from different tissue and cell types through comparison of epigenomic marks evaluated in their relevant chromatin states.
Results: Correspondence maps by EPOM show strong capability in distinguishing and grouping different tissue and cell types and reveal biologically meaningful similarities between Heart and Muscle, Blood & T-cell and HSC & B-cell, Brain and Neurosphere, etc. The gene ontology enrichment analysis both supports and explains the discoveries made by EPOM and suggests that the associated enhancers and promoters demonstrate distinguishable functions across tissue and cell types. Moreover, the tissue/cell-type-associated enhancers and promoters show enrichment in the disease-related SNPs that are also associated with the corresponding tissue or cell types. This agreement suggests the potential of identifying causal genetic variants relevant to cell-type-specific diseases from our identified associated enhancers and promoters.
Conclusions: The proposed EPOM measure demonstrates superior capability in grouping and finding a clear correspondence map of biological samples from different tissue and cell types. The identified associated enhancers and promoters provide a comprehensive catalog to study distinct biological processes and disease variants in different tissue and cell types. Our results also find that the associated promoters exhibit more cell-type-specific functions than the associated enhancers do, suggesting that the non-associated promoters have more housekeeping functions than the non-associated enhancers.
Affiliation(s)
- Wei Vivian Li
  - Department of Statistics, 8125 Math Sciences Bldg., University of California, Los Angeles, CA, 90095-1554, USA
- Zahra S Razaee
  - Department of Statistics, 8125 Math Sciences Bldg., University of California, Los Angeles, CA, 90095-1554, USA
- Jingyi Jessica Li
  - Department of Statistics, 8125 Math Sciences Bldg., University of California, Los Angeles, CA, 90095-1554, USA
  - Department of Human Genetics, University of California, Los Angeles, CA, 90095-7088, USA

16
Sokolov A, Paull EO, Stuart JM. One-class detection of cell states in tumor subtypes. Pacific Symposium on Biocomputing 2016; 21:405-16. [PMID: 26776204 PMCID: PMC4856035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
The cellular composition of a tumor greatly influences the growth, spread, immune activity, drug response, and other aspects of the disease. Tumor cells are usually comprised of a heterogeneous mixture of subclones, each of which could contain their own distinct character. The presence of minor subclones poses a serious health risk for patients as any one of them could harbor a fitness advantage with respect to the current treatment regimen, fueling resistance. It is therefore vital to accurately assess the make-up of cell states within a tumor biopsy. Transcriptome-wide assays from RNA sequencing provide key data from which cell state signatures can be detected. However, the challenge is to find them within samples containing mixtures of cell types of unknown proportions. We propose a novel one-class method based on logistic regression and show that its performance is competitive to two established SVM-based methods for this detection task. We demonstrate that one-class models are able to identify specific cell types in heterogeneous cell populations better than their binary predictor counterparts. We derive one-class predictors for the major breast and bladder subtypes and reaffirm the connection between these two tissues. In addition, we use a one-class predictor to quantitatively associate an embryonic stem cell signature with an aggressive breast cancer subtype that reveals shared stemness pathways potentially important for treatment.
Affiliation(s)
- Artem Sokolov
  - Department of Biomolecular Engineering, University of California Santa Cruz
- Evan O. Paull
  - Department of Biomolecular Engineering, University of California Santa Cruz
- Joshua M. Stuart
  - Department of Biomolecular Engineering, University of California Santa Cruz

17
Ramírez-Gordillo D, Powers TR, van Velkinburgh JC, Trujillo-Provencio C, Schilkey F, Serrano EE. RNA-Seq and microarray analysis of the Xenopus inner ear transcriptome discloses orthologous OMIM® genes for hereditary disorders of hearing and balance. BMC Res Notes 2015; 8:691. [PMID: 26582541 PMCID: PMC4652436 DOI: 10.1186/s13104-015-1485-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 03/11/2015] [Accepted: 09/21/2015] [Indexed: 12/14/2022] Open
Abstract
Background: Auditory and vestibular disorders are prevalent sensory disabilities caused by genetic and environmental (noise, trauma, chemicals) factors that often damage mechanosensory hair cells of the inner ear. Development of treatments for inner ear disorders of hearing and balance relies on the use of animal models such as fish, amphibians, reptiles, birds, and non-human mammals. Here, we aimed to augment the utility of the genus Xenopus for uncovering genetic mechanisms essential for the maintenance of inner ear structure and function.
Results: Using Affymetrix GeneChip® X. laevis Genome 2.0 Arrays and Illumina-Solexa sequencing methods, we determined that the transcriptional profile of the Xenopus laevis inner ear comprises hundreds of genes that are orthologous to OMIM® genes implicated in deafness and vestibular disorders in humans. Analysis of genes that mapped to both technologies demonstrated that, with our methods, a combination of microarray and RNA-Seq detected expression of more genes than either platform alone.
Conclusions: As part of this study we identified candidate scaffold regions of the Xenopus tropicalis genome that can be used to investigate hearing and balance using genetic and informatics procedures that are available through the National Xenopus Resource (NXR), and the open access data repository, Xenbase. The results and approaches presented here expand the viability of Xenopus as an animal model for inner ear research.
Affiliation(s)
- TuShun R Powers
  - Biology Department, New Mexico State University (NMSU), Las Cruces, NM, 88003, USA
- Faye Schilkey
  - National Center for Genome Resources (NCGR), Santa Fe, NM, 87505, USA
- Elba E Serrano
  - Biology Department, New Mexico State University (NMSU), Las Cruces, NM, 88003, USA

18
Amar D, Hait T, Izraeli S, Shamir R. Integrated analysis of numerous heterogeneous gene expression profiles for detecting robust disease-specific biomarkers and proposing drug targets. Nucleic Acids Res 2015; 43:7779-89. [PMID: 26261215 PMCID: PMC4652780 DOI: 10.1093/nar/gkv810] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 03/22/2015] [Revised: 07/23/2015] [Accepted: 07/29/2015] [Indexed: 12/18/2022] Open
Abstract
Genome-wide expression profiling has revolutionized biomedical research; vast amounts of expression data from numerous studies of many diseases are now available. Making the best use of this resource in order to better understand disease processes and treatment remains an open challenge. In particular, disease biomarkers detected in case-control studies suffer from low reliability and are only weakly reproducible. Here, we present a systematic integrative analysis methodology to overcome these shortcomings. We assembled and manually curated more than 14,000 expression profiles spanning 48 diseases and 18 expression platforms. We show that when studying a particular disease, judicious utilization of profiles from other diseases and information on disease hierarchy improves classification quality, avoids overoptimistic evaluation of that quality, and enhances disease-specific biomarker discovery. This approach yielded specific biomarkers for 24 of the analyzed diseases. We demonstrate how to combine these biomarkers with large-scale interaction, mutation and drug target data, forming a highly valuable disease summary that suggests novel directions in disease understanding and drug repurposing. Our analysis also estimates the number of samples required to reach a desired level of biomarker stability. This methodology can greatly improve the exploitation of the mountain of expression profiles for better disease analysis.
Affiliation(s)
- David Amar
  - The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
- Tom Hait
  - The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
- Shai Izraeli
  - Department of Pediatric Hematology-Oncology, Safra Children's Hospital, Sheba Medical Center, Tel Hashomer, Ramat Gan 52620, Israel
  - Sackler School of Medicine, Tel-Aviv University, Tel Aviv 69978, Israel
- Ron Shamir
  - The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel

19
Kim M, Zorraquino V, Tagkopoulos I. Microbial forensics: predicting phenotypic characteristics and environmental conditions from large-scale gene expression profiles. PLoS Comput Biol 2015; 11:e1004127. [PMID: 25774498 PMCID: PMC4361189 DOI: 10.1371/journal.pcbi.1004127] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 05/26/2014] [Accepted: 01/14/2015] [Indexed: 01/13/2023] Open
Abstract
A tantalizing question in cellular physiology is whether the cellular state and environmental conditions can be inferred by the expression signature of an organism. To investigate this relationship, we created an extensive normalized gene expression compendium for the bacterium Escherichia coli that was further enriched with meta-information through an iterative learning procedure. We then constructed an ensemble method to predict environmental and cellular state, including strain, growth phase, medium, oxygen level, antibiotic and carbon source presence. Results show that gene expression is an excellent predictor of environmental structure, with multi-class ensemble models achieving balanced accuracy between 70.0% (±3.5%) and 98.3% (±2.3%) for the various characteristics. Interestingly, this performance can be significantly boosted when environmental and strain characteristics are simultaneously considered, as a composite classifier that captures the inter-dependencies of three characteristics (medium, phase and strain) achieved 10.6% (±1.0%) higher performance than any individual model. Contrary to expectations, only 59% of the top informative genes were also identified as differentially expressed under the respective conditions. Functional analysis of the respective genetic signatures implicates a wide spectrum of Gene Ontology terms and KEGG pathways with condition-specific information content, including iron transport, transferases, and enterobactin synthesis. Further experimental phenotypic-to-genotypic mapping that we conducted for knock-out mutants argues for the information content of top-ranked genes. This work demonstrates the degree to which genome-scale transcriptional information can be predictive of latent, heterogeneous and seemingly disparate phenotypic and environmental characteristics, with far-reaching applications.
The transcriptional profile of an organism contains clues about the environmental context in which it has evolved and currently lives, its behavior and cellular state. It is yet unclear, however, how much information can be efficiently extracted and how it can be used to classify new samples with respect to their environmental and genetic characteristics. Here, we have constructed an extensive transcriptome compendium of Escherichia coli that we have further enriched via an iterative learning approach. We then apply an ensemble of various machine learning algorithms to infer environmental and cellular information such as strain, growth phase, medium, oxygen level, antibiotic and carbon source. Functional analysis of the most informative genes provides mechanistic insights and palpable hypotheses regarding their role in each environmental or genetic context. Our work argues that genome-scale gene expression can be a multi-purpose marker for identifying latent, heterogeneous cellular and environmental states and that optimal classification can be achieved with a feature set of a couple hundred genes that might not necessarily have the most pronounced differential expression in the respective conditions.
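The abstract reports performance as balanced accuracy without giving the evaluation details, so the sketch below only illustrates the standard definition of the metric: the unweighted mean of per-class recall, which keeps a frequent class from dominating the score. The function name and the toy growth-phase labels are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example: 8 "exponential" and 2 "stationary" samples.
y_true = ["exponential"] * 8 + ["stationary"] * 2
y_pred = ["exponential"] * 8 + ["exponential", "stationary"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
```

Here plain accuracy is 0.9 because the majority class is predicted perfectly, while balanced accuracy is 0.75 because only half of the minority class is recovered.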
Affiliation(s)
- Minseung Kim
  - Department of Computer Science, University of California, Davis, Davis, California, United States of America
  - UC Davis Genome Center, University of California, Davis, Davis, California, United States of America
- Violeta Zorraquino
  - UC Davis Genome Center, University of California, Davis, Davis, California, United States of America
- Ilias Tagkopoulos
  - Department of Computer Science, University of California, Davis, Davis, California, United States of America
  - UC Davis Genome Center, University of California, Davis, Davis, California, United States of America

20
Sebestyén E, Zawisza M, Eyras E. Detection of recurrent alternative splicing switches in tumor samples reveals novel signatures of cancer. Nucleic Acids Res 2015; 43:1345-56. [PMID: 25578962 PMCID: PMC4330360 DOI: 10.1093/nar/gku1392] [Citation(s) in RCA: 137] [Impact Index Per Article: 13.7] [Indexed: 02/07/2023] Open
Abstract
The determination of the alternative splicing isoforms expressed in cancer is fundamental for the development of tumor-specific molecular targets for prognosis and therapy, but it is hindered by the heterogeneity of tumors and the variability across patients. We developed a new computational method, robust to biological and technical variability, which identifies significant transcript isoform changes across multiple samples. We applied this method to more than 4000 samples from The Cancer Genome Atlas project to obtain novel splicing signatures that are predictive for nine different cancer types, and find a specific signature for basal-like breast tumors involving the tumor-driver CTNND1. Additionally, our method identifies 244 isoform switches, for which the change occurs in the most abundant transcript. Some of these switches occur in known tumor drivers, including PPARG, CCND3, RALGDS, MITF, PRDM1, ABI1 and MYH11, for which the switch implies a change in the protein product. Moreover, some of the switches cannot be described with simple splicing events. Surprisingly, isoform switches are independent of somatic mutations, except for the tumor-suppressor FBLN2 and the oncogene MYH11. Our method reveals novel signatures of cancer in terms of transcript isoforms specifically expressed in tumors, providing novel potential molecular targets for prognosis and therapy. Data and software are available at: http://dx.doi.org/10.6084/m9.figshare.1061917 and https://bitbucket.org/regulatorygenomicsupf/iso-ktsp.
Affiliation(s)
- Endre Sebestyén
  - Computational Genomics, Universitat Pompeu Fabra, Dr. Aiguader 88, E08003 Barcelona, Spain
- Michał Zawisza
  - Universitat Politècnica de Catalunya, Jordi Girona 1-3, E08034 Barcelona, Spain
- Eduardo Eyras
  - Computational Genomics, Universitat Pompeu Fabra, Dr. Aiguader 88, E08003 Barcelona, Spain
  - Catalan Institution for Research and Advanced Studies, Passeig Lluís Companys 23, E08010 Barcelona, Spain

21
Sparse representation for tumor classification based on feature extraction using latent low-rank representation. BioMed Research International 2014; 2014:420856. [PMID: 24678505 PMCID: PMC3942202 DOI: 10.1155/2014/420856] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 11/13/2013] [Revised: 12/27/2013] [Accepted: 12/27/2013] [Indexed: 11/17/2022]
Abstract
Accurate tumor classification is crucial to the proper treatment of cancer. To date, sparse representation (SR) has shown strong performance for tumor classification. This paper proposes a new SR-based method for tumor classification using gene expression data. In the proposed method, we first use latent low-rank representation to extract salient features and remove noise from the original sample data. We then use a sparse representation classifier (SRC) to build a tumor classification model. The experimental results on several real-world data sets show that our method is more efficient and more effective than previous classification methods, including SVM, SRC, and LASSO.