51
Zhang X, Zhou Z, Xu H, Liu CT. Integrative clustering methods for multi-omics data. Wiley Interdisciplinary Reviews: Computational Statistics 2022; 14. [PMID: 35573155 PMCID: PMC9097984 DOI: 10.1002/wics.1553] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/06/2022]
Abstract
Integrative analysis of multi-omics data has drawn much attention from the scientific community due to the technological advancements which have generated various omics data. Leveraging these multi-omics data potentially provides a more comprehensive view of the disease mechanism or biological processes. Integrative multi-omics clustering is an unsupervised integrative method specifically used to find coherent groups of samples or features by utilizing information across multi-omics data. It aims to better stratify diseases and to suggest biological mechanisms and potential targeted therapies for the diseases. However, applying integrative multi-omics clustering is both statistically and computationally challenging due to various reasons such as high dimensionality and heterogeneity. In this review, we summarized integrative multi-omics clustering methods into three general categories: concatenated clustering, clustering of clusters, and interactive clustering based on when and how the multi-omics data are processed for clustering. We further classified the methods into different approaches under each category based on the main statistical strategy used during clustering. In addition, we have provided recommended practices tailored to four real-life scenarios to help researchers to strategize their selection in integrative multi-omics clustering methods for their future studies.
Affiliation(s)
- Xiaoyu Zhang
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
- Zhenwei Zhou
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
- Hanfei Xu
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
- Ching-Ti Liu
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
52
Yuan D, Gaynanova I. Double-matched matrix decomposition for multi-view data. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2022.2067860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
53
Vahabi N, Michailidis G. Unsupervised Multi-Omics Data Integration Methods: A Comprehensive Review. Front Genet 2022; 13:854752. [PMID: 35391796 PMCID: PMC8981526 DOI: 10.3389/fgene.2022.854752] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 01/14/2022] [Accepted: 02/28/2022] [Indexed: 12/26/2022]
Abstract
Through the developments of Omics technologies and dissemination of large-scale datasets, such as those from The Cancer Genome Atlas, Alzheimer’s Disease Neuroimaging Initiative, and Genotype-Tissue Expression, it is becoming increasingly possible to study complex biological processes and disease mechanisms more holistically. However, to obtain a comprehensive view of these complex systems, it is crucial to integrate data across various Omics modalities, and also leverage external knowledge available in biological databases. This review aims to provide an overview of multi-Omics data integration methods with different statistical approaches, focusing on unsupervised learning tasks, including disease onset prediction, biomarker discovery, disease subtyping, module discovery, and network/pathway analysis. We also briefly review feature selection methods, multi-Omics data sets, and resources/tools that constitute critical components for carrying out the integration.
Affiliation(s)
- Nasim Vahabi
  - Informatics Institute, University of Florida, Gainesville, FL, United States
- George Michailidis
  - Informatics Institute, University of Florida, Gainesville, FL, United States
54
Qian K, Fu S, Li H, Li WV. scINSIGHT for interpreting single-cell gene expression from biologically heterogeneous data. Genome Biol 2022; 23:82. [PMID: 35313930 PMCID: PMC8935111 DOI: 10.1186/s13059-022-02649-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 05/30/2021] [Accepted: 03/07/2022] [Indexed: 12/30/2022]
Abstract
The increasing number of scRNA-seq data emphasizes the need for integrative analysis to interpret similarities and differences between single-cell samples. Although different batch effect removal methods have been developed, none are suitable for heterogeneous single-cell samples coming from multiple biological conditions. We propose a method, scINSIGHT, to learn coordinated gene expression patterns that are common among, or specific to, different biological conditions, and identify cellular identities and processes across single-cell samples. We compare scINSIGHT with state-of-the-art methods using simulated and real data, which demonstrate its improved performance. Our results show the applicability of scINSIGHT in diverse biomedical and clinical problems.
Affiliation(s)
- Kun Qian
  - School of Mathematics and Physics, China University of Geosciences, Wuhan, 430074, Hubei, China
- Shiwei Fu
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, 08854, NJ, USA
- Hongwei Li
  - School of Mathematics and Physics, China University of Geosciences, Wuhan, 430074, Hubei, China
- Wei Vivian Li
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, 08854, NJ, USA
55
Lock EF, Park JY, Hoadley KA. Bidimensional linked matrix factorization for pan-omics pan-cancer analysis. Ann Appl Stat 2022; 16:193-215. [PMID: 35505906 PMCID: PMC9060567 DOI: 10.1214/21-aoas1495] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 09/19/2023]
Abstract
Several modern applications require the integration of multiple large data matrices that have shared rows and/or columns. For example, cancer studies that integrate multiple omics platforms across multiple types of cancer, pan-omics pan-cancer analysis, have extended our knowledge of molecular heterogeneity beyond what was observed in single tumor and single platform studies. However, these studies have been limited by available statistical methodology. We propose a flexible approach to the simultaneous factorization and decomposition of variation across such bidimensionally linked matrices, BIDIFAC+. BIDIFAC+ decomposes variation into a series of low-rank components that may be shared across any number of row sets (e.g., omics platforms) or column sets (e.g., cancer types). This builds on a growing literature for the factorization and decomposition of linked matrices which has primarily focused on multiple matrices that are linked in one dimension (rows or columns) only. Our objective function extends nuclear norm penalization, is motivated by random matrix theory, gives a unique decomposition under relatively mild conditions, and can be shown to give the mode of a Bayesian posterior distribution. We apply BIDIFAC+ to pan-omics pan-cancer data from TCGA, identifying shared and specific modes of variability across four different omics platforms and 29 different cancer types.
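As a rough illustration of the nuclear norm penalization mentioned in this abstract, the sketch below shows singular value soft-thresholding for a single matrix, which solves the one-matrix problem min_A 0.5·||X − A||_F² + λ·||A||_*. It is only a building block under stated assumptions, not the BIDIFAC+ algorithm itself; the function name `singular_value_threshold`, the penalty `lam`, and the simulated data are hypothetical.

```python
import numpy as np

def singular_value_threshold(X, lam):
    """Return argmin_A 0.5*||X - A||_F^2 + lam*||A||_* (nuclear norm penalty)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)  # soft-threshold each singular value
    return (U * s_shrunk) @ Vt           # rebuild the matrix from shrunken values

# Hypothetical example: a rank-3 signal plus small Gaussian noise.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))
noisy = low_rank + 0.1 * rng.normal(size=(50, 40))
denoised = singular_value_threshold(noisy, lam=2.0)
print("numerical rank after thresholding:", np.linalg.matrix_rank(denoised))
```

Components whose singular values fall below `lam` are removed entirely, which is why this kind of penalty tends to produce low-rank estimates.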
Affiliation(s)
- Eric F. Lock
  - Division of Biostatistics, School of Public Health, University of Minnesota
- Jun Young Park
  - Department of Statistical Sciences, Faculty of Arts & Science, University of Toronto
- Katherine A. Hoadley
  - Department of Genetics, Computational Medicine Program, University of North Carolina
56
Multi-Omics Profiling of the Tumor Microenvironment. Advances in Experimental Medicine and Biology 2022; 1361:283-326. [DOI: 10.1007/978-3-030-91836-1_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
57
Nakazawa MA, Tamada Y, Tanaka Y, Ikeguchi M, Higashihara K, Okuno Y. Novel cancer subtyping method based on patient-specific gene regulatory network. Sci Rep 2021; 11:23653. [PMID: 34880275 PMCID: PMC8654869 DOI: 10.1038/s41598-021-02394-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/14/2021] [Accepted: 11/12/2021] [Indexed: 12/11/2022]
Abstract
The identification of cancer subtypes is important for the understanding of tumor heterogeneity. In recent years, numerous computational methods have been proposed for this problem based on the multi-omics data of patients. It is widely accepted that different cancer subtypes are induced by different molecular regulatory networks. However, only a few incorporate the differences between their molecular systems into the identification processes. In this study, we present a novel method to identify cancer subtypes based on patient-specific molecular systems. Our method realizes this by quantifying patient-specific gene networks, which are estimated from their transcriptome data, and by clustering their quantified networks. Comprehensive analyses of The Cancer Genome Atlas (TCGA) datasets applied to our method confirmed that they were able to identify more clinically meaningful cancer subtypes than the existing subtypes and found that the identified subtypes comprised different molecular features. Our findings also show that the proposed method can identify the novel cancer subtypes even with single omics data, which cannot otherwise be captured by existing methods using multi-omics data.
Affiliation(s)
- Yoshinori Tamada
  - Graduate School of Medicine, Kyoto University, Kyoto, 606-8507, Japan
  - Innovation Center for Health Promotion, Hirosaki University, Hirosaki, 036-8562, Japan
- Yoshihisa Tanaka
  - Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto, 606-8507, Japan
  - Biomedical Computational Intelligence Unit, HPC- and AI-driven Drug Development Platform Division, RIKEN Center for Computational Science, Kobe, 650-0047, Japan
- Marie Ikeguchi
  - Graduate School of Medicine, Kyoto University, Kyoto, 606-8507, Japan
- Kako Higashihara
  - Graduate School of Medicine, Kyoto University, Kyoto, 606-8507, Japan
- Yasushi Okuno
  - Graduate School of Medicine, Kyoto University, Kyoto, 606-8507, Japan
  - Biomedical Computational Intelligence Unit, HPC- and AI-driven Drug Development Platform Division, RIKEN Center for Computational Science, Kobe, 650-0047, Japan
58
Carmichael I, Calhoun BC, Hoadley KA, Troester MA, Geradts J, Couture HD, Olsson L, Perou CM, Niethammer M, Hannig J, Marron JS. Joint and individual analysis of breast cancer histologic images and genomic covariates. Ann Appl Stat 2021; 15:1697-1722. [PMID: 35432688 PMCID: PMC9007558 DOI: 10.1214/20-aoas1433] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 07/30/2023]
Abstract
The two main approaches in the study of breast cancer are histopathology (analyzing visual characteristics of tumors) and genomics. While both histopathology and genomics are fundamental to cancer research, the connections between these fields have been relatively superficial. We bridge this gap by investigating the Carolina Breast Cancer Study through the development of an integrative, exploratory analysis framework. Our analysis gives insights - some known, some novel - that are engaging to both pathologists and geneticists. Our analysis framework is based on Angle-based Joint and Individual Variation Explained (AJIVE) for statistical data integration and exploits Convolutional Neural Networks (CNNs) as a powerful, automatic method for image feature extraction. CNNs raise interpretability issues that we address by developing novel methods to explore visual modes of variation captured by statistical algorithms (e.g. PCA or AJIVE) applied to CNN features.
Affiliation(s)
- Jan Hannig
  - University of North Carolina at Chapel Hill
- J S Marron
  - University of North Carolina at Chapel Hill
59
Nguyen H, Tran D, Tran B, Roy M, Cassell A, Dascalu S, Draghici S, Nguyen T. SMRT: Randomized Data Transformation for Cancer Subtyping and Big Data Analysis. Front Oncol 2021; 11:725133. [PMID: 34745946 PMCID: PMC8563705 DOI: 10.3389/fonc.2021.725133] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/14/2021] [Accepted: 09/28/2021] [Indexed: 12/25/2022]
Abstract
Cancer is an umbrella term that includes a range of disorders, from those that are fast-growing and lethal to indolent lesions with low or delayed potential for progression to death. The treatment options, as well as treatment success, are highly dependent on the correct subtyping of individual patients. With the advancement of high-throughput platforms, we have the opportunity to differentiate among cancer subtypes from a holistic perspective that takes into consideration phenomena at different molecular levels (mRNA, methylation, etc.). This demands powerful integrative methods to leverage large multi-omics datasets for a better subtyping. Here we introduce Subtyping Multi-omics using a Randomized Transformation (SMRT), a new method for multi-omics integration and cancer subtyping. SMRT offers the following advantages over existing approaches: (i) the scalable analysis pipeline allows researchers to integrate multi-omics data and analyze hundreds of thousands of samples in minutes, (ii) the ability to integrate data types with different numbers of patients, (iii) the ability to analyze un-matched data of different types, and (iv) the ability to offer users a convenient data analysis pipeline through a web application. We also improve the efficiency of our ensemble-based, perturbation clustering to support analysis on machines with memory constraints. In an extensive analysis, we compare SMRT with eight state-of-the-art subtyping methods using 37 TCGA and two METABRIC datasets comprising a total of almost 12,000 patient samples from 28 different types of cancer. We also performed a number of simulation studies. We demonstrate that SMRT outperforms other methods in identifying subtypes with significantly different survival profiles. In addition, SMRT is extremely fast, being able to analyze hundreds of thousands of samples in minutes. The web application is available at http://SMRT.tinnguyen-lab.com. 
The R package will be deposited to CRAN as part of our PINSPlus software suite.
Affiliation(s)
- Hung Nguyen
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Duc Tran
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Bang Tran
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Monikrishna Roy
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Adam Cassell
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Sergiu Dascalu
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Sorin Draghici
  - Department of Computer Science, Wayne State University, Detroit, MI, United States
- Tin Nguyen
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
60
Miao Z, Humphreys BD, McMahon AP, Kim J. Multi-omics integration in the age of million single-cell data. Nat Rev Nephrol 2021; 17:710-724. [PMID: 34417589 PMCID: PMC9191639 DOI: 10.1038/s41581-021-00463-x] [Citation(s) in RCA: 102] [Impact Index Per Article: 25.5] [Accepted: 06/25/2021] [Indexed: 02/06/2023]
Abstract
An explosion in single-cell technologies has revealed a previously underappreciated heterogeneity of cell types and novel cell-state associations with sex, disease, development and other processes. Starting with transcriptome analyses, single-cell techniques have extended to multi-omics approaches and now enable the simultaneous measurement of data modalities and spatial cellular context. Data are now available for millions of cells, for whole-genome measurements and for multiple modalities. Although analyses of such multimodal datasets have the potential to provide new insights into biological processes that cannot be inferred with a single mode of assay, the integration of very large, complex, multimodal data into biological models and mechanisms represents a considerable challenge. An understanding of the principles of data integration and visualization methods is required to determine what methods are best applied to a particular single-cell dataset. Each class of method has advantages and pitfalls in terms of its ability to achieve various biological goals, including cell-type classification, regulatory network modelling and biological process inference. In choosing a data integration strategy, consideration must be given to whether the multi-omics data are matched (that is, measured on the same cell) or unmatched (that is, measured on different cells) and, more importantly, the overall modelling and visualization goals of the integrated analysis.
Affiliation(s)
- Zhen Miao
  - Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
  - Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Benjamin D Humphreys
  - Division of Nephrology, Department of Medicine, Washington University in St. Louis, St. Louis, MO, USA
- Andrew P McMahon
  - Department of Stem Cell Biology and Regenerative Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Junhyong Kim
  - Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
  - Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
61
Shiga M, Seno S, Onizuka M, Matsuda H. SC-JNMF: single-cell clustering integrating multiple quantification methods based on joint non-negative matrix factorization. PeerJ 2021; 9:e12087. [PMID: 34532161 PMCID: PMC8404576 DOI: 10.7717/peerj.12087] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/10/2021] [Accepted: 08/07/2021] [Indexed: 11/20/2022]
Abstract
Single-cell RNA-sequencing is a rapidly evolving technology that enables us to understand biological processes at unprecedented resolution. Single-cell expression analysis requires a complex data processing pipeline, and the pipeline is divided into two main parts: The quantification part, which converts the sequence information into gene-cell matrix data; the analysis part, which analyzes the matrix data using statistics and/or machine learning techniques. In the analysis part, unsupervised cell clustering plays an important role in identifying cell types and discovering cell diversity and subpopulations. Identified cell clusters are also used for subsequent analysis, such as finding differentially expressed genes and inferring cell trajectories. However, single-cell clustering using gene expression profiles shows different results depending on the quantification methods. Clustering results are greatly affected by the quantification method used in the upstream process. In other words, even if the original RNA-sequence data is the same, gene expression profiles processed by different quantification methods will produce different clusters. In this article, we propose a robust and highly accurate clustering method based on joint non-negative matrix factorization (joint-NMF) by utilizing the information from multiple gene expression profiles quantified using different methods from the same RNA-sequence data. Our joint-NMF can extract common factors among multiple gene expression profiles by applying each NMF under the constraint that one of the factorized matrices is shared among multiple NMFs. The joint-NMF determines more robust and accurate cell clustering results by leveraging multiple quantification methods compared to conventional clustering methods, which use only a single gene expression profile. Additionally, we showed the usefulness of discovering marker genes with the extracted features using our method.
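The shared-factor constraint described in this abstract can be sketched as follows: each expression matrix X_k (genes × cells) is approximated by W_k · H, with one cell-factor matrix H shared across all quantifications. This is a minimal sketch using plain multiplicative updates on the summed squared error, not the authors' SC-JNMF implementation, which may include further terms; the function name `joint_nmf`, its parameters, and the simulated data are hypothetical.

```python
import numpy as np

def joint_nmf(matrices, rank, n_iter=200, seed=0, eps=1e-10):
    """Approximate each non-negative X_k (features_k x cells) as W_k @ H with one shared H."""
    rng = np.random.default_rng(seed)
    n_cells = matrices[0].shape[1]
    Ws = [rng.random((X.shape[0], rank)) for X in matrices]
    H = rng.random((rank, n_cells))
    for _ in range(n_iter):
        # Update each view-specific W_k while the shared H is held fixed.
        for k, X in enumerate(matrices):
            Ws[k] *= (X @ H.T) / (Ws[k] @ H @ H.T + eps)
        # Update the shared H using the pooled information from every view.
        numerator = sum(W.T @ X for W, X in zip(Ws, matrices))
        denominator = sum(W.T @ W @ H for W in Ws) + eps
        H *= numerator / denominator
    return Ws, H

# Hypothetical example: two "quantifications" generated from the same cell factors.
rng = np.random.default_rng(1)
H_true = rng.random((3, 30))
views = [rng.random((20, 3)) @ H_true, rng.random((25, 3)) @ H_true]
Ws, H = joint_nmf(views, rank=3)
labels = H.argmax(axis=0)  # one simple way to read cluster labels off the shared factor
```

Assigning each cell to its largest entry in H is only one possible readout; clustering the columns of H with a separate algorithm is an equally reasonable choice.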
Affiliation(s)
- Mikio Shiga
  - Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
- Shigeto Seno
  - Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
- Makoto Onizuka
  - Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
- Hideo Matsuda
  - Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
62
Heo YJ, Hwa C, Lee GH, Park JM, An JY. Integrative Multi-Omics Approaches in Cancer Research: From Biological Networks to Clinical Subtypes. Mol Cells 2021; 44:433-443. [PMID: 34238766 PMCID: PMC8334347 DOI: 10.14348/molcells.2021.0042] [Citation(s) in RCA: 88] [Impact Index Per Article: 22.0] [Received: 02/20/2021] [Revised: 04/09/2021] [Accepted: 05/12/2021] [Indexed: 11/27/2022]
Abstract
Multi-omics approaches are novel frameworks that integrate multiple omics datasets generated from the same patients to better understand the molecular and clinical features of cancers. A wide range of emerging omics and multi-view clustering algorithms now provide unprecedented opportunities to further classify cancers into subtypes, improve the survival prediction and therapeutic outcome of these subtypes, and understand key pathophysiological processes through different molecular layers. In this review, we overview the concept and rationale of multi-omics approaches in cancer research. We also introduce recent advances in the development of multi-omics algorithms and integration methods for multiple-layered datasets from cancer patients. Finally, we summarize the latest findings from large-scale multi-omics studies of various cancers and their implications for patient subtyping and drug development.
Affiliation(s)
- Yong Jin Heo
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
  - Department of Integrated Biomedical and Life Science, Korea University, Seoul 02841, Korea
- Chanwoong Hwa
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
- Gang-Hee Lee
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
- Jae-Min Park
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
- Joon-Yong An
  - School of Biosystem and Biomedical Science, College of Health Science, Korea University, Seoul 02841, Korea
  - Department of Integrated Biomedical and Life Science, Korea University, Seoul 02841, Korea
63
Ding Q, Sun Y, Shang J, Li F, Zhang Y, Liu JX. NMFNA: A Non-negative Matrix Factorization Network Analysis Method for Identifying Modules and Characteristic Genes of Pancreatic Cancer. Front Genet 2021; 12:678642. [PMID: 34367241 PMCID: PMC8340025 DOI: 10.3389/fgene.2021.678642] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/10/2021] [Accepted: 06/03/2021] [Indexed: 01/15/2023]
Abstract
Pancreatic cancer (PC) is a highly fatal disease, yet its causes remain unclear. Comprehensive analysis of different types of PC genetic data plays a crucial role in understanding its pathogenic mechanisms. Currently, non-negative matrix factorization (NMF)-based methods are widely used for genetic data analysis. Nevertheless, it is a challenge for them to integrate and decompose different types of genetic data simultaneously. In this paper, an NMF network analysis method, NMFNA, is proposed, which introduces a graph-regularized constraint to the NMF, for identifying modules and characteristic genes from two-type PC data of methylation (ME) and copy number variation (CNV). Firstly, three PC networks, i.e., ME network, CNV network, and ME-CNV network, are constructed using the Pearson correlation coefficient (PCC). Then, modules are detected from these three PC networks effectively due to the introduced graph-regularized constraint, which is the highlight of the NMFNA. Finally, both gene ontology (GO) and pathway enrichment analyses are performed, and characteristic genes are detected by the multimeasure score, to deeply understand biological functions of PC core modules. Experimental results demonstrated that the NMFNA facilitates the integration and decomposition of two types of PC data simultaneously and can further serve as an alternative method for detecting modules and characteristic genes from multiple genetic data of complex diseases.
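The network construction step in this abstract, three networks built from Pearson correlation coefficients, can be sketched as below. This is an illustrative sketch only: the function name `pearson_network`, the genes × samples layout, and the random stand-in data are assumptions, and any thresholding or weighting used by NMFNA is not reproduced here. Rows with zero variance would need to be filtered out beforehand.

```python
import numpy as np

def pearson_network(A, B=None):
    """Pearson correlation between rows of A and rows of B (or of A with itself)."""
    A = np.asarray(A, dtype=float)
    B = A if B is None else np.asarray(B, dtype=float)
    # Standardize each row across samples (population standard deviation).
    Az = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)
    Bz = (B - B.mean(axis=1, keepdims=True)) / B.std(axis=1, keepdims=True)
    return Az @ Bz.T / A.shape[1]

# Hypothetical stand-ins for methylation (ME) and copy number variation (CNV) data,
# laid out as genes x samples with the same samples in both matrices.
rng = np.random.default_rng(0)
me = rng.normal(size=(6, 40))
cnv = rng.normal(size=(5, 40))

me_net = pearson_network(me)           # ME network (6 x 6)
cnv_net = pearson_network(cnv)         # CNV network (5 x 5)
me_cnv_net = pearson_network(me, cnv)  # ME-CNV network (6 x 5)
```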
Affiliation(s)
- Qian Ding
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Yan Sun
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Junliang Shang
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Feng Li
  - School of Computer Science, Qufu Normal University, Rizhao, China
- Yuanyuan Zhang
  - School of Information and Control Engineering, Qingdao University of Technology, Qingdao, China
- Jin-Xing Liu
  - School of Computer Science, Qufu Normal University, Rizhao, China
64
Moon S, Lee H. JDSNMF: Joint Deep Semi-Non-Negative Matrix Factorization for Learning Integrative Representation of Molecular Signals in Alzheimer's Disease. J Pers Med 2021; 11:jpm11080686. [PMID: 34442330 PMCID: PMC8400727 DOI: 10.3390/jpm11080686] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 06/02/2021] [Revised: 07/16/2021] [Accepted: 07/18/2021] [Indexed: 12/14/2022]
Abstract
High dimensional multi-omics data integration can enhance our understanding of the complex biological interactions in human diseases. However, most studies involving unsupervised integration of multi-omics data focus on linear integration methods. In this study, we propose a joint deep semi-non-negative matrix factorization (JDSNMF) model, which uses a hierarchical non-linear feature extraction approach that can capture shared latent features from the complex multi-omics data. The extracted latent features obtained from JDSNMF enabled a variety of downstream tasks, including prediction of disease and module analysis. The proposed model is applicable not only to sample-matched multiple data (e.g., multi-omics data from one cohort) but also to feature-matched multiple data (e.g., omics data from multiple cohorts), and therefore it can be flexibly applied to various cases. We demonstrate the capabilities of JDSNMF using sample-matched simulated data and feature-matched multi-omics data from Alzheimer’s disease cohorts, evaluating the feature extraction performance in the context of classification. In a test application, we identify AD- and age-related modules from the latent matrices using an explainable artificial intelligence and regression model. These results show that the JDSNMF model is effective in identifying latent features having a complex interplay of potential biological signatures.
65
Song D, Li K, Hemminger Z, Wollman R, Li JJ. scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling. Bioinformatics 2021; 37:i358-i366. [PMID: 34252925 PMCID: PMC8275345 DOI: 10.1093/bioinformatics/btab273] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022]
Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data.
Results: Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data.
Availability and implementation: The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dongyuan Song
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246, USA
| | - Kexin Li
- Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA
| | - Zachary Hemminger
- Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, CA 90095, USA.,Department of Integrative Biology and Physiology, University of California, Los Angeles, CA 90095-7239, USA
| | - Roy Wollman
- Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, CA 90095, USA.,Department of Integrative Biology and Physiology, University of California, Los Angeles, CA 90095-7239, USA.,Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095-1569, USA
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA.,Department of Human Genetics, University of California, Los Angeles, CA 90095-7088, USA.,Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766, USA.,Department of Biostatistics, University of California Los Angeles, CA 90095-1772, USA
|
66
|
Zhao S, Tsibris A. Leveraging Novel Integrated Single-Cell Analyses to Define HIV-1 Latency Reversal. Viruses 2021; 13:1197. [PMID: 34206546 PMCID: PMC8310207 DOI: 10.3390/v13071197] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/29/2021] [Revised: 06/11/2021] [Accepted: 06/16/2021] [Indexed: 01/24/2023] Open
Abstract
While suppressive antiretroviral therapy can effectively limit HIV-1 replication and evolution, it leaves behind a residual pool of integrated viral genomes that persist in a state of reversible nonproductive infection, referred to as the HIV-1 reservoir. HIV-1 infection models were established to investigate HIV-1 latency and its reversal; recent work began to probe the dynamics of HIV-1 latency reversal at single-cell resolution. Signals that establish HIV-1 latency and govern its reactivation are complex and may not be completely resolved at the cellular and regulatory levels by the aggregated measurements of bulk cellular-sequencing methods. High-throughput single-cell technologies that characterize and quantify changes to the epigenome, transcriptome, and proteome continue to rapidly evolve. Combinations of single-cell techniques, in conjunction with novel computational approaches to analyze these data, were developed and provide an opportunity to improve the resolution of the heterogeneity that may exist in HIV-1 reactivation. In this review, we summarize the published single-cell HIV-1 transcriptomic work and explore how cutting-edge advances in single-cell techniques and integrative data-analysis tools may be leveraged to define the mechanisms that control the reversal of HIV-1 latency.
Affiliation(s)
| | - Athe Tsibris
- Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02139, USA
|
67
|
Picard M, Scott-Boyer MP, Bodein A, Périn O, Droit A. Integration strategies of multi-omics data for machine learning analysis. Comput Struct Biotechnol J 2021; 19:3735-3746. [PMID: 34285775 PMCID: PMC8258788 DOI: 10.1016/j.csbj.2021.06.030] [Citation(s) in RCA: 246] [Impact Index Per Article: 61.5] [Received: 03/30/2021] [Revised: 06/17/2021] [Accepted: 06/21/2021] [Indexed: 12/25/2022] Open
Abstract
Increased availability of high-throughput technologies has generated an ever-growing number of omics data that seek to portray many different but complementary biological layers including genomics, epigenomics, transcriptomics, proteomics, and metabolomics. New insights from these data have been obtained by machine learning algorithms that have produced diagnostic and classification biomarkers. Most biomarkers obtained to date, however, only include one omic measurement at a time and thus do not take full advantage of recent multi-omics experiments that now capture the entire complexity of biological systems. Multi-omics data integration strategies are needed to combine the complementary knowledge brought by each omics layer. We have summarized the most recent data integration methods/frameworks into five different integration strategies: early, mixed, intermediate, late and hierarchical. In this mini-review, we focus on challenges and existing multi-omics integration strategies by paying special attention to machine learning applications.
Affiliation(s)
- Milan Picard
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
| | - Marie-Pier Scott-Boyer
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
| | - Antoine Bodein
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
| | - Olivier Périn
- Digital Sciences Department, L'Oréal Advanced Research, Aulnay-sous-bois, France
| | - Arnaud Droit
- Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Corresponding author.
|
68
|
Okochi Y, Sakaguchi S, Nakae K, Kondo T, Naoki H. Model-based prediction of spatial gene expression via generative linear mapping. Nat Commun 2021; 12:3731. [PMID: 34140477 PMCID: PMC8211835 DOI: 10.1038/s41467-021-24014-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 06/14/2020] [Accepted: 05/27/2021] [Indexed: 11/17/2022] Open
Abstract
Decoding spatial transcriptomes from single-cell RNA sequencing (scRNA-seq) data has become a fundamental technique for understanding multicellular systems; however, existing computational methods lack both accuracy and biological interpretability due to their model-free frameworks. Here, we introduce Perler, a model-based method to integrate scRNA-seq data with reference in situ hybridization (ISH) data. To calibrate differences between these datasets, we develop a biologically interpretable model that uses generative linear mapping based on a Gaussian mixture model using the Expectation–Maximization algorithm. Perler accurately predicts the spatial gene expression of Drosophila embryos, zebrafish embryos, mammalian liver, and mouse visual cortex from scRNA-seq data. Furthermore, the reconstructed transcriptomes do not over-fit the ISH data and preserved the timing information of the scRNA-seq data. These results demonstrate the generalizability of Perler for dataset integration, thereby providing a biologically interpretable framework for accurate reconstruction of spatial transcriptomes in any multicellular system. Single cell RNA-seq loses spatial information of gene expression in multicellular systems because tissue must be dissociated. Here, the authors show the spatial gene expression profiles can be both accurately and robustly reconstructed by a new computational method using a generative linear mapping, Perler.
Affiliation(s)
- Yasushi Okochi
- Laboratory for Theoretical Biology, Graduate School of Biostudies, Kyoto University, Kyoto, Japan.,Faculty of Medicine, Kyoto University, Kyoto, Japan
| | - Shunta Sakaguchi
- Laboratory for Cell Recognition and Pattern Formation, Graduate School of Biostudies, Kyoto University, Kyoto, Japan
| | - Ken Nakae
- Graduate School of Informatics, Kyoto University, Kyoto, Japan
| | - Takefumi Kondo
- Laboratory for Cell Recognition and Pattern Formation, Graduate School of Biostudies, Kyoto University, Kyoto, Japan.,The Keihanshin Consortium for Fostering the Next Generation of Global Leaders in Research (K-CONNEX), Kyoto, Japan
| | - Honda Naoki
- Laboratory for Theoretical Biology, Graduate School of Biostudies, Kyoto University, Kyoto, Japan.,Laboratory for Data-driven Biology, Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Hiroshima, Japan.,Theoretical Biology Research Group, Exploratory Research Center on Life and Living Systems (ExCELLS), National Institutes of Natural Sciences, Okazaki, Aichi, Japan
|
69
|
Chen T, Philip M, Lê Cao KA, Tyagi S. A multi-modal data harmonisation approach for discovery of COVID-19 drug targets. Brief Bioinform 2021; 22:6279836. [PMID: 34036326 PMCID: PMC8194516 DOI: 10.1093/bib/bbab185] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 10/26/2020] [Revised: 03/09/2021] [Accepted: 04/22/2021] [Indexed: 12/27/2022] Open
Abstract
Despite the volume of experiments performed and data available, the complex biology of coronavirus SARS-CoV-2 is not yet fully understood. Existing molecular profiling studies have focused on analysing functional omics data of a single type, which captures changes in a small subset of the molecular perturbations caused by the virus. As the logical next step, results from multiple such omics analyses may be aggregated to comprehensively interpret the molecular mechanisms of SARS-CoV-2. An alternative approach is to integrate data simultaneously in a parallel fashion to highlight the inter-relationships of disease-driving biomolecules, in contrast to comparing processed information from each omics level separately. We demonstrate that valuable information may be masked by using the former fragmented views in analysis, and biomarkers resulting from such an approach cannot provide a systematic understanding of the disease aetiology. Hence, we present a generic, reproducible and flexible open-access data harmonisation framework that can be scaled out to future multi-omics analysis to study a phenotype in a holistic manner. The pipeline source code, detailed documentation and automated version as an R package are accessible. To demonstrate the effectiveness of our pipeline, we applied it to a drug screening task. We integrated multi-omics data to find the lowest level of statistical associations between data features in two case studies. Strongly correlated features within each of these two datasets were used for drug-target analysis, resulting in a list of 84 drug-target candidates. Further computational docking and toxicity analyses revealed seven high-confidence targets, amsacrine, bosutinib, ceritinib, crizotinib, nintedanib and sunitinib as potential starting points for drug therapy and development.
Affiliation(s)
- Tyrone Chen
- School of Biological Sciences, Monash University, 25 Rainforest Walk, 3800, VIC, Australia
| | - Melcy Philip
- School of Biological Sciences, Monash University, 25 Rainforest Walk, 3800, VIC, Australia
| | - Kim-Anh Lê Cao
- Melbourne Integrative Genomics, University of Melbourne, Building 184, Royal Parade, 3010, VIC, Australia.,School of Mathematics and Statistics, University of Melbourne, 813 Swanston Street, 3010, VIC, Australia
| | - Sonika Tyagi
- School of Biological Sciences, Monash University, 25 Rainforest Walk, 3800, VIC, Australia.,Monash eResearch Centre, Monash University, 15 Innovation Walk, 3800, VIC, Australia.,Department of Infectious Disease, Monash University, 85 Commercial Road, 3004, VIC, Australia
|
70
|
Adossa N, Khan S, Rytkönen KT, Elo LL. Computational strategies for single-cell multi-omics integration. Comput Struct Biotechnol J 2021; 19:2588-2596. [PMID: 34025945 PMCID: PMC8114078 DOI: 10.1016/j.csbj.2021.04.060] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 02/19/2021] [Revised: 04/23/2021] [Accepted: 04/24/2021] [Indexed: 02/06/2023] Open
Abstract
Single-cell omics technologies are currently solving biological and medical problems that earlier have remained elusive, such as discovery of new cell types, cellular differentiation trajectories and communication networks across cells and tissues. Current advances especially in single-cell multi-omics hold high potential for breakthroughs by integration of multiple different omics layers. To pair with the recent biotechnological developments, many computational approaches to process and analyze single-cell multi-omics data have been proposed. In this review, we first introduce recent developments in single-cell multi-omics in general and then focus on the available data integration strategies. The integration approaches are divided into three categories: early, intermediate, and late data integration. For each category, we describe the underlying conceptual principles and main characteristics, as well as provide examples of currently available tools and how they have been applied to analyze single-cell multi-omics data. Finally, we explore the challenges and prospective future directions of single-cell multi-omics data integration, including examples of adopting multi-view analysis approaches used in other disciplines to single-cell multi-omics.
Affiliation(s)
- Nigatu Adossa
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
| | - Sofia Khan
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
| | - Kalle T. Rytkönen
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
- Institute of Biomedicine, University of Turku, 20520 Turku, Finland
| | - Laura L. Elo
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
- Institute of Biomedicine, University of Turku, 20520 Turku, Finland
|
71
|
Wang B, Ma X, Xie M, Wu Y, Wang Y, Duan R, Zhang C, Yu L, Guo X, Gao L. CBP-JMF: An Improved Joint Matrix Tri-Factorization Method for Characterizing Complex Biological Processes of Diseases. Front Genet 2021; 12:665416. [PMID: 33968140 PMCID: PMC8103031 DOI: 10.3389/fgene.2021.665416] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2021] [Accepted: 03/01/2021] [Indexed: 11/13/2022] Open
Abstract
Multi-omics molecules regulate complex biological processes (CBPs), which reflect the activities of various molecules in living organisms. Meanwhile, the applications to represent disease subtypes and cell types have created an urgent need for sample grouping and associated CBP-inferring tools. In this paper, we present CBP-JMF, a practical tool primarily for discovering CBPs, which underlie sample groups as disease subtypes in applications. Unlike existing methods, CBP-JMF is based on a joint non-negative matrix tri-factorization framework and is implemented in Python. As a pragmatic application, we apply CBP-JMF to identify CBPs for four subtypes of breast cancer. The result shows significant overlap between genes extracted from CBPs and known subtype pathways. We verify the effectiveness of our tool in detecting CBPs that interpret subtypes of disease.
Affiliation(s)
- Bingbo Wang
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Xiujuan Ma
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Minghui Xie
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Yue Wu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Yajun Wang
- School of Humanities and Foreign Languages, Xi'an University of Technology, Xi'an, China
| | - Ran Duan
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Chenxing Zhang
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Xingli Guo
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Lin Gao
- School of Computer Science and Technology, Xidian University, Xi'an, China
|
72
|
Liang L, Zhu K, Tao J, Lu S. ORN: Inferring patient-specific dysregulation status of pathway modules in cancer with OR-gate Network. PLoS Comput Biol 2021; 17:e1008792. [PMID: 33819263 PMCID: PMC8049496 DOI: 10.1371/journal.pcbi.1008792] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/22/2020] [Revised: 04/15/2021] [Accepted: 02/15/2021] [Indexed: 01/26/2023] Open
Abstract
Pathway-level understanding of cancer plays a key role in precision oncology. However, the current amount of high-throughput data cannot support the elucidation of full pathway topology. In this study, instead of directly learning the pathway network, we adapted the probabilistic OR gate to model the modular structure of pathways and regulon. The resulting model, OR-gate Network (ORN), can simultaneously infer pathway modules of somatic alterations, patient-specific pathway dysregulation status, and downstream regulon. In a trained ORN, the differentially expressed genes (DEGs) in each tumour can be explained by somatic mutations perturbing a pathway module. Furthermore, the ORN handles one of the most important properties of pathway perturbation in tumours, the mutual exclusivity. We have applied the ORN to lower-grade glioma (LGG) samples and liver hepatocellular carcinoma (LIHC) samples in TCGA and breast cancer samples from METABRIC. Both datasets have shown abnormal pathway activities related to immune response and cell cycles. In LGG samples, ORN identified pathway modules closely related to glioma development and revealed two pathways closely related to patient survival. We had similar results with LIHC samples. Additional results from the METABRIC datasets showed that ORN could characterize critical mechanisms of cancer and connect them to less studied somatic mutations (e.g., BAP1, MIR604, MICAL3, and telomere activities), which may generate novel hypotheses for targeted therapy.
Affiliation(s)
- Lifan Liang
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Kunju Zhu
- Clinical Medicine Research Institute, Jinan University, Guangzhou, Guangdong, China
| | - Junyan Tao
- Department of Pathology, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Songjian Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
|
73
|
Auslander N, Gussow AB, Koonin EV. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int J Mol Sci 2021; 22:2903. [PMID: 33809353 PMCID: PMC8000113 DOI: 10.3390/ijms22062903] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 02/15/2021] [Revised: 03/08/2021] [Accepted: 03/10/2021] [Indexed: 12/23/2022] Open
Abstract
The exponential growth of biomedical data in recent years has urged the application of numerous machine learning techniques to address emerging problems in biology and clinical research. By enabling the automatic feature extraction, selection, and generation of predictive models, these methods can be used to efficiently study complex biological systems. Machine learning techniques are frequently integrated with bioinformatic methods, as well as curated databases and biological networks, to enhance training and validation, identify the best interpretable features, and enable feature and model investigation. Here, we review recently developed methods that incorporate machine learning within the same framework with techniques from molecular evolution, protein structure analysis, systems biology, and disease genomics. We outline the challenges posed for machine learning, and, in particular, deep learning in biomedicine, and suggest unique opportunities for machine learning techniques integrated with established bioinformatics approaches to overcome some of these challenges.
Affiliation(s)
| | | | - Eugene V. Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
74
|
Identifying multimodal signatures underlying the somatic comorbidity of psychosis: the COMMITMENT roadmap. Mol Psychiatry 2021; 26:722-724. [PMID: 33060817 PMCID: PMC7910206 DOI: 10.1038/s41380-020-00915-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/27/2020] [Revised: 09/16/2020] [Accepted: 10/02/2020] [Indexed: 11/25/2022]
|
75
|
Wang M, Allen GI. Integrative Generalized Convex Clustering Optimization and Feature Selection for Mixed Multi-View Data. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2021; 22:55. [PMID: 34744522 PMCID: PMC8570363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.
Affiliation(s)
- Minjie Wang
- Department of Statistics, Rice University, Houston, TX 77005, USA
| | - Genevera I Allen
- Departments of Electrical and Computer Engineering, Statistics, and Computer Science, Rice University, Houston, TX 77005, USA; Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine, Houston, TX 77030, USA
|
76
|
Zhu H, Li G, Lock EF. Generalized integrative principal component analysis for multi-type data with block-wise missing structure. Biostatistics 2020; 21:302-318. [PMID: 30247540 DOI: 10.1093/biostatistics/kxy052] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 03/08/2018] [Accepted: 08/15/2018] [Indexed: 12/19/2022] Open
Abstract
High-dimensional multi-source data are encountered in many fields. Despite recent developments on the integrative dimension reduction of such data, most existing methods cannot easily accommodate data of multiple types (e.g. binary or count-valued). Moreover, multi-source data often have block-wise missing structure, i.e. data in one or more sources may be completely unobserved for a sample. The heterogeneous data types and presence of block-wise missing data pose significant challenges to the integration of multi-source data and further statistical analyses. In this article, we develop a low-rank method, called generalized integrative principal component analysis (GIPCA), for the simultaneous dimension reduction and imputation of multi-source block-wise missing data, where different sources may have different data types. We also devise an adapted Bayesian information criterion (BIC) for rank estimation. Comprehensive simulation studies demonstrate the efficacy of the proposed method in terms of rank estimation, signal recovery, and missing data imputation. We apply GIPCA to a mortality study. We achieve accurate block-wise missing data imputation and identify intriguing latent mortality rate patterns with sociological relevance.
Affiliation(s)
- Huichen Zhu
- The Department of Biostatistics, Columbia University, 722 West 168th St., New York, NY, USA
| | - Gen Li
- The Department of Biostatistics, Columbia University, 722 West 168th St., New York, NY, USA
| | - Eric F Lock
- The Division of Biostatistics, School of Public Health, University of Minnesota, 420 Delaware Street S.E., Minneapolis, MN, USA
|
77
|
Park M, Kim D, Moon K, Park T. Integrative Analysis of Multi-Omics Data Based on Blockwise Sparse Principal Components. Int J Mol Sci 2020; 21:E8202. [PMID: 33147797 PMCID: PMC7663540 DOI: 10.3390/ijms21218202] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/22/2020] [Revised: 10/27/2020] [Accepted: 10/31/2020] [Indexed: 01/14/2023] Open
Abstract
The recent development of high-throughput technology has allowed us to accumulate vast amounts of multi-omics data. Because even single omics data have a large number of variables, integrated analysis of multi-omics data suffers from problems such as computational instability and variable redundancy. Most multi-omics data analyses apply single supervised analysis, repeatedly, for dimensional reduction and variable selection. However, these approaches cannot avoid the problems of redundancy and collinearity of variables. In this study, we propose a novel approach using blockwise component analysis. This would solve the limitations of current methods by applying variable clustering and sparse principal component (sPC) analysis. Our approach consists of two stages. The first stage identifies homogeneous variable blocks, and then extracts sPCs, for each omics dataset. The second stage merges sPCs from each omics dataset, and then constructs a prediction model. We also propose a graphical method showing the results of sparse PCA and model fitting, simultaneously. We applied the proposed methodology to glioblastoma multiforme data from The Cancer Genome Atlas. The comparison with other existing approaches showed that our proposed methodology is more easily interpretable than other approaches, and has comparable predictive power, with a much smaller number of variables.
Affiliation(s)
- Mira Park
- Department of Preventive Medicine, Eulji University, Daejeon 34824, Korea
| | - Doyoen Kim
- Department of Statistics, Korea University, Seoul 02841, Korea
| | - Kwanyoung Moon
- Department of Statistics, Korea University, Seoul 02841, Korea
| | - Taesung Park
- Department of Statistics, Seoul National University, Seoul 08826, Korea
|
78
|
Fantini D, Vidimar V, Yu Y, Condello S, Meeks JJ. MutSignatures: an R package for extraction and analysis of cancer mutational signatures. Sci Rep 2020; 10:18217. [PMID: 33106540 PMCID: PMC7589488 DOI: 10.1038/s41598-020-75062-0] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 07/21/2020] [Accepted: 10/09/2020] [Indexed: 12/31/2022] Open
Abstract
Cancer cells accumulate somatic mutations as a result of DNA damage, inaccurate repair and other mechanisms. Different genetic instability processes result in characteristic non-random patterns of DNA mutations, also known as mutational signatures. We developed mutSignatures, an integrated R-based computational framework aimed at deciphering DNA mutational signatures. Our software provides advanced functions for importing DNA variants, computing mutation types, and extracting mutational signatures via non-negative matrix factorization. Specifically, mutSignatures accepts multiple types of input data, is compatible with non-human genomes, and supports the analysis of non-standard mutation types, such as tetra-nucleotide mutation types. We applied mutSignatures to analyze somatic mutations found in smoking-related cancer datasets. We characterized mutational signatures that were consistent with those reported before in independent investigations. Our work demonstrates that selected mutational signatures correlated with specific clinical and molecular features across different cancer types, and revealed complementarity of specific mutational patterns that has not previously been identified. In conclusion, we propose mutSignatures as a powerful open-source tool for detecting the molecular determinants of cancer and gathering insights into cancer biology and treatment.
Affiliation(s)
- Damiano Fantini
- Department of Urology, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA.,Robert H. Lurie Comprehensive Cancer Center, Northwestern University, Chicago, IL, USA
| | - Vania Vidimar
- Department of Microbiology-Immunology, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
| | - Yanni Yu
- Department of Urology, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA.,Robert H. Lurie Comprehensive Cancer Center, Northwestern University, Chicago, IL, USA.,Department of Biochemistry and Molecular Genetics, Northwestern University, Chicago, IL, USA
| | - Salvatore Condello
- Department of Obstetrics and Gynecology, Indiana University School of Medicine, Indianapolis, IN, USA
| | - Joshua J Meeks
- Department of Urology, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA.,Robert H. Lurie Comprehensive Cancer Center, Northwestern University, Chicago, IL, USA.,Department of Biochemistry and Molecular Genetics, Northwestern University, Chicago, IL, USA
|
79
|
Li J, Lu Q, Wen Y. Multi-kernel linear mixed model with adaptive lasso for prediction analysis on high-dimensional multi-omics data. Bioinformatics 2020; 36:1785-1794. [PMID: 31693075 DOI: 10.1093/bioinformatics/btz822] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 06/10/2019] [Revised: 10/08/2019] [Accepted: 11/01/2019] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION The use of human genome discoveries and other established factors to build an accurate risk prediction model is an essential step toward precision medicine. While multi-layer high-dimensional omics data provide unprecedented data resources for prediction studies, their corresponding analytical methods are much less developed. RESULTS We present a multi-kernel penalized linear mixed model with adaptive lasso (MKpLMM), a predictive modeling framework that extends the standard linear mixed models widely used in genomic risk prediction, for multi-omics data analysis. MKpLMM can capture not only the predictive effects from each layer of omics data but also their interactions via using multiple kernel functions. It adopts a data-driven approach to select predictive regions as well as predictive layers of omics data, and achieves robust selection performance. Through extensive simulation studies, the analyses of PET-imaging outcomes from the Alzheimer's Disease Neuroimaging Initiative study, and the analyses of 64 drug responses, we demonstrate that MKpLMM consistently outperforms competing methods in phenotype prediction. AVAILABILITY AND IMPLEMENTATION The R-package is available at https://github.com/YaluWen/OmicPred. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jun Li
- Department of Thoracic Surgery, Dalian Municipal Central Hospital Affiliated of Dalian Medical University, Dalian 116000, China
- Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
- Yalu Wen
- Department of Statistics, University of Auckland, Auckland 1010, New Zealand
|
80
|
Ishikawa Y, Nakai K. A hypothetical trivalent epigenetic code that affects the nature of human ESCs. PLoS One 2020; 15:e0238742. [PMID: 32911515 PMCID: PMC7482980 DOI: 10.1371/journal.pone.0238742] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/29/2020] [Accepted: 08/21/2020] [Indexed: 02/07/2023]
Abstract
It has been suggested that DNA methylation can work in concert with other epigenetic factors, leading to changes in cellular phenotypes. For example, DNA demethylation modifications producing 5-hydroxymethylcytosine (5hmC) are thought to interact with histone modifications to influence the acquisition of embryonic stem cell (ESC) potency. However, the mechanism by which this occurs is still unknown. Thus, we systematically analysed the co-occurrence of DNA and histone modifications at genic regions as well as their relationship with ESC-specific expression using a number of heterogeneous public datasets. From a set of 19 epigenetic factors, we found remarkable co-occurrence of 5hmC and H4K8ac, accompanied by H3K4me1. This enrichment was more prominent at gene body regions. The results were confirmed using data obtained from different detection methods and species. Our analysis shows that these marks work cooperatively to influence ESC-specific gene expression. We also found that this trivalent mark is relatively enriched in genes related to immunity, an enrichment that is somewhat specific to ESCs. We propose that a trivalent epigenetic mark, composed of 5hmC, H4K8ac and H3K4me1, regulates gene expression and modulates the nature of human ESCs as a novel epigenetic code.
Affiliation(s)
- Yasuhisa Ishikawa
- Department of Computational Biology and Medical Sciences, the University of Tokyo, Kashiwa-shi, Chiba, Japan
- Kenta Nakai
- Department of Computational Biology and Medical Sciences, the University of Tokyo, Kashiwa-shi, Chiba, Japan
- Human Genome Center, the Institute of Medical Science, the University of Tokyo, Tokyo, Japan
|
81
|
Fan J, Slowikowski K, Zhang F. Single-cell transcriptomics in cancer: computational challenges and opportunities. Exp Mol Med 2020; 52:1452-1465. [PMID: 32929226 PMCID: PMC8080633 DOI: 10.1038/s12276-020-0422-0] [Citation(s) in RCA: 116] [Impact Index Per Article: 23.2] [Received: 11/30/2019] [Revised: 02/26/2020] [Accepted: 03/10/2020] [Indexed: 02/07/2023]
Abstract
Intratumor heterogeneity is a common characteristic across diverse cancer types and presents challenges to current standards of treatment. Advancements in high-throughput sequencing and imaging technologies provide opportunities to identify and characterize these aspects of heterogeneity. Notably, transcriptomic profiling at a single-cell resolution enables quantitative measurements of the molecular activity that underlies the phenotypic diversity of cells within a tumor. Such high-dimensional data require computational analysis to extract relevant biological insights about the cell types and states that drive cancer development, pathogenesis, and clinical outcomes. In this review, we highlight emerging themes in the computational analysis of single-cell transcriptomics data and their applications to cancer research. We focus on downstream analytical challenges relevant to cancer research, including how to computationally perform unified analysis across many patients and disease states, distinguish neoplastic from nonneoplastic cells, infer communication with the tumor microenvironment, and delineate tumoral and microenvironmental evolution with trajectory and RNA velocity analysis. We include discussions of challenges and opportunities for future computational methodological advancements necessary to realize the translational potential of single-cell transcriptomic profiling in cancer.
Affiliation(s)
- Jean Fan
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA.
- Kamil Slowikowski
- Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital, Charlestown, MA, USA
- Fan Zhang
- Center for Data Sciences, Brigham and Women's Hospital, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
|
82
|
Hu X, Hu Y, Wu F, Leung RWT, Qin J. Integration of single-cell multi-omics for gene regulatory network inference. Comput Struct Biotechnol J 2020; 18:1925-1938. [PMID: 32774787 PMCID: PMC7385034 DOI: 10.1016/j.csbj.2020.06.033] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 03/01/2020] [Revised: 06/17/2020] [Accepted: 06/20/2020] [Indexed: 12/20/2022]
Abstract
The advancement of single-cell sequencing technology in recent years has provided an opportunity to reconstruct gene regulatory networks (GRNs) with the data from thousands of single cells in one sample. This uncovers regulatory interactions in cells and speeds up the discovery of regulatory mechanisms in diseases and biological processes. Therefore, more methods have been proposed to reconstruct GRNs using single-cell sequencing data. In this review, we introduce technologies for sequencing the single-cell genome, transcriptome, and epigenome. We also present an overview of current GRN reconstruction strategies utilizing different single-cell sequencing data. Bioinformatics tools are grouped by their input data type and mathematical principles for the reader's convenience, and the fundamental mathematics inherent in each group is discussed. Furthermore, the adaptabilities and limitations of these different methods are summarized and compared, in the hope of helping researchers identify the most suitable tools for their work.
Affiliation(s)
- Xinlin Hu
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, College of Mathematics and Statistics, Shenzhen University, Shenzhen 518060, China
- Yaohua Hu
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, College of Mathematics and Statistics, Shenzhen University, Shenzhen 518060, China
- Fanjie Wu
- School of Pharmaceutical Sciences (Shenzhen), Sun Yat-sen University, Shenzhen 518107, China
- Ricky Wai Tak Leung
- School of Pharmaceutical Sciences (Shenzhen), Sun Yat-sen University, Shenzhen 518107, China
- Jing Qin
- School of Pharmaceutical Sciences (Shenzhen), Sun Yat-sen University, Shenzhen 518107, China
|
83
|
Chen T, Tyagi S. Integrative computational epigenomics to build data-driven gene regulation hypotheses. Gigascience 2020; 9:giaa064. [PMID: 32543653 PMCID: PMC7297091 DOI: 10.1093/gigascience/giaa064] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 03/25/2020] [Revised: 05/25/2020] [Accepted: 05/26/2020] [Indexed: 12/20/2022]
Abstract
BACKGROUND: Diseases are complex phenotypes often arising as an emergent property of a non-linear network of genetic and epigenetic interactions. To translate this resulting state into a causal relationship with a subset of regulatory features, many experiments deploy an array of laboratory assays from multiple modalities. Often, each of these resulting datasets is large, heterogeneous, and noisy. Thus, it is non-trivial to unify these complex datasets into an interpretable phenotype. Although recent methods address this problem with varying degrees of success, they are constrained by their scopes or limitations. Therefore, an important gap in the field is the lack of a universal data harmonizer with the capability to arbitrarily integrate multi-modal datasets. RESULTS: In this review, we perform a critical analysis of methods with the explicit aim of harmonizing data, as opposed to case-specific integration. This revealed that matrix factorization, latent variable analysis, and deep learning are potent strategies. Finally, we describe the properties of an ideal universal data harmonization framework. CONCLUSIONS: A sufficiently advanced universal harmonizer has major medical implications: (1) identifying dysregulated biological pathways responsible for a disease is a powerful diagnostic tool; (2) investigating these pathways further allows the biological community to better understand a disease's mechanisms; and (3) precision medicine also benefits from developments in this area, particularly in the context of the growing field of selective epigenome editing, which can suppress or induce a desired phenotype.
Affiliation(s)
- Tyrone Chen
- 25 Rainforest Walk, School of Biological Sciences, Monash University, Clayton, VIC 3800, Australia
- Sonika Tyagi
- 25 Rainforest Walk, School of Biological Sciences, Monash University, Clayton, VIC 3800, Australia
|
84
|
Hu J, Zeng T, Xia Q, Huang L, Zhang Y, Zhang C, Zeng Y, Liu H, Zhang S, Huang G, Wan W, Ding Y, Hu F, Yang C, Chen L, Wang W. Identification of Key Genes for the Ultrahigh Yield of Rice Using Dynamic Cross-tissue Network Analysis. Genomics Proteomics Bioinformatics 2020; 18:256-270. [PMID: 32736037 PMCID: PMC7801251 DOI: 10.1016/j.gpb.2019.11.007] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/19/2018] [Revised: 08/26/2019] [Accepted: 11/08/2019] [Indexed: 11/29/2022]
Abstract
Significantly increasing crop yield is a major worldwide challenge for food supply and security. It is well known that rice cultivated at Taoyuan in Yunnan, China, can produce the highest yield worldwide. Yet the gene regulatory mechanism underpinning this ultrahigh yield has been a mystery. Here, we systematically collected the transcriptome data for seven key tissues at different developmental stages using rice cultivated both at Taoyuan as the case group and at Jinghong, a regular rice planting site, as the control group. We identified the top 24 candidate high-yield genes with their network modules from these well-designed datasets by developing a novel computational systems biology method, i.e., dynamic cross-tissue (DCT) network analysis. We used one of the candidate genes, OsSPL4, whose function was previously unknown, for gene editing experimental validation of the high yield, and confirmed that OsSPL4 significantly affects panicle branching and increases the rice yield. This study, which included extensive field phenotyping, cross-tissue systems biology analyses, and functional validation, uncovered the key genes and gene regulatory networks underpinning the ultrahigh yield of rice. The DCT method could be applied to other plant or animal systems in which samples sharing a common genome sequence show different phenotypes under various environments. DCT can be downloaded from https://github.com/ztpub/DCT.
Affiliation(s)
- Jihong Hu
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan 430072, China
- Tao Zeng
- CAS Key Laboratory of Systems Biology, Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China; Institute of Brain-Intelligence Technology, Zhangjiang Laboratory, Shanghai 201210, China
- Qiongmei Xia
- Institute of Food Crop of Yunnan Academy of Agricultural Sciences, Kunming 650205, China
- Liyu Huang
- School of Agriculture, Yunnan University, Kunming 650500, China
- Yesheng Zhang
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; BGI-Baoshan, Baoshan 678004, China
- Chuanchao Zhang
- CAS Key Laboratory of Systems Biology, Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Yan Zeng
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China
- Hui Liu
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China
- Shilai Zhang
- School of Agriculture, Yunnan University, Kunming 650500, China
- Guangfu Huang
- School of Agriculture, Yunnan University, Kunming 650500, China
- Wenting Wan
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; Center for Ecological and Environmental Sciences, Northwestern Polytechnical University, Xi'an 710072, China
- Yi Ding
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan 430072, China
- Fengyi Hu
- School of Agriculture, Yunnan University, Kunming 650500, China.
- Congdang Yang
- Institute of Food Crop of Yunnan Academy of Agricultural Sciences, Kunming 650205, China.
- Luonan Chen
- CAS Key Laboratory of Systems Biology, Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China; Institute of Brain-Intelligence Technology, Zhangjiang Laboratory, Shanghai 201210, China; School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China.
- Wen Wang
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; Center for Ecological and Environmental Sciences, Northwestern Polytechnical University, Xi'an 710072, China.
|
85
|
Godichon-Baggioni A, Maugis-Rabusseau C, Rau A. Multiview cluster aggregation and splitting, with an application to multiomic breast cancer data. Ann Appl Stat 2020. [DOI: 10.1214/19-aoas1317] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/24/2022]
|
86
|
Chen G, Wang L, Diao T, Chen Y, Cao C, Zhang X. Analysis of immune-related signatures of colorectal cancer identifying two different immune phenotypes: Evidence for immune checkpoint inhibitor therapy. Oncol Lett 2020; 20:517-524. [PMID: 32565977 PMCID: PMC7285802 DOI: 10.3892/ol.2020.11605] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/07/2019] [Accepted: 02/05/2020] [Indexed: 12/24/2022]
Abstract
Immune checkpoint inhibitor (ICI) therapy has revolutionized the treatment of numerous types of cancer, including colorectal cancer (CRC). Patients with CRC and deficient mismatch repair or high microsatellite instability could benefit from ICI treatment, although the response rate of most patients is low. Therefore, the immune subtyping of patients with CRC is required in order to determine the subtypes suitable for ICI treatment. The present study used a cohort of patients with CRC from The Cancer Genome Atlas (TCGA) to perform molecular subtyping, with results validated in three CRC cohorts from the Gene Expression Omnibus. Non-negative matrix factorization was used to achieve consensus molecular subtyping. The tumor immune dysfunction and exclusion algorithm was used to predict potential ICI therapy responses and gene set enrichment analysis was performed to define different pathways associated with the immune response. Two distinct subtypes of CRC were finally identified in TCGA cohorts, which were characterized as significantly different prognostic subtypes (low-risk and high-risk subtypes). Higher expression of programmed death-ligand 1, higher proportion of tumor-infiltrating lymphocytes and tumor mutation burden were significantly enriched in the low-risk subtype. Further pathway analysis revealed that the low-risk subtype was associated with immune response activation and signaling pathways involved in ‘antigen processing and presentation’. Three independent CRC cohorts were used to validate the above findings. In summary, two clinical CRC subtypes were identified, which are characterized by significantly different survival outcomes and immune infiltration patterns. The findings of the present study suggest that ICI treatment may be more effective in the low-risk CRC subtype.
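As an illustration of the consensus NMF subtyping step mentioned in this abstract, here is a minimal sketch. It is not the authors' pipeline: the abstract names only the technique, so the multiplicative-update rule, the argmax cluster assignment, the function names `nmf` and `consensus_matrix`, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative (features x samples) matrix X as W @ H using
    multiplicative updates for the Frobenius-norm objective (an assumed
    choice; the abstract does not say which NMF variant was used)."""
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    W = rng.random((n_features, k)) + eps
    H = rng.random((k, n_samples)) + eps
    for _ in range(n_iter):
        # Update H with W fixed, then W with H fixed; eps avoids division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def consensus_matrix(X, k, n_runs=20):
    """Fraction of random restarts in which each pair of samples is
    assigned to the same cluster (cluster = largest row of H per sample)."""
    n_samples = X.shape[1]
    consensus = np.zeros((n_samples, n_samples))
    for run in range(n_runs):
        _, H = nmf(X, k, seed=run)
        labels = H.argmax(axis=0)
        consensus += labels[:, None] == labels[None, :]
    return consensus / n_runs


# Hypothetical toy data: 6 features x 8 samples with two expression blocks.
rng = np.random.default_rng(42)
X = np.full((6, 8), 0.1)
X[:3, :4] = 5.0
X[3:, 4:] = 5.0
X += rng.random((6, 8)) * 0.1

W, H = nmf(X, k=2)
C = consensus_matrix(X, k=2)
```

Entries of `C` near 1 mark sample pairs that co-cluster in almost every restart, so stable subtypes appear as blocks of high values. In practice the number of clusters `k` is usually chosen by comparing the stability of `C` across several candidate values.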
Affiliation(s)
- Gang Chen
- Department of Anal and Intestinal Surgery, Tengzhou Central People's Hospital, Tengzhou, Shandong 277500, P.R. China
- Lin Wang
- Department of Surgery Operating Room, Tengzhou Central People's Hospital, Tengzhou, Shandong 277500, P.R. China
- Tongwei Diao
- Department of Anal and Intestinal Surgery, Tengzhou Central People's Hospital, Tengzhou, Shandong 277500, P.R. China
- Ying Chen
- Department of Anal and Intestinal Surgery, Tengzhou Central People's Hospital, Tengzhou, Shandong 277500, P.R. China
- Chengbo Cao
- Department of Anal and Intestinal Surgery, Tengzhou Central People's Hospital, Tengzhou, Shandong 277500, P.R. China
- Xindong Zhang
- Department of Pathology, Tengzhou Central People's Hospital, Tengzhou, Shandong 277500, P.R. China
|
87
|
Esposito F, Boccarelli A, Del Buono N. An NMF-Based Methodology for Selecting Biomarkers in the Landscape of Genes of Heterogeneous Cancer-Associated Fibroblast Populations. Bioinform Biol Insights 2020; 14:1177932220906827. [PMID: 32425511 PMCID: PMC7218276 DOI: 10.1177/1177932220906827] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 10/13/2019] [Accepted: 01/22/2020] [Indexed: 01/27/2023]
Abstract
The rapid development of high-performance technologies has greatly promoted studies of molecular oncology, producing large amounts of data. Even if these data are publicly available, they need to be processed and studied to extract information useful for better understanding the mechanisms of pathogenesis of complex diseases, such as tumors. In this article, we illustrate a procedure for mining biologically meaningful biomarkers from microarray datasets of different tumor histotypes. The proposed methodology automatically identifies a subset of potentially informative genes from microarray data matrices that differ in the number of rows (genes) and columns (patients). The methodology integrates a nonnegative matrix factorization method and a functional enrichment analysis web tool with a properly designed gene extraction procedure to allow the analysis of omics input data with different row sizes. The proposed methodology has been used to mine microarray data from solid tumors of different embryonic origin to verify the presence of common genes characterizing the heterogeneity of cancer-associated fibroblasts. These automatically extracted biomarkers could be used to suggest appropriate therapies to inactivate the state of active fibroblasts, thus avoiding their action on tumor progression.
Affiliation(s)
- Flavia Esposito
- Department of Electronic and Information Engineering, Politecnico di Bari, Bari, Italy
- Angelina Boccarelli
- Department of Biomedical Science and Human Oncology, University of Bari Medical School, Bari, Italy
|
88
|
Serra A, Fratello M, Cattelani L, Liampa I, Melagraki G, Kohonen P, Nymark P, Federico A, Kinaret PAS, Jagiello K, Ha MK, Choi JS, Sanabria N, Gulumian M, Puzyn T, Yoon TH, Sarimveis H, Grafström R, Afantitis A, Greco D. Transcriptomics in Toxicogenomics, Part III: Data Modelling for Risk Assessment. Nanomaterials (Basel) 2020; 10:E708. [PMID: 32276469 PMCID: PMC7221955 DOI: 10.3390/nano10040708] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 03/10/2020] [Revised: 03/25/2020] [Accepted: 03/26/2020] [Indexed: 12/30/2022]
Abstract
Transcriptomics data are relevant to address a number of challenges in Toxicogenomics (TGx). After careful planning of exposure conditions and data preprocessing, the TGx data can be used in predictive toxicology, where more advanced modelling techniques are applied. The large volume of molecular profiles produced by omics-based technologies allows the development and application of artificial intelligence (AI) methods in TGx. Indeed, the publicly available omics datasets are constantly increasing together with a plethora of different methods that are made available to facilitate their analysis, interpretation and the generation of accurate and stable predictive models. In this review, we present the state-of-the-art of data modelling applied to transcriptomics data in TGx. We show how the benchmark dose (BMD) analysis can be applied to TGx data. We review read across and adverse outcome pathways (AOP) modelling methodologies. We discuss how network-based approaches can be successfully employed to clarify the mechanism of action (MOA) or specific biomarkers of exposure. We also describe the main AI methodologies applied to TGx data to create predictive classification and regression models and we address current challenges. Finally, we present a short description of deep learning (DL) and data integration methodologies applied in these contexts. Modelling of TGx data represents a valuable tool for more accurate chemical safety assessment. This review is the third part of a three-article series on Transcriptomics in Toxicogenomics.
Affiliation(s)
- Angela Serra
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
- Michele Fratello
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
- Luca Cattelani
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
- Irene Liampa
- School of Chemical Engineering, National Technical University of Athens, 157 80 Athens, Greece; (I.L.); (H.S.)
- Georgia Melagraki
- Nanoinformatics Department, NovaMechanics Ltd., Nicosia 1065, Cyprus; (G.M.); (A.A.)
- Pekka Kohonen
- Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden; (P.K.); (P.N.); (R.G.)
- Division of Toxicology, Misvik Biology, 20520 Turku, Finland
- Penny Nymark
- Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden; (P.K.); (P.N.); (R.G.)
- Division of Toxicology, Misvik Biology, 20520 Turku, Finland
- Antonio Federico
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
- Pia Anneli Sofia Kinaret
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
- Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
- Karolina Jagiello
- QSAR Lab Ltd., Aleja Grunwaldzka 190/102, 80-266 Gdansk, Poland; (K.J.); (T.P.)
- University of Gdansk, Faculty of Chemistry, Wita Stwosza 63, 80-308 Gdansk, Poland
- My Kieu Ha
- Center for Next Generation Cytometry, Hanyang University, Seoul 04763, Korea; (M.K.H.); (J.-S.C.); (T.-H.Y.)
- Department of Chemistry, College of Natural Sciences, Hanyang University, Seoul 04763, Korea
- Institute of Next Generation Material Design, Hanyang University, Seoul 04763, Korea
- Jang-Sik Choi
- Center for Next Generation Cytometry, Hanyang University, Seoul 04763, Korea; (M.K.H.); (J.-S.C.); (T.-H.Y.)
- Department of Chemistry, College of Natural Sciences, Hanyang University, Seoul 04763, Korea
- Institute of Next Generation Material Design, Hanyang University, Seoul 04763, Korea
- Natasha Sanabria
- National Institute for Occupational Health, Johannesburg 30333, South Africa; (N.S.); (M.G.)
- Mary Gulumian
- National Institute for Occupational Health, Johannesburg 30333, South Africa; (N.S.); (M.G.)
- Haematology and Molecular Medicine Department, School of Pathology, University of the Witwatersrand, Johannesburg 2050, South Africa
- Tomasz Puzyn
- QSAR Lab Ltd., Aleja Grunwaldzka 190/102, 80-266 Gdansk, Poland; (K.J.); (T.P.)
- University of Gdansk, Faculty of Chemistry, Wita Stwosza 63, 80-308 Gdansk, Poland
- Tae-Hyun Yoon
- Center for Next Generation Cytometry, Hanyang University, Seoul 04763, Korea; (M.K.H.); (J.-S.C.); (T.-H.Y.)
- Department of Chemistry, College of Natural Sciences, Hanyang University, Seoul 04763, Korea
- Institute of Next Generation Material Design, Hanyang University, Seoul 04763, Korea
- Haralambos Sarimveis
- School of Chemical Engineering, National Technical University of Athens, 157 80 Athens, Greece; (I.L.); (H.S.)
- Roland Grafström
- Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden; (P.K.); (P.N.); (R.G.)
- Division of Toxicology, Misvik Biology, 20520 Turku, Finland
- Antreas Afantitis
- Nanoinformatics Department, NovaMechanics Ltd., Nicosia 1065, Cyprus; (G.M.); (A.A.)
- Dario Greco
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
- Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
|
89
|
Mallik S, Zhao Z. Graph- and rule-based learning algorithms: a comprehensive review of their applications for cancer type classification and prognosis using genomic data. Brief Bioinform 2020; 21:368-394. [PMID: 30649169 PMCID: PMC7373185 DOI: 10.1093/bib/bby120] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/27/2018] [Revised: 10/26/2018] [Accepted: 11/21/2018] [Indexed: 12/20/2022]
Abstract
Cancer is well recognized as a complex disease with dysregulated molecular networks or modules. Graph- and rule-based analytics have been applied extensively for cancer classification as well as prognosis using large genomic and other data over the past decade. This article provides a comprehensive review of various graph- and rule-based machine learning algorithms that have been applied to numerous genomics data to determine the cancer-specific gene modules, identify gene signature-based classifiers and carry out other related objectives of potential therapeutic value. This review focuses mainly on the methodological design and features of these algorithms to facilitate the application of these graph- and rule-based analytical approaches for cancer classification and prognosis. Based on the type of data integration, we divided all the algorithms into three categories: model-based integration, pre-processing integration and post-processing integration. Each category is further divided into four sub-categories (supervised, unsupervised, semi-supervised and survival-driven learning analyses) based on learning style. Therefore, a total of 11 categories of methods are summarized with their inputs, objectives and description, advantages and potential limitations. Next, we briefly demonstrate well-known and most recently developed algorithms for each sub-category along with salient information, such as data profiles, statistical or feature selection methods and outputs. Finally, we summarize the appropriate use and efficiency of all categories of graph- and rule mining-based learning methods when input data and specific objective are given. This review aims to help readers to select and use the appropriate algorithms for cancer classification and prognosis study.
Affiliation(s)
- Saurav Mallik
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center, Houston
- Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center, Houston
|
90
|
Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics Data Integration, Interpretation, and Its Application. Bioinform Biol Insights 2020; 14:1177932219899051. [PMID: 32076369 PMCID: PMC7003173 DOI: 10.1177/1177932219899051] [Citation(s) in RCA: 683] [Impact Index Per Article: 136.6] [Received: 11/07/2019] [Accepted: 11/09/2019] [Indexed: 12/22/2022]
Abstract
To study complex biological processes holistically, it is imperative to take an integrative approach that combines multi-omics data to highlight the interrelationships of the involved biomolecules and their functions. With the advent of high-throughput techniques and the availability of multi-omics data generated from a large set of samples, several promising tools and methods have been developed for data integration and interpretation. In this review, we collected the tools and methods that adopt an integrative approach to analyze multiple omics data and summarized their ability to address applications such as disease subtyping, biomarker prediction, and deriving insights into the data. We provide the methodology, use cases, and limitations of these tools; a brief account of multi-omics data repositories and visualization portals; and the challenges associated with multi-omics data integration.
Affiliation(s)
- Abhay Jere
- Innovation Cell, Ministry of Human Resource Development, New Delhi, India
|
91
|
Zhang L, Zhang S. Learning common and specific patterns from data of multiple interrelated biological scenarios with matrix factorization. Nucleic Acids Res 2019; 47:6606-6617. [PMID: 31175825 PMCID: PMC6649783 DOI: 10.1093/nar/gkz488] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 03/10/2019] [Revised: 05/11/2019] [Accepted: 05/22/2019] [Indexed: 11/18/2022]
Abstract
High-throughput biological technologies (e.g. ChIP-seq, RNA-seq and single-cell RNA-seq) rapidly accelerate the accumulation of genome-wide omics data in diverse interrelated biological scenarios (e.g. cells, tissues and conditions). Integration and differential analysis are two common paradigms for exploring and analyzing such data. However, current integrative methods usually ignore the differential part, and typical differential analysis methods either fail to identify combinatorial patterns of difference or require matched dimensions of the data. Here, we propose a flexible framework CSMF to combine them into one paradigm to simultaneously reveal Common and Specific patterns via Matrix Factorization from data generated under interrelated biological scenarios. We demonstrate the effectiveness of CSMF with four representative applications including pairwise ChIP-seq data describing the chromatin modification map between K562 and Huvec cell lines; pairwise RNA-seq data representing the expression profiles of two different cancers; RNA-seq data of three breast cancer subtypes; and single-cell RNA-seq data of human embryonic stem cell differentiation at six time points. Extensive analysis yields novel insights into hidden combinatorial patterns in these multi-modal data. Results demonstrate that CSMF is a powerful tool to uncover common and specific patterns with significant biological implications from data of interrelated biological scenarios.
Affiliation(s)
- Lihua Zhang
- NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Shihua Zhang
- NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
|
92
|
Park SJ, Onizuka S, Seki M, Suzuki Y, Iwata T, Nakai K. A systematic sequencing-based approach for microbial contaminant detection and functional inference. BMC Biol 2019; 17:72. [PMID: 31519179 PMCID: PMC6743104 DOI: 10.1186/s12915-019-0690-0] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 04/25/2019] [Accepted: 08/20/2019] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Microbial contamination poses a major difficulty for successful data analysis in biological and biomedical research. Computational approaches utilizing next-generation sequencing (NGS) data offer promising diagnostics to assess the presence of contaminants. However, as host cells are often contaminated by multiple microorganisms, these approaches require careful attention to intra- and interspecies sequence similarities, which have not yet been fully addressed. RESULTS We present a computational approach that rigorously investigates the genomic origins of sequenced reads, including those mapped to multiple species that have been discarded in previous studies. Through the analysis of large-scale synthetic and public NGS samples, we estimate that 1000-100,000 contaminating microbial reads are detected per million host reads sequenced by RNA-seq. The microbe catalog we established included Cutibacterium as a prevalent contaminant, suggesting that contamination mostly originates from the laboratory environment. Importantly, by applying a systematic method to infer the functional impact of contamination, we revealed that host-contaminant interactions cause profound changes in the host molecular landscapes, as exemplified by changes in inflammatory and apoptotic pathways during Mycoplasma infection of lymphoma cells. CONCLUSIONS We provide a computational method for profiling microbial contamination on NGS data and suggest that sources of contamination in laboratory reagents and the experimental environment alter the molecular landscape of host cells leading to phenotypic changes. These findings reinforce the concept that precise determination of the origins and functional impacts of contamination is imperative for quality research and illustrate the usefulness of the proposed approach to comprehensively characterize contamination landscapes.
Affiliation(s)
- Sung-Joon Park
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, 108-8693, Japan
- Satoru Onizuka
- Institute of Advanced Biomedical Engineering and Science, Tokyo Women's Medical University, Tokyo, 162-8666, Japan
- Division of Periodontology, Department of Oral Function, Kyushu Dental University, Fukuoka, 803-8580, Japan
- Masahide Seki
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, 277-8568, Japan
- Yutaka Suzuki
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, 277-8568, Japan
- Takanori Iwata
- Institute of Advanced Biomedical Engineering and Science, Tokyo Women's Medical University, Tokyo, 162-8666, Japan
- Department of Periodontology, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University, Tokyo, 113-8549, Japan
- Kenta Nakai
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, 108-8693, Japan
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, 277-8568, Japan
|
93
|
Abbas-Aghababazadeh F, Mo Q, Fridley BL. Statistical genomics in rare cancer. Semin Cancer Biol 2019; 61:1-10. [PMID: 31437624 DOI: 10.1016/j.semcancer.2019.08.021] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 06/28/2019] [Revised: 08/14/2019] [Accepted: 08/17/2019] [Indexed: 12/26/2022]
Abstract
Rare cancers make up more than 20% of cancer cases. Because of their rarity, less research has been conducted on rare cancers, resulting in worse outcomes for patients with rare cancers compared to common cancers. The ability to study rare cancers is impaired by the difficulty of collecting a large enough set of patients to complete an adequately powered genomic study. In this manuscript we outline analytical approaches and public genomic datasets that have been used in genomic studies of rare cancers. These statistical analysis approaches and study designs include: gene set / pathway analyses, pedigree and consortium studies, meta-analysis or horizontal integration, and integration of multiple types of genomic information or vertical integration. We also discuss some of the publicly available resources that can be leveraged in rare cancer genomic studies.
Affiliation(s)
- Qianxing Mo
- Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, Tampa, FL, 33612, USA
- Brooke L Fridley
- Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, Tampa, FL, 33612, USA
|
94
|
Su L, Liu G, Wang J, Xu D. A rectified factor network based biclustering method for detecting cancer-related coding genes and miRNAs, and their interactions. Methods 2019; 166:22-30. [PMID: 31121299 PMCID: PMC6708461 DOI: 10.1016/j.ymeth.2019.05.010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 11/16/2018] [Revised: 04/14/2019] [Accepted: 05/13/2019] [Indexed: 12/12/2022] Open
Abstract
Detecting cancer-related genes and their interactions is a crucial task in cancer research. For this purpose, we proposed an efficient method to detect coding genes, microRNAs (miRNAs), and their interactions related to a particular cancer or cancer subtype using their expression data from the same set of samples. Firstly, biclusters specific to a particular type of cancer are detected based on rectified factor networks and ranked according to their associations with general cancers. Secondly, coding genes and miRNAs in each bicluster are prioritized by considering their differential expression and differential correlation values, protein-protein interaction data, and potential cancer markers. Finally, a rank fusion process is used to obtain the final comprehensive rank by combining multiple ranking results. We applied our proposed method to breast cancer datasets. Results show that our method outperforms other methods in detecting breast cancer-related coding genes and miRNAs. Furthermore, our method is computationally efficient: it can handle tens of thousands of genes/miRNAs and hundreds of patients in hours on a desktop. This work may aid researchers in studying the genetic architecture of complex diseases and in improving the accuracy of diagnosis.
Affiliation(s)
- Lingtao Su
- Department of Computer Science and Technology, Jilin University, Changchun 130012, China; Department of Electrical Engineering & Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Guixia Liu
- Department of Computer Science and Technology, Jilin University, Changchun 130012, China
- Juexin Wang
- Department of Electrical Engineering & Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Dong Xu
- Department of Electrical Engineering & Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
|
95
|
Wilk G, Braun R. Integrative analysis reveals disrupted pathways regulated by microRNAs in cancer. Nucleic Acids Res 2019; 46:1089-1101. [PMID: 29294105 PMCID: PMC5814839 DOI: 10.1093/nar/gkx1250] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 08/31/2017] [Accepted: 12/01/2017] [Indexed: 02/06/2023] Open
Abstract
MicroRNAs (miRNAs) are small endogenous regulatory molecules that modulate gene expression post-transcriptionally. Although differential expression of miRNAs has been implicated in many diseases (including cancers), the underlying mechanisms of action remain unclear. Because each miRNA can target multiple genes, miRNAs may potentially have functional implications for the overall behavior of entire pathways. Here, we investigate the functional consequences of miRNA dysregulation through an integrative analysis of miRNA and mRNA expression data using a novel approach that incorporates pathway information a priori. By searching for miRNA-pathway associations that differ between healthy and tumor tissue, we identify specific relationships at the systems level which are disrupted in cancer. Our approach is motivated by the hypothesis that if an miRNA and pathway are associated, then the expression of the miRNA and the collective behavior of the genes in a pathway will be correlated. As such, we first obtain an expression-based summary of pathway activity using Isomap, a dimension reduction method which can articulate non-linear structure in high-dimensional data. We then search for miRNAs that exhibit differential correlations with the pathway summary between phenotypes as a means of finding aberrant miRNA-pathway coregulation in tumors. We apply our method to cancer data using gene and miRNA expression datasets from The Cancer Genome Atlas and compare ∼10^5 miRNA-pathway relationships between healthy and tumor samples from four tissues (breast, prostate, lung and liver). Many of the flagged pairs we identify have a biological basis for disruption in cancer.
Affiliation(s)
- Gary Wilk
- Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA
- Rosemary Braun
- Biostatistics Division, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA; Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, IL 60208, USA
|
96
|
Welch JD, Kozareva V, Ferreira A, Vanderburg C, Martin C, Macosko EZ. Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell 2019; 177:1873-1887.e17. [PMID: 31178122 PMCID: PMC6716797 DOI: 10.1016/j.cell.2019.05.006] [Citation(s) in RCA: 721] [Impact Index Per Article: 120.2] [Received: 11/02/2018] [Revised: 02/21/2019] [Accepted: 04/30/2019] [Indexed: 02/07/2023]
Abstract
Defining cell types requires integrating diverse single-cell measurements from multiple experiments and biological contexts. To flexibly model single-cell datasets, we developed LIGER, an algorithm that delineates shared and dataset-specific features of cell identity. We applied it to four diverse and challenging analyses of human and mouse brain cells. First, we defined region-specific and sexually dimorphic gene expression in the mouse bed nucleus of the stria terminalis. Second, we analyzed expression in the human substantia nigra, comparing cell states in specific donors and relating cell types to those in the mouse. Third, we integrated in situ and single-cell expression data to spatially locate fine subtypes of cells present in the mouse frontal cortex. Finally, we jointly defined mouse cortical cell types using single-cell RNA-seq and DNA methylation profiles, revealing putative mechanisms of cell-type-specific epigenomic regulation. Integrative analyses using LIGER promise to accelerate investigations of cell-type definition, gene regulation, and disease states.
Affiliation(s)
- Joshua D Welch
- Broad Institute of Harvard and MIT, Stanley Center for Psychiatric Research, 450 Main Street, Cambridge, MA, USA
- Velina Kozareva
- Broad Institute of Harvard and MIT, Stanley Center for Psychiatric Research, 450 Main Street, Cambridge, MA, USA
- Ashley Ferreira
- Broad Institute of Harvard and MIT, Stanley Center for Psychiatric Research, 450 Main Street, Cambridge, MA, USA
- Charles Vanderburg
- Broad Institute of Harvard and MIT, Stanley Center for Psychiatric Research, 450 Main Street, Cambridge, MA, USA
- Carly Martin
- Broad Institute of Harvard and MIT, Stanley Center for Psychiatric Research, 450 Main Street, Cambridge, MA, USA
- Evan Z Macosko
- Broad Institute of Harvard and MIT, Stanley Center for Psychiatric Research, 450 Main Street, Cambridge, MA, USA; Massachusetts General Hospital, Department of Psychiatry, 55 Fruit Street, Boston, MA, USA
|
97
|
Esposito F, Gillis N, Del Buono N. Orthogonal joint sparse NMF for microarray data analysis. J Math Biol 2019; 79:223-247. [PMID: 31004215 DOI: 10.1007/s00285-019-01355-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 03/22/2018] [Revised: 03/29/2019] [Indexed: 12/20/2022]
Abstract
3D microarrays, generally known as gene-sample-time microarrays, couple the information on different time points collected by 2D microarrays that measure gene expression levels among different samples. Their analysis is useful in several biomedical applications, like monitoring dose or drug treatment responses of patients over time in pharmacogenomics studies. Many statistical and data analysis tools have been used to extract useful information. In particular, nonnegative matrix factorization (NMF), with its natural nonnegativity constraints, has demonstrated its ability to extract from 2D microarrays relevant information on specific genes involved in a particular biological process. In this paper, we propose a new NMF model, namely Orthogonal Joint Sparse NMF, to extract relevant information from 3D microarrays containing the time evolution of a 2D microarray, by adding constraints that enforce important biological properties useful for further biological analysis. We develop multiplicative update rules that decrease the objective function monotonically, and compare our approach to state-of-the-art NMF algorithms on both synthetic and real data sets.
Affiliation(s)
- Flavia Esposito
- Department of Mathematics, University of Bari Aldo Moro, via E. Orabona 4, 70125, Bari, Italy; INDAM Research Group GNCS, Roma, Italy
- Nicolas Gillis
- Department of Mathematics and Operational Research, Université de Mons, Rue de Houdain 9, 7000, Mons, Belgium
- Nicoletta Del Buono
- Department of Mathematics, University of Bari Aldo Moro, via E. Orabona 4, 70125, Bari, Italy; INDAM Research Group GNCS, Roma, Italy
|
98
|
Li Y, Wu FX, Ngom A. A review on machine learning principles for multi-view biological data integration. Brief Bioinform 2019; 19:325-340. [PMID: 28011753 DOI: 10.1093/bib/bbw113] [Citation(s) in RCA: 135] [Impact Index Per Article: 22.5] [Received: 07/26/2016] [Indexed: 01/08/2023] Open
Abstract
Driven by high-throughput sequencing techniques, modern genomic and clinical studies are in strong need of integrative machine learning models that make better use of vast volumes of heterogeneous information for the deep understanding of biological systems and the development of predictive models. How data from multiple sources (called multi-view data) are incorporated in a learning system is a key step for successful analysis. In this article, we provide a comprehensive review of omics and clinical data integration techniques, from a machine learning perspective, for various analyses such as prediction, clustering, dimension reduction and association. We show that Bayesian models are able to use prior information and model measurements with various distributions; tree-based methods can either build a tree with all features or collectively make a final decision based on trees learned from each view; kernel methods fuse the similarity matrices learned from individual views together for a final similarity matrix or learning model; network-based fusion methods are capable of inferring direct and indirect associations in a heterogeneous network; matrix factorization models have the potential to learn interactions among features from different views; and a range of deep neural networks can be integrated in multi-modal learning for capturing the complex mechanisms of biological systems.
Affiliation(s)
- Yifeng Li
- Information and Communications Technologies, National Research Council Canada, Ottawa, Ontario, Canada
- Fang-Xiang Wu
- Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
- Alioune Ngom
- School of Computer Science, University of Windsor, Windsor, Ontario, Canada
|
99
|
Systematical Identification of Breast Cancer-Related Circular RNA Modules for Deciphering circRNA Functions Based on the Non-Negative Matrix Factorization Algorithm. Int J Mol Sci 2019; 20:ijms20040919. [PMID: 30791568 PMCID: PMC6412941 DOI: 10.3390/ijms20040919] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 11/21/2018] [Revised: 02/03/2019] [Accepted: 02/12/2019] [Indexed: 01/22/2023] Open
Abstract
Circular RNA (circRNA), a special kind of endogenous RNA, has been shown to be implicated in crucial biological processes of multiple cancers as a gene regulator. However, the functional roles of circRNAs in breast cancer (BC) remain poorly explored, and relatively incomplete knowledge of circRNAs hampers the identification and prediction of BC-related circRNAs. To this end, we developed a systematic approach to identify circRNA modules in the BC context by integrating circRNA, mRNA, miRNA, and pathway data based on a non-negative matrix factorization (NMF) algorithm. Thirteen circRNA modules were uncovered by our approach, containing 4164 nodes (80 circRNAs, 2703 genes, 63 miRNAs and 1318 pathways) and 67,959 edges in total. GO (Gene Ontology) function screening identified nine circRNA functional modules with 44 circRNAs. Among them, 31 circRNAs in eight modules having direct relationships with known BC-related genes, miRNAs or disease-related pathways were selected as BC candidate circRNAs. Functional enrichment results showed that they were closely related to BC-associated pathways, such as ‘KEGG (Kyoto Encyclopedia of Genes and Genomes) PATHWAYS IN CANCER’, ‘REACTOME IMMUNE SYSTEM’, ‘KEGG MAPK SIGNALING PATHWAY’, ‘KEGG P53 SIGNALING PATHWAY’ and ‘KEGG WNT SIGNALING PATHWAY’, and could serve as potential circRNA biomarkers in BC. Comparison results showed that our approach could identify more BC-related functional circRNA modules. In summary, we proposed a novel systematic approach that depends on known disease information for mRNAs, miRNAs and pathways to identify BC-related circRNA modules, which could help identify BC-related circRNAs and benefit treatment and prognosis for BC patients.
|
100
|
Yang Z, Michailidis G. Quantifying heterogeneity of expression data based on principal components. Bioinformatics 2019; 35:553-559. [PMID: 30060088 DOI: 10.1093/bioinformatics/bty671] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/24/2017] [Revised: 07/05/2018] [Accepted: 07/27/2018] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION The diversity of biological omics data provides richness of information, but also presents an analytic challenge. While there has been much methodological and theoretical development on the statistical handling of large volumes of biological data, far less attention has been devoted to characterizing their veracity and variability. RESULTS We propose a method of statistically quantifying heterogeneity among multiple groups of datasets, derived from different omics modalities over various experimental and/or disease conditions. It draws upon strategies from analysis of variance and principal component analysis in order to reduce dimensionality of the variability across multiple data groups. The resulting hypothesis-based inference procedure is demonstrated with synthetic and real data from a cell line study of growth factor responsiveness based on a factorial experimental design. AVAILABILITY AND IMPLEMENTATION Source code and datasets are freely available at https://github.com/yangzi4/gPCA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zi Yang
- Department of Statistics, University of Michigan, Ann Arbor, MI, USA
|