1
Xiao H, Wang J, Wan S. WIMOAD: Weighted Integration of Multi-Omics data for Alzheimer's Disease (AD) Diagnosis. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.09.25.614862. [PMID: 39386613 PMCID: PMC11463407 DOI: 10.1101/2024.09.25.614862] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/12/2024]
Abstract
As the most common subtype of dementia, Alzheimer's disease (AD) is characterized by a progressive decline in cognitive functions, especially memory, thinking, and reasoning. Early diagnosis and intervention make it possible to slow further progression of the disease and protect individuals from severe decline in brain function. The current framework for AD diagnosis depends on detecting A/T/(N) biomarkers in cerebrospinal fluid or brain imaging data, which are invasive and expensive to acquire. Moreover, the pathophysiological changes of AD accumulate in amino acids, metabolism, neuroinflammation, etc., resulting in heterogeneity among newly registered patients. Recently, next-generation sequencing (NGS) technologies have been found to be a non-invasive, efficient and less costly alternative for AD screening. However, most existing studies rely on a single omics type. To address these concerns, we introduce WIMOAD, a weighted integration of multi-omics data for AD diagnosis. WIMOAD leverages specialized classifiers for patients' paired gene expression and methylation data for multi-stage classification. The resulting scores are then stacked with MLP-based meta-models for performance improvement. The predictions of the two distinct meta-models are integrated with optimized weights for the final decision, and WIMOAD achieves significantly higher performance than using single omics alone in the classification tasks. The model's overall performance also exceeds that of most existing approaches, highlighting its ability to discern intricate patterns in multi-omics data and their correlations with clinical diagnosis results.
In addition, WIMOAD is a biologically interpretable model: it uses SHapley Additive exPlanations (SHAP) to elucidate the contribution of each gene from each omics to the model output. We believe WIMOAD is a very promising tool for accurate AD diagnosis and effective biomarker discovery across different progression stages, which will eventually have consequential impacts on early treatment intervention and personalized therapy design for AD.
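The final weighting step described in this abstract can be illustrated with a minimal sketch. It assumes each meta-model outputs a per-sample probability of AD and that a single weight is chosen by grid search on validation accuracy; the function names, the grid, the 0.5 threshold and the toy scores are all hypothetical and are not taken from the WIMOAD implementation.

```python
import numpy as np

def fuse_scores(p_expr, p_meth, w):
    """Weighted average of two per-sample probability vectors (w in [0, 1])."""
    return w * p_expr + (1.0 - w) * p_meth

def best_weight(p_expr, p_meth, y_true, grid=None):
    """Grid-search the weight that maximizes accuracy on validation labels."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 11)  # assumed search grid: 0.0, 0.1, ..., 1.0
    best_w, best_acc = 0.0, -1.0
    for w in grid:
        pred = (fuse_scores(p_expr, p_meth, w) >= 0.5).astype(int)
        acc = float(np.mean(pred == y_true))
        if acc > best_acc:  # keep the first weight that reaches the best accuracy
            best_w, best_acc = float(w), acc
    return best_w, best_acc

# Toy validation scores (made up for illustration only).
y_val = np.array([1, 0, 1, 0, 1, 0])
p_expr = np.array([0.9, 0.2, 0.4, 0.3, 0.8, 0.6])   # expression meta-model
p_meth = np.array([0.6, 0.4, 0.7, 0.1, 0.35, 0.2])  # methylation meta-model

w, acc = best_weight(p_expr, p_meth, y_val)
acc_expr = float(np.mean((p_expr >= 0.5).astype(int) == y_val))
acc_meth = float(np.mean((p_meth >= 0.5).astype(int) == y_val))
```

On this toy data each single-omics score misclassifies at least one sample, while an intermediate weight classifies all six correctly, which is the kind of gain a weighted integration is meant to capture.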
Affiliation(s)
- Hanyu Xiao
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, United States, 68198
- Jieqiong Wang
  - Department of Neurological Sciences, University of Nebraska Medical Center, Omaha, NE, United States, 68198
- Shibiao Wan
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, United States, 68198
2
Bao J, Chang C, Zhang Q, Saykin AJ, Shen L, Long Q, for the Alzheimer’s Disease Neuroimaging Initiative. Integrative analysis of multi-omics and imaging data with incorporation of biological information via structural Bayesian factor analysis. Brief Bioinform 2023; 24:bbad073. [PMID: 36882008 PMCID: PMC10387302 DOI: 10.1093/bib/bbad073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2022] [Revised: 01/14/2023] [Accepted: 02/10/2023] [Indexed: 03/09/2023]
Abstract
MOTIVATION With the rapid development of modern technologies, massive data are available for the systematic study of Alzheimer's disease (AD). Though many existing AD studies mainly focus on single-modality omics data, multi-omics datasets can provide a more comprehensive understanding of AD. To bridge this gap, we proposed a novel structural Bayesian factor analysis framework (SBFA) to extract the information shared by multi-omics data through the aggregation of genotyping data, gene expression data, neuroimaging phenotypes and prior biological network knowledge. Our approach can extract common information shared by different modalities and encourage biologically related features to be selected, guiding future AD research in a biologically meaningful way. METHOD Our SBFA model decomposes the mean parameters of the data into a sparse factor loading matrix and a factor matrix, where the factor matrix represents the common information extracted from multi-omics and imaging data. Our framework is designed to incorporate prior biological network information. Our simulation study demonstrated that the proposed SBFA framework achieved the best performance among state-of-the-art factor-analysis-based integrative analysis methods. RESULTS We apply our proposed SBFA model together with several state-of-the-art factor analysis models to extract the latent common information from genotyping, gene expression and brain imaging data simultaneously from the ADNI biobank database. The latent information is then used to predict the functional activities questionnaire score, an important measurement for diagnosis of AD that quantifies subjects' abilities in daily life. Our SBFA model shows the best prediction performance compared with the other factor analysis models. AVAILABILITY Code is publicly available at https://github.com/JingxuanBao/SBFA. CONTACT qlong@upenn.edu.
Affiliation(s)
- Jingxuan Bao
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, 19104, PA, USA
- Changgee Chang
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, 19104, PA, USA
- Qiyiwen Zhang
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, 19104, PA, USA
- Andrew J Saykin
  - Department of Radiology and Imaging Sciences, Indiana University, Indianapolis, 46202, IN, USA
- Li Shen
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, 19104, PA, USA
- Qi Long
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, 19104, PA, USA
3
Mallik S, Sarkar A, Nath S, Maulik U, Das S, Pati SK, Ghosh S, Zhao Z. 3PNMF-MKL: A non-negative matrix factorization-based multiple kernel learning method for multi-modal data integration and its application to gene signature detection. Front Genet 2023; 14:1095330. [PMID: 36865387 PMCID: PMC9971618 DOI: 10.3389/fgene.2023.1095330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Accepted: 01/30/2023] [Indexed: 02/16/2023]
Abstract
In the current era, handling biomedical big data is a challenging task. In particular, the integration of multi-modal data, followed by significant feature mining (gene signature detection), is a daunting task. With this in mind, we propose a novel framework, namely, three-factor penalized, non-negative matrix factorization-based multiple kernel learning with soft margin hinge loss (3PNMF-MKL), for multi-modal data integration followed by gene signature detection. In brief, limma, which employs empirical Bayes statistics, was first applied to each individual molecular profile to extract the statistically significant features; the three-factor penalized non-negative matrix factorization method was then used for data/matrix fusion on the reduced feature sets. Multiple kernel learning models with soft margin hinge loss were deployed to estimate average accuracy scores and the area under the curve (AUC). Gene modules were identified by consecutive analysis with average linkage clustering and dynamic tree cut. The module with the highest correlation was considered the potential gene signature. We utilized an acute myeloid leukemia cancer dataset from The Cancer Genome Atlas (TCGA) repository containing five molecular profiles. Our algorithm generated a 50-gene signature that achieved a high classification AUC score (viz., 0.827). We explored the functions of signature genes using pathway and Gene Ontology (GO) databases. Our method outperformed the state-of-the-art methods in terms of AUC. Furthermore, we included some comparative studies with other related methods to enhance the acceptability of our method. Finally, it should be noted that our algorithm can be applied to any multi-modal dataset for data integration, followed by gene module discovery.
Affiliation(s)
- Saurav Mallik
  - Department of Environmental Health, Harvard T H Chan School of Public Health, Boston, MA, United States
- Anasua Sarkar
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
- Sagnik Nath
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
- Ujjwal Maulik
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
- Supantha Das
  - Department of Information Technology, Academy of Technology, Hooghly, West Bengal, India
- Soumen Kumar Pati
  - Department of Bioinformatics, Maulana Abul Kalam Azad University, Kolkata, West Bengal, India
- Soumadip Ghosh
  - Department of Computer Science & Engineering, Sister Nivedita University, New Town, West Bengal, India
- Zhongming Zhao
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Correspondence: Saurav Mallik; Zhongming Zhao
4
Mahendran N, Vincent P M DR. Deep belief network-based approach for detecting Alzheimer's disease using the multi-omics data. Comput Struct Biotechnol J 2023; 21:1651-1660. [PMID: 36874164 PMCID: PMC9978469 DOI: 10.1016/j.csbj.2023.02.021] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/28/2022] [Revised: 02/10/2023] [Accepted: 02/11/2023] [Indexed: 02/15/2023]
Abstract
Alzheimer's disease (AD) is the most uncertain form of dementia in terms of its underlying mechanism. No single decisive genetic factor has been linked to AD. In the past, there were no reliable techniques and methods to identify the genetic risk factors associated with AD, and most of the available data came from brain images. Recently, however, there have been drastic advancements in high-throughput techniques in bioinformatics, which have led to focused research on discovering the genetic risk factors that cause AD. Recent analysis has produced considerable prefrontal cortex data with which classification and prediction models can be developed for AD. We have developed a Deep Belief Network-based prediction model using DNA methylation and gene expression microarray data, which suffer from High Dimension Low Sample Size (HDLSS) issues. To overcome the HDLSS challenge, we performed a two-layer feature selection that also considers the biological aspects of the features. In the first layer, the differentially expressed genes and differentially methylated positions are identified, and both datasets are then combined using the Jaccard similarity measure. In the second layer, an ensemble-based feature selection approach is applied to further narrow down the gene selection. The results show that the proposed feature selection technique outperforms commonly used feature selection techniques, such as Support Vector Machine Recursive Feature Elimination (SVM-RFE) and Correlation-based Feature Selection (CBS). Furthermore, the Deep Belief Network-based prediction model performs better than widely used machine learning models. The multi-omics dataset also shows promising results compared to single omics.
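As a rough illustration of the Jaccard similarity measure mentioned in this abstract, the sketch below scores the overlap between a set of differentially expressed genes and a set of genes mapped from differentially methylated positions. The abstract does not say how the measure is applied when combining the datasets, so the helper `jaccard`, the placeholder gene names and the use of the set intersection as the combined feature set are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of identifiers."""
    a, b = set(a), set(b)
    union = a | b
    # By convention here, two empty sets get similarity 0.0.
    return len(a & b) / len(union) if union else 0.0

# Placeholder identifiers, not results from the cited study.
deg_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}  # differentially expressed genes
dmp_genes = {"GENE_B", "GENE_C", "GENE_E"}            # genes mapped from methylated positions

score = jaccard(deg_genes, dmp_genes)   # 2 shared genes out of 5 distinct genes
shared = sorted(deg_genes & dmp_genes)  # candidate features supported by both omics
```

A low score indicates that the two omics layers point to largely different genes, in which case keeping only the intersection would discard most candidates.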
Affiliation(s)
- Nivedhitha Mahendran
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Durai Raj Vincent P M
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
5
Flores JE, Claborne DM, Weller ZD, Webb-Robertson BJM, Waters KM, Bramer LM. Missing data in multi-omics integration: Recent advances through artificial intelligence. Front Artif Intell 2023; 6:1098308. [PMID: 36844425 PMCID: PMC9949722 DOI: 10.3389/frai.2023.1098308] [Citation(s) in RCA: 40] [Impact Index Per Article: 20.0] [Received: 11/14/2022] [Accepted: 01/23/2023] [Indexed: 02/11/2023]
Abstract
Biological systems function through complex interactions between various 'omics (biomolecules), and a more complete understanding of these systems is only possible through an integrated, multi-omic perspective. This has presented the need for the development of integration approaches that are able to capture the complex, often non-linear, interactions that define these biological systems and are adapted to the challenges of combining the heterogeneous data across 'omic views. A principal challenge to multi-omic integration is missing data, because not all biomolecules are measured in all samples. Due to cost, instrument sensitivity, or other experimental factors, data for a biological sample may be missing for one or more 'omic technologies. Recent methodological developments in artificial intelligence and statistical learning have greatly facilitated the analyses of multi-omics data; however, many of these techniques assume access to completely observed data. A subset of these methods incorporate mechanisms for handling partially observed samples, and these methods are the focus of this review. We describe recently developed approaches, noting their primary use cases and highlighting each method's approach to handling missing data. We additionally provide an overview of the more traditional missing data workflows and their limitations, and we discuss potential avenues for further developments as well as how the missing data issue and its current solutions may generalize beyond the multi-omics context.
Affiliation(s)
- Javier E. Flores
  - Pacific Northwest National Laboratory, Biological Sciences Division, Earth and Biological Sciences Directorate, Richland, WA, United States
- Daniel M. Claborne
  - Pacific Northwest National Laboratory, Artificial Intelligence and Data Analytics Division, National Security Directorate, Richland, WA, United States
- Zachary D. Weller
  - Pacific Northwest National Laboratory, Artificial Intelligence and Data Analytics Division, National Security Directorate, Richland, WA, United States
- Bobbie-Jo M. Webb-Robertson
  - Pacific Northwest National Laboratory, Biological Sciences Division, Earth and Biological Sciences Directorate, Richland, WA, United States
- Katrina M. Waters
  - Pacific Northwest National Laboratory, Biological Sciences Division, Earth and Biological Sciences Directorate, Richland, WA, United States
- Lisa M. Bramer
  - Pacific Northwest National Laboratory, Biological Sciences Division, Earth and Biological Sciences Directorate, Richland, WA, United States
6
Jihad M, Yet İ. Multiomics Integration at Single-Cell Resolution Using Bayesian Networks: A Case Study in Hepatocellular Carcinoma. OMICS : A JOURNAL OF INTEGRATIVE BIOLOGY 2023; 27:24-33. [PMID: 36602810 DOI: 10.1089/omi.2022.0170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/06/2023]
Abstract
Multiomics data integration is one of the leading frontiers of complex disease research and integrative biology. The advances in single-cell sequencing technologies offer yet another crucial dimension in multiomics research. The single-cell studies enable the study and integration of multiomics data simultaneously in the same cell. We report in this study multiomics data integration in single-cell resolution using Bayesian networks (BNs) in a case study of hepatocellular carcinoma (HCC). A BN encodes the conditional dependencies/independencies of variables using a graphical model with an accompanying joint probability. RNA-seq and Reduced Representation Bisulfite Sequencing data were analyzed separately, and copy number variations were estimated by the hidden Markov model method. Several BN models were constructed to reveal omics' causal and associational relationships. These methods were subjected to a validation study using an independent data set. We show the heterogeneity of the multiple cellular layers of HCC at single-cell omics resolution by identifying best-fitted BN models of 295 genes. We also provide novel insights into the multiomics mechanistic relationships in the human lymphocyte antigen class I genes in HCC. To the best of our knowledge, this is the first study to focus on integrating omics data using a machine learning algorithm, BNs, at the single-cell resolution using a case study of HCC.
Affiliation(s)
- Muntadher Jihad
  - Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
- İdil Yet
  - Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
7
Alfatemi A, Peng H, Rong W, Zhang B, Cai H. Patient subgrouping with distinct survival rates via integration of multiomics data on a Grassmann manifold. BMC Med Inform Decis Mak 2022; 22:190. [PMID: 35870923 PMCID: PMC9308936 DOI: 10.1186/s12911-022-01938-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2022] [Accepted: 07/15/2022] [Indexed: 11/10/2022]
Abstract
Background Patient subgroups are important for easily understanding a disease and for providing precise yet personalized treatment through multiple omics dataset integration. Multiomics datasets are produced daily. Thus, the fusion of heterogeneous big data into intrinsic structures is an urgent problem. Novel mathematical methods are needed to process these data in a straightforward way. Results We developed a novel method for subgrouping patients with distinct survival rates via the integration of multiple omics datasets and by using principal component analysis to reduce the high data dimensionality. Then, we constructed similarity graphs for patients, merged the graphs in a subspace, and analyzed them on a Grassmann manifold. The proposed method could identify patient subgroups that had not been reported previously by selecting the most critical information during the merging at each level of the omics dataset. Our method was tested on empirical multiomics datasets from The Cancer Genome Atlas. Conclusion Through the integration of microRNA, gene expression, and DNA methylation data, our method accurately identified patient subgroups and achieved superior performance compared with popular methods. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-022-01938-y.
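The pipeline outlined in this abstract (PCA, per-omics patient similarity graphs, merging in a subspace) can be sketched in a few lines. The sketch below is not the authors' method: it assumes a Gaussian-kernel similarity graph, an unnormalized graph Laplacian, and one commonly described subspace-merging rule in which the projection matrices of the per-view spectral subspaces are subtracted from the summed Laplacians. The parameters `sigma`, `alpha`, `d`, `k` and the toy data are hypothetical.

```python
import numpy as np

def pca(X, d):
    """Project the rows of X onto the top-d principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

def laplacian(X, sigma=1.0):
    """Unnormalized Laplacian of a Gaussian-kernel patient similarity graph."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return np.diag(W.sum(axis=1)) - W

def bottom_eigvecs(L, k):
    """Eigenvectors of the k smallest eigenvalues (eigh sorts ascending)."""
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :k]

def merged_embedding(views, k, d=2, alpha=0.5):
    """Merge per-view spectral subspaces into one k-dimensional embedding."""
    Ls = [laplacian(pca(X, d)) for X in views]
    Us = [bottom_eigvecs(L, k) for L in Ls]
    # Summed Laplacians minus alpha times the per-view subspace projections.
    L_mod = sum(Ls) - alpha * sum(U @ U.T for U in Us)
    return bottom_eigvecs(L_mod, k)

# Toy data: 6 patients, 2 omics views, two well-separated groups (0-2 vs 3-5).
rng = np.random.default_rng(0)
centers = np.repeat([0.0, 10.0], 3)[:, None]
views = [centers + 0.1 * rng.standard_normal((6, 5)) for _ in range(2)]

emb = merged_embedding(views, k=2)
rows = emb / np.linalg.norm(emb, axis=1, keepdims=True)
cosine = rows @ rows.T  # patients in the same group should have cosine near 1
```

The rows of `emb` would normally be passed to a clustering step such as k-means to obtain the patient subgroups; here the cosine matrix is enough to see that patients from the same toy group share a direction in the merged subspace.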
8
Zhanpeng H, Jiekang W. A Multiview Clustering Method With Low-Rank and Sparsity Constraints for Cancer Subtyping. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3213-3223. [PMID: 34705654 DOI: 10.1109/tcbb.2021.3122917] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/13/2023]
Abstract
Multiomics data clustering is one of the major challenges in the field of precision medicine. Integration of multiomics data for cancer subtyping can improve the understanding on cancer and reveal systems-level insights. How to integrate multiomics data for accurate cancer subtyping is an interesting and challenging research problem. To capture the global and the local structure of omics data, a novel framework for integrating multiomics data is proposed for cancer subtyping. Multiview clustering with low-rank and sparsity constraints (MVCLRS) can measure the local similarities of samples in each omics data and obtain global consensus structures by integrating the multiomics data. The main insight provided by MVCLRS is that low-rank sparse subspace clustering for the construction of an affinity matrix can best capture the local similarities in omics data. Extensive testing is conducted on 10 real world cancer datasets with multiomics from The Cancer Genome Atlas. Compared with 10 state-of-the-art multiomics clustering algorithms, the MVCLRS performs better in the 10 cancer datasets by providing its clustering results with at least one enriched clinical label in nine of ten cancer subtypes, the most of any method.
9
Suter P, Dazert E, Kuipers J, Ng CKY, Boldanova T, Hall MN, Heim MH, Beerenwinkel N. Multi-omics subtyping of hepatocellular carcinoma patients using a Bayesian network mixture model. PLoS Comput Biol 2022; 18:e1009767. [PMID: 36067230 PMCID: PMC9481159 DOI: 10.1371/journal.pcbi.1009767] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/21/2021] [Revised: 09/16/2022] [Accepted: 07/18/2022] [Indexed: 11/18/2022]
Abstract
Comprehensive molecular characterization of cancer subtypes is essential for predicting clinical outcomes and searching for personalized treatments. We present bnClustOmics, a statistical model and computational tool for multi-omics unsupervised clustering, which serves a dual purpose: Clustering patient samples based on a Bayesian network mixture model and learning the networks of omics variables representing these clusters. The discovered networks encode interactions among all omics variables and provide a molecular characterization of each patient subgroup. We conducted simulation studies that demonstrated the advantages of our approach compared to other clustering methods in the case where the generative model is a mixture of Bayesian networks. We applied bnClustOmics to a hepatocellular carcinoma (HCC) dataset comprising genome (mutation and copy number), transcriptome, proteome, and phosphoproteome data. We identified three main HCC subtypes together with molecular characteristics, some of which are associated with survival even when adjusting for the clinical stage. Cluster-specific networks shed light on the links between genotypes and molecular phenotypes of samples within their respective clusters and suggest targets for personalized treatments.
Affiliation(s)
- Polina Suter
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Eva Dazert
  - Biozentrum, University of Basel, Basel, Switzerland
- Jack Kuipers
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Charlotte K. Y. Ng
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Department for BioMedical Research (DBMR), University of Bern, Bern, Switzerland
  - Department of Biomedicine, University Hospital Basel, University of Basel, Basel, Switzerland
  - Institute of Medical Genetics and Pathology, University Hospital Basel, University of Basel, Basel, Switzerland
- Tuyana Boldanova
  - Department of Biomedicine, University Hospital Basel, University of Basel, Basel, Switzerland
- Markus H. Heim
  - Department of Biomedicine, University Hospital Basel, University of Basel, Basel, Switzerland
  - Department of Gastroenterology and Hepatology, Clarunis, University Center for Gastrointestinal and Liver Diseases, Basel, Switzerland
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
10
Zhang X, Zhou Z, Xu H, Liu CT. Integrative clustering methods for multi-omics data. WILEY INTERDISCIPLINARY REVIEWS. COMPUTATIONAL STATISTICS 2022; 14. [PMID: 35573155 PMCID: PMC9097984 DOI: 10.1002/wics.1553] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/06/2022]
Abstract
Integrative analysis of multi-omics data has drawn much attention from the scientific community due to the technological advancements which have generated various omics data. Leveraging these multi-omics data potentially provides a more comprehensive view of the disease mechanism or biological processes. Integrative multi-omics clustering is an unsupervised integrative method specifically used to find coherent groups of samples or features by utilizing information across multi-omics data. It aims to better stratify diseases and to suggest biological mechanisms and potential targeted therapies for the diseases. However, applying integrative multi-omics clustering is both statistically and computationally challenging due to various reasons such as high dimensionality and heterogeneity. In this review, we summarized integrative multi-omics clustering methods into three general categories: concatenated clustering, clustering of clusters, and interactive clustering based on when and how the multi-omics data are processed for clustering. We further classified the methods into different approaches under each category based on the main statistical strategy used during clustering. In addition, we have provided recommended practices tailored to four real-life scenarios to help researchers to strategize their selection in integrative multi-omics clustering methods for their future studies.
Affiliation(s)
- Xiaoyu Zhang
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
- Zhenwei Zhou
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
- Hanfei Xu
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
- Ching-Ti Liu
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
11
Kang M, Ko E, Mersha TB. A roadmap for multi-omics data integration using deep learning. Brief Bioinform 2022; 23:bbab454. [PMID: 34791014 PMCID: PMC8769688 DOI: 10.1093/bib/bbab454] [Citation(s) in RCA: 138] [Impact Index Per Article: 46.0] [Received: 07/27/2021] [Revised: 09/30/2021] [Accepted: 10/05/2021] [Indexed: 12/18/2022]
Abstract
High-throughput next-generation sequencing now makes it possible to generate a vast amount of multi-omics data for various applications. These data have revolutionized biomedical research by providing a more comprehensive understanding of the biological systems and molecular mechanisms of disease development. Recently, deep learning (DL) algorithms have become one of the most promising methods in multi-omics data analysis, due to their predictive performance and capability of capturing nonlinear and hierarchical features. While integrating and translating multi-omics data into useful functional insights remain the biggest bottleneck, there is a clear trend towards incorporating multi-omics analysis in biomedical research to help explain the complex relationships between molecular layers. Multi-omics data have a role to improve prevention, early detection and prediction; monitor progression; interpret patterns and endotyping; and design personalized treatments. In this review, we outline a roadmap of multi-omics integration using DL and offer a practical perspective into the advantages, challenges and barriers to the implementation of DL in multi-omics data.
Affiliation(s)
- Mingon Kang
  - Department of Computer Science at the University of Nevada, Las Vegas, NV, USA
- Euiseong Ko
  - Department of Computer Science at the University of Nevada, Las Vegas, NV, USA
- Tesfaye B Mersha
  - Department of Pediatrics, Cincinnati Children’s Hospital Medical Center, University of Cincinnati, Cincinnati, OH, USA
12
Duan R, Gao L, Gao Y, Hu Y, Xu H, Huang M, Song K, Wang H, Dong Y, Jiang C, Zhang C, Jia S. Evaluation and comparison of multi-omics data integration methods for cancer subtyping. PLoS Comput Biol 2021; 17:e1009224. [PMID: 34383739 PMCID: PMC8384175 DOI: 10.1371/journal.pcbi.1009224] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 02/27/2021] [Revised: 08/24/2021] [Accepted: 06/28/2021] [Indexed: 11/18/2022]
Abstract
Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods has become a complicated problem due to the lack of gold standards. Moreover, questions of practical importance remain to be addressed regarding the impact of selecting appropriate data types and combinations on the performance of integrative studies. Here, we constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all the eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy (measured by combining clustering accuracy and clinical significance), robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most of the cancers studied, which may be of particular interest to researchers in omics data analysis.
Cancer is one of the most heterogeneous diseases, characterized by diverse morphological, phenotypic, and genomic profiles between tumors and their subtypes. Identifying cancer subtypes can help patients receive precise treatments. With the development of high-throughput technologies, genomics, epigenomics, and transcriptomics data have been generated for large cancer patient cohorts. It is widely believed that the more omics data are used, the more accurately cancer subtypes can be identified. To examine this assumption, we first constructed three classes of benchmarking datasets to conduct a comprehensive evaluation and comparison of ten representative multi-omics data integration methods for cancer subtyping by considering their accuracy, robustness, and computational efficiency. Then, we investigated the influence of different omics data and their various combinations on the effectiveness of cancer subtyping. Our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. We hope that our work may help researchers choose a proper method and an effective data combination when identifying cancer subtypes using data integration methods.
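The count of eleven combinations of four data types can be checked with a short enumeration. This is only a sketch under one assumption: that a "combination" means any subset of at least two omics types, which gives 6 + 4 + 1 = 11. The type names below are hypothetical placeholders, since the abstract does not list the four data types.

```python
from itertools import combinations

# Hypothetical placeholder names; the abstract does not name the four data types.
omics_types = ["omics_A", "omics_B", "omics_C", "omics_D"]


def multi_omics_combinations(types, min_size=2):
    """Return every subset of `types` that contains at least `min_size` members."""
    return [
        combo
        for size in range(min_size, len(types) + 1)
        for combo in combinations(types, size)
    ]


combos = multi_omics_combinations(omics_types)
print(len(combos))  # 11: six pairs, four triples, one quadruple
```

Under this reading, a benchmark would run each integration method once per combination, so that the effect of adding a data type can be compared against the smaller subsets that exclude it.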
Affiliation(s)
- Ran Duan: School of Computer Science and Technology, Xidian University, Xi’an, China
- Lin Gao: School of Computer Science and Technology, Xidian University, Xi’an, China
- Yong Gao: Department of Computer Science, The University of British Columbia Okanagan, Kelowna, British Columbia, Canada
- Yuxuan Hu: School of Computer Science and Technology, Xidian University, Xi’an, China
- Han Xu: School of Computer Science and Technology, Xidian University, Xi’an, China
- Mingfeng Huang: School of Computer Science and Technology, Xidian University, Xi’an, China
- Kuo Song: School of Computer Science and Technology, Xidian University, Xi’an, China
- Hongda Wang: School of Computer Science and Technology, Xidian University, Xi’an, China
- Yongqiang Dong: School of Computer Science and Technology, Xidian University, Xi’an, China
- Chaoqun Jiang: School of Computer Science and Technology, Xidian University, Xi’an, China
- Chenxing Zhang: School of Computer Science and Technology, Xidian University, Xi’an, China
- Songwei Jia: School of Computer Science and Technology, Xidian University, Xi’an, China
|
13
|
Wu M, Yi H, Ma S. Vertical integration methods for gene expression data analysis. Brief Bioinform 2021; 22:bbaa169. [PMID: 32793970 PMCID: PMC8138889 DOI: 10.1093/bib/bbaa169] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/24/2020] [Revised: 06/18/2020] [Accepted: 07/04/2020] [Indexed: 12/12/2022] Open
Abstract
Gene expression data have played an essential role in many biomedical studies. When the number of genes is large and sample size is limited, there is a 'lack of information' problem, leading to low-quality findings. To tackle this problem, both horizontal and vertical data integrations have been developed, where vertical integration methods collectively analyze data on gene expressions as well as their regulators (such as mutations, DNA methylation and miRNAs). In this article, we conduct a selective review of vertical data integration methods for gene expression data. The reviewed methods cover both marginal and joint analysis and supervised and unsupervised analysis. The main goal is to provide a sketch of the vertical data integration paradigm without digging into too many technical details. We also briefly discuss potential pitfalls, directions for future developments and application notes.
Affiliation(s)
- Mengyun Wu: School of Statistics and Management, Shanghai University of Finance and Economics
- Huangdi Yi: Department of Biostatistics at Yale University
- Shuangge Ma: Department of Biostatistics at Yale University
|
14
|
A New Era of Neuro-Oncology Research Pioneered by Multi-Omics Analysis and Machine Learning. Biomolecules 2021; 11:biom11040565. [PMID: 33921457 PMCID: PMC8070530 DOI: 10.3390/biom11040565] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 03/11/2021] [Revised: 04/02/2021] [Accepted: 04/07/2021] [Indexed: 02/06/2023] Open
Abstract
Although the incidence of central nervous system (CNS) cancers is not high, these cancers significantly reduce patients' quality of life and result in high mortality rates. A low incidence also means a low number of cases, which in turn means a low amount of information. To compensate, researchers have tried to increase the amount of information available from a single test using high-throughput technologies. This approach, referred to as single-omics analysis, has only been partially successful, as one type of data may not be able to appropriately describe all the characteristics of a tumor. It is presently unclear what type of data can describe a particular clinical situation. One way to solve this problem is to use multi-omics data. When using many types of data, a selected data type or a combination of them may effectively resolve a clinical question. Hence, we conducted a comprehensive survey of papers in the field of neuro-oncology that used multi-omics data for analysis and found that most of the papers utilized machine learning techniques. This finding indicates that machine learning techniques are useful in multi-omics analysis. In this review, we discuss the current status of multi-omics analysis in the field of neuro-oncology and the importance of using machine learning techniques.
|
15
|
Wen Y, Song X, Yan B, Yang X, Wu L, Leng D, He S, Bo X. Multi-dimensional data integration algorithm based on random walk with restart. BMC Bioinformatics 2021; 22:97. [PMID: 33639858 PMCID: PMC7912853 DOI: 10.1186/s12859-021-04029-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 12/14/2020] [Accepted: 02/15/2021] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND The accumulation of various multi-omics data and computational approaches for data integration can accelerate the development of precision medicine. However, algorithm development for multi-omics data integration remains a pressing challenge. RESULTS Here, we propose a multi-omics data integration algorithm based on random walk with restart (RWR) on a multiplex network. We call the resulting methodology Random Walk with Restart for multi-dimensional data Fusion (RWRF). RWRF uses sample similarity networks as the basis for integration. It constructs a similarity network for each data type and then connects corresponding samples across the similarity networks to create a multiplex sample network. By applying RWR on the multiplex network, RWRF uses the stationary probability distribution to fuse the similarity networks. We applied RWRF to The Cancer Genome Atlas (TCGA) data to identify subtypes in different cancer data sets. Three types of data (mRNA expression, DNA methylation, and microRNA expression data) are integrated and network clustering is conducted. Experimental results show that RWRF performs better than single data type analysis and previous integrative methods. CONCLUSIONS RWRF provides powerful support for users deciphering cancer molecular subtypes and thus may benefit precision treatment of specific patients in clinical practice.
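The core idea of a random walk with restart on a multiplex sample network can be illustrated with a minimal sketch. It is not the RWRF implementation: the helper names, the equal inter-layer weight, the restart probability of 0.7, and the two toy three-sample layers are all assumptions made for illustration.

```python
import numpy as np


def multiplex_adjacency(layers, inter_weight=1.0):
    """Stack per-omics sample similarity networks into one supra-adjacency matrix.

    Assumption for illustration: every sample is linked to its own copy in each
    other layer with the same weight `inter_weight`.
    """
    n = layers[0].shape[0]
    k = len(layers)
    blocks = [
        [layers[i] if i == j else inter_weight * np.eye(n) for j in range(k)]
        for i in range(k)
    ]
    return np.block(blocks)


def random_walk_with_restart(adjacency, seed, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * e until the L1 change falls below `tol`."""
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    if np.any(col_sums == 0):
        raise ValueError("every node needs at least one edge")
    transition = adjacency / col_sums  # column-stochastic transition matrix W
    start = np.zeros(adjacency.shape[0])
    start[seed] = 1.0  # restart vector e, concentrated on the seed node
    p = start.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * transition @ p + restart * start
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


# Toy similarity networks for three samples in two hypothetical omics layers.
layer_a = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
layer_b = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
supra = multiplex_adjacency([layer_a, layer_b], inter_weight=0.5)
stationary = random_walk_with_restart(supra, seed=0)
```

Here `stationary` holds, for a walker that keeps restarting at sample 0 in the first layer, the long-run probability of being at each sample copy in each layer. Repeating the walk from every seed would give one such vector per node, which is the kind of quantity a fusion step could combine into a single sample similarity matrix.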
Affiliation(s)
- Yuqi Wen: Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China
- Xinyu Song: Department of Biomedical Engineering, Chinese PLA General Hospital, Beijing, 100853, People's Republic of China
- Bowei Yan: Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China
- Xiaoxi Yang: Experimental Center, Beijing Friendship Hospital, Capital Medical University, Beijing, 100069, People's Republic of China
- Lianlian Wu: Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China; Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, 300072, People's Republic of China
- Dongjin Leng: Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China
- Song He: Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China
- Xiaochen Bo: Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China
|
16
|
Biswas N, Chakrabarti S. Artificial Intelligence (AI)-Based Systems Biology Approaches in Multi-Omics Data Analysis of Cancer. Front Oncol 2020; 10:588221. [PMID: 33154949 PMCID: PMC7591760 DOI: 10.3389/fonc.2020.588221] [Citation(s) in RCA: 65] [Impact Index Per Article: 13.0] [Received: 07/28/2020] [Accepted: 09/21/2020] [Indexed: 12/13/2022] Open
Abstract
Cancer is the manifestation of abnormalities of different physiological processes involving genes, DNAs, RNAs, proteins, and other biomolecules whose profiles are reflected in different omics data types. As these bio-entities are highly correlated, integrative analysis of different types of omics data (multi-omics data) is required to understand the disease from tumorigenesis to disease progression. Artificial intelligence (AI), specifically machine learning algorithms, can make decisive interpretations of "big"-sized complex data and hence appears to be the most effective tool for the analysis and understanding of multi-omics data for patient-specific observations. In this review, we discuss the recent outcomes of employing AI in multi-omics data analysis of different types of cancer. Based on research trends and significance in patient treatment, we have primarily focused on AI-based analysis for determining cancer subtypes, disease prognosis, and therapeutic targets. We also discuss AI analysis of some non-canonical types of omics data, as they can play a determining role in cancer patient care. Additionally, we briefly discuss data repositories because of their pivotal role in multi-omics data storage, processing, and analysis.
Affiliation(s)
- Nupur Biswas: Structural Biology and Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, IICB TRUE Campus, Kolkata, India
- Saikat Chakrabarti: Structural Biology and Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, IICB TRUE Campus, Kolkata, India
|
17
|
Sturchio A, Marsili L, Vizcarra JA, Dwivedi AK, Kauffman MA, Duker AP, Lu P, Pauciulo MW, Wissel BD, Hill EJ, Stecher B, Keeling EG, Vagal AS, Wang L, Haslam DB, Robson MJ, Tanner CM, Hagey DW, El Andaloussi S, Ezzat K, Fleming RMT, Lu LJ, Little MA, Espay AJ. Phenotype-Agnostic Molecular Subtyping of Neurodegenerative Disorders: The Cincinnati Cohort Biomarker Program (CCBP). Front Aging Neurosci 2020; 12:553635. [PMID: 33132895 PMCID: PMC7578373 DOI: 10.3389/fnagi.2020.553635] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 04/19/2020] [Accepted: 09/10/2020] [Indexed: 12/16/2022] Open
Abstract
Ongoing biomarker development programs have been designed to identify serologic or imaging signatures of clinico-pathologic entities, assuming distinct biological boundaries between them. Identified putative biomarkers have exhibited large variability and inconsistency between cohorts, and remain inadequate for selecting suitable recipients for potential disease-modifying interventions. We launched the Cincinnati Cohort Biomarker Program (CCBP) as a population-based, phenotype-agnostic longitudinal study. While patients affected by a wide range of neurodegenerative disorders will be deeply phenotyped using clinical, imaging, and mobile health technologies, analyses will not be anchored on phenotypic clusters but on bioassays of to-be-repurposed medications as well as on genomics, transcriptomics, proteomics, metabolomics, epigenomics, microbiomics, and pharmacogenomics analyses blinded to phenotypic data. Unique features of this cohort study include (1) a reverse biology-to-phenotype direction of biomarker development in which clinical, imaging, and mobile health technologies are subordinate to biological signals of interest; (2) hypothesis free, causally- and data driven-based analyses; (3) inclusive recruitment of patients with neurodegenerative disorders beyond clinical criteria-meeting patients with Parkinson's and Alzheimer's diseases, and (4) a large number of longitudinally followed participants. The parallel development of serum bioassays will be aimed at linking biologically suitable subjects to already available drugs with repurposing potential in future proof-of-concept adaptive clinical trials. 
Although many challenges are anticipated, including the unclear pathogenic relevance of identifiable biological signals and the possibility that some signals of importance may not yet be measurable with current technologies, this cohort study abandons the anchoring role of clinico-pathologic criteria in favor of biomarker-driven disease subtyping to facilitate future biosubtype-specific disease-modifying therapeutic efforts.
Affiliation(s)
- Andrea Sturchio: James J. and Joan A. Gardner Family Center for Parkinson’s disease and Movement Disorders, Department of Neurology, University of Cincinnati, Cincinnati, OH, United States
- Luca Marsili: James J. and Joan A. Gardner Family Center for Parkinson’s disease and Movement Disorders, Department of Neurology, University of Cincinnati, Cincinnati, OH, United States
- Joaquin A. Vizcarra: James J. and Joan A. Gardner Family Center for Parkinson’s disease and Movement Disorders, Department of Neurology, University of Cincinnati, Cincinnati, OH, United States
- Alok K. Dwivedi: Division of Biostatistics and Epidemiology, Department of Biomedical Sciences, Paul L. Foster School of Medicine, Texas Tech University Health Sciences Center, El Paso, TX, United States
- Marcelo A. Kauffman: Consultorio y Laboratorio de Neurogenética, Centro Universitario de Neurología “José María Ramos Mejía” y División Neurología, Hospital JM Ramos Mejía, Facultad de Medicina, Universidad de Buenos Aires, Buenos Aires, Argentina; Programa de Medicina de Precision y Genomica Clinica, Instituto de Investigaciones en Medicina Traslacional, Facultad de Ciencias Biomédicas, Universidad Austral– Consejo Nacional de Investigaciones Científicas y Técnicas de Argentina, Pilar, Argentina
- Andrew P. Duker: James J. and Joan A. Gardner Family Center for Parkinson’s disease and Movement Disorders, Department of Neurology, University of Cincinnati, Cincinnati, OH, United States
- Peixin Lu: Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Department of Pediatrics, University of Cincinnati, Cincinnati, OH, United States; School of Information Management, Wuhan University, Wuhan, China
- Michael W. Pauciulo: Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Department of Pediatrics, University of Cincinnati, Cincinnati, OH, United States
- Benjamin D. Wissel: James J. and Joan A. Gardner Family Center for Parkinson’s disease and Movement Disorders, Department of Neurology, University of Cincinnati, Cincinnati, OH, United States; Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Department of Pediatrics, University of Cincinnati, Cincinnati, OH, United States
- Emily J. Hill: James J. and Joan A. Gardner Family Center for Parkinson’s disease and Movement Disorders, Department of Neurology, University of Cincinnati, Cincinnati, OH, United States
- Benjamin Stecher: James J. and Joan A. Gardner Family Center for Parkinson’s disease and Movement Disorders, Department of Neurology, University of Cincinnati, Cincinnati, OH, United States
- Elizabeth G. Keeling: James J. and Joan A. Gardner Family Center for Parkinson’s disease and Movement Disorders, Department of Neurology, University of Cincinnati, Cincinnati, OH, United States
- Achala S. Vagal: Department of Radiology, University of Cincinnati Medical Center, Cincinnati, OH, United States
- Lily Wang: Department of Radiology, University of Cincinnati Medical Center, Cincinnati, OH, United States
- David B. Haslam: Division of Infectious Diseases, Center for Inflammation and Tolerance, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, United States
- Matthew J. Robson: Division of Pharmaceutical Sciences, James L. Winkle College of Pharmacy, University of Cincinnati, Cincinnati, OH, United States
- Caroline M. Tanner: Department of Neurology, Weill Institute for Neurosciences, Parkinson’s Disease Research Education and Clinical Center, San Francisco Veteran’s Affairs Medical Center, University of California, San Francisco, San Francisco, CA, United States
- Daniel W. Hagey: Department of Laboratory Medicine, Clinical Research Center, Karolinska Institutet, Stockholm, Sweden
- Samir El Andaloussi: Department of Laboratory Medicine, Clinical Research Center, Karolinska Institutet, Stockholm, Sweden
- Kariem Ezzat: Department of Laboratory Medicine, Clinical Research Center, Karolinska Institutet, Stockholm, Sweden
- Ronan M. T. Fleming: Analytical Biosciences, Division of Systems Biomedicine and Pharmacology, Leiden Academic Centre for Drug Research, Leiden University, Leiden, Netherlands
- Long J. Lu: Programa de Medicina de Precision y Genomica Clinica, Instituto de Investigaciones en Medicina Traslacional, Facultad de Ciencias Biomédicas, Universidad Austral– Consejo Nacional de Investigaciones Científicas y Técnicas de Argentina, Pilar, Argentina
- Max A. Little: School of Computer Science, University of Birmingham, Birmingham, United Kingdom; Media Lab, Massachusetts Institute of Technology, Cambridge, MA, United States
- Alberto J. Espay: James J. and Joan A. Gardner Family Center for Parkinson’s disease and Movement Disorders, Department of Neurology, University of Cincinnati, Cincinnati, OH, United States
|
18
|
Nicora G, Vitali F, Dagliati A, Geifman N, Bellazzi R. Integrated Multi-Omics Analyses in Oncology: A Review of Machine Learning Methods and Tools. Front Oncol 2020; 10:1030. [PMID: 32695678 PMCID: PMC7338582 DOI: 10.3389/fonc.2020.01030] [Citation(s) in RCA: 129] [Impact Index Per Article: 25.8] [Received: 01/30/2020] [Accepted: 05/26/2020] [Indexed: 12/16/2022] Open
Abstract
In recent years, high-throughput sequencing technologies have provided an unprecedented opportunity to depict cancer samples at multiple molecular levels. The integration and analysis of these multi-omics datasets is a crucial step to gain actionable knowledge in a precision medicine framework. This paper explores recent data-driven methodologies that have been developed and applied to respond to major challenges of stratified medicine in oncology, including patients' phenotyping, biomarker discovery, and drug repurposing. We systematically retrieved peer-reviewed journal articles published from 2014 to 2019, then selected and thoroughly described the tools presenting the most promising innovations regarding the integration of heterogeneous data, the machine learning methodologies that successfully tackled the complexity of multi-omics data, and the frameworks to deliver actionable results for clinical practice. The review is organized according to the applied methods: Deep learning, Network-based methods, Clustering, Features Extraction, and Transformation, Factorization. We provide an overview of the tools available in each methodological group and underline the relationship among the different categories. Our analysis revealed how multi-omics datasets could be exploited to drive precision oncology, but also current limitations in the development of multi-omics data integration.
Affiliation(s)
- Giovanna Nicora: Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
- Francesca Vitali: Center for Innovation in Brain Science, University of Arizona, Tucson, AZ, United States; Department of Neurology, College of Medicine, University of Arizona, Tucson, AZ, United States; Center for Biomedical Informatics and Biostatistics, University of Arizona, Tucson, AZ, United States
- Arianna Dagliati: Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy; Centre for Health Informatics, The University of Manchester, Manchester, United Kingdom; The Manchester Molecular Pathology Innovation Centre, The University of Manchester, Manchester, United Kingdom
- Nophar Geifman: Centre for Health Informatics, The University of Manchester, Manchester, United Kingdom; The Manchester Molecular Pathology Innovation Centre, The University of Manchester, Manchester, United Kingdom
- Riccardo Bellazzi: Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
|
19
|
Rappoport N, Shamir R. NEMO: cancer subtyping by integration of partial multi-omic data. Bioinformatics 2020; 35:3348-3356. [PMID: 30698637 PMCID: PMC6748715 DOI: 10.1093/bioinformatics/btz058] [Citation(s) in RCA: 132] [Impact Index Per Article: 26.4] [Received: 09/30/2018] [Revised: 12/23/2018] [Accepted: 01/25/2019] [Indexed: 01/10/2023] Open
Abstract
MOTIVATION Cancer subtypes were usually defined based on molecular characterization of single omic data. Increasingly, measurements of multiple omic profiles for the same cohort are available. Defining cancer subtypes using multi-omic data may improve our understanding of cancer, and suggest more precise treatment for patients. RESULTS We present NEMO (NEighborhood based Multi-Omics clustering), a novel algorithm for multi-omics clustering. Importantly, NEMO can be applied to partial datasets in which some patients have data for only a subset of the omics, without performing data imputation. In extensive testing on ten cancer datasets spanning 3168 patients, NEMO achieved results comparable to the best of nine state-of-the-art multi-omics clustering algorithms on full data and showed an improvement on partial data. On some of the partial data tests, PVC, a multi-view algorithm, performed better, but it is limited to two omics and to positive partial data. Finally, we demonstrate the advantage of NEMO in detailed analysis of partial data of AML patients. NEMO is fast and much simpler than existing multi-omics clustering algorithms, and avoids iterative optimization. AVAILABILITY AND IMPLEMENTATION Code for NEMO and for reproducing all NEMO results in this paper is available on GitHub: https://github.com/Shamir-Lab/NEMO. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nimrod Rappoport: Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Ron Shamir: Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
|
20
|
Wei Z, Zhang Y, Weng W, Chen J, Cai H. Survey and comparative assessments of computational multi-omics integrative methods with multiple regulatory networks identifying distinct tumor compositions across pan-cancer data sets. Brief Bioinform 2020; 22:5856342. [PMID: 32533167 DOI: 10.1093/bib/bbaa102] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/27/2020] [Revised: 05/02/2020] [Accepted: 05/04/2020] [Indexed: 12/20/2022] Open
Abstract
The significance of pan-cancer categories has recently been widely recognized in cancer research. Pan-cancer analysis categorizes a cancer based on its molecular pathology rather than its organ of origin. The molecular similarities among multi-omics data found in different cancer types can play several roles in both biological processes and therapeutic developments. Therefore, an integrated analysis of various genomic data is frequently used to reveal novel genetic and molecular mechanisms. However, a variety of algorithms for multi-omics clustering have been proposed in different fields, and how these computational clustering methods compare in pan-cancer analysis remains unclear. To increase the utilization of current integrative methods in pan-cancer analysis, we first provide an overview of five popular computational integrative tools: similarity network fusion, integrative clustering of multiple genomic data types (iCluster), cancer integration via multi-kernel learning (CIMLR), perturbation clustering for data integration and disease subtyping (PINS) and low-rank clustering (LRACluster). Then, a priori interactions in multi-omics data were incorporated to detect prominent molecular patterns in pan-cancer data sets. Finally, we present comparative assessments of these methods, with discussion of key issues in applying these algorithms. We found that all five methods can identify distinct tumor compositions. The pan-cancer samples can be reclassified into several groups by different proportions. Interestingly, each method can classify the tumors into categories that are different from the original cancer types or subtypes, especially for ovarian serous cystadenocarcinoma (OV) and breast invasive carcinoma (BRCA) tumors. In addition, all clusters of the five computational methods show notable prognostic values. Furthermore, both the 9 recurrent differential genes and the 15 common pathway characteristics were identified across all the methods. The results and discussion can help the community select appropriate integrative tools according to different research tasks or aims in pan-cancer analysis.
Affiliation(s)
- Zhuohui Wei: Computer Science and Engineering, South China University of Technology
- Yue Zhang: School of Computer Science, Guangdong Polytechnic Normal University
- Wanlin Weng: Computer Science and Engineering, South China University of Technology
- Jiazhou Chen: Computer Science and Engineering, South China University of Technology
- Hongmin Cai: Computer Science and Engineering, South China University of Technology
|
21
|
Multiplex bioimaging of single-cell spatial profiles for precision cancer diagnostics and therapeutics. NPJ Precis Oncol 2020; 4:11. [PMID: 32377572 PMCID: PMC7195402 DOI: 10.1038/s41698-020-0114-1] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Received: 11/04/2019] [Accepted: 03/05/2020] [Indexed: 12/13/2022] Open
Abstract
Cancers exhibit functional and structural diversity in distinct patients. Within the tumor mass, normal and malignant cells create a tumor microenvironment that is heterogeneous among patients. A residue from primary tumors leaks into the bloodstream as cell clusters and single cells, providing clues about disease progression and therapeutic response. The complexity of these hierarchical microenvironments needs to be elucidated. Although tumors comprise ample cell types, the standard clinical technique is still histology, which is limited to a single marker. Multiplexed imaging technologies open new directions in pathology. Spatially resolved proteomic, genomic, and metabolic profiles of human cancers are now possible at the single-cell level. This perspective discusses spatial bioimaging methods to decipher the cascade of microenvironments in solid and liquid biopsies. A unique synthesis of top-down and bottom-up analysis methods is presented. Spatial multi-omics profiles can be tailored to precision oncology through artificial intelligence. Data-driven patient profiling enables personalized medicine and beyond.
Collapse
|
22
|
Tini G, Marchetti L, Priami C, Scott-Boyer MP. Multi-omics integration-a comparison of unsupervised clustering methodologies. Brief Bioinform 2020; 20:1269-1279. [PMID: 29272335 DOI: 10.1093/bib/bbx167] [Citation(s) in RCA: 84] [Impact Index Per Article: 16.8] [Received: 09/02/2017] [Revised: 11/06/2017] [Indexed: 12/19/2022] Open
Abstract
With the recent developments in the field of multi-omics integration, interest in factors such as data preprocessing, the choice of integration method and the number of different omics considered has increased. In this work, the impact of these factors is explored when solving the problem of sample classification, by comparing the performance of five unsupervised algorithms: Multiple Canonical Correlation Analysis, Multiple Co-Inertia Analysis, Multiple Factor Analysis, Joint and Individual Variation Explained and Similarity Network Fusion. These methods were applied to three real data sets taken from the literature and to several ad hoc simulated scenarios to discuss classification performance under different conditions of noise and signal strength across the data types. The impact of experimental design, feature selection and parameter training has also been evaluated to unravel important conditions that can affect the accuracy of the result.
|
23
|
Seal DB, Das V, Goswami S, De RK. Estimating gene expression from DNA methylation and copy number variation: A deep learning regression model for multi-omics integration. Genomics 2020; 112:2833-2841. [PMID: 32234433 DOI: 10.1016/j.ygeno.2020.03.021] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 11/28/2019] [Revised: 03/17/2020] [Accepted: 03/22/2020] [Indexed: 12/21/2022]
Abstract
Gene expression analysis plays a significant role in providing molecular insights into cancer. Various genetic and epigenetic factors (studied together under multi-omics) affect gene expression, giving rise to cancer phenotypes. The recent growth in understanding of multi-omics provides a resource for integration in interdisciplinary biology, since together these data can draw a comprehensive picture of an organism's developmental and disease biology in cancers. Such large-scale multi-omics data can be obtained from public consortia such as The Cancer Genome Atlas (TCGA) and several other platforms. Integrating these multi-omics data from varied platforms is still challenging due to the high noise and sensitivity of the platforms used. Currently, a robust integrative predictive model to estimate gene expression from these genetic and epigenetic data is lacking. In this study, we have developed a deep learning-based predictive model using a Deep Denoising Auto-encoder (DDAE) and a Multi-layer Perceptron (MLP) that can quantitatively capture how genetic and epigenetic alterations correlate with the directionality of gene expression for liver hepatocellular carcinoma (LIHC). The DDAE used in the study has been trained to extract significant features from the input omics data to estimate the gene expression. These features have then been used for back-propagation learning by the multilayer perceptron for the tasks of regression and classification. We have benchmarked the proposed model against state-of-the-art regression models. Finally, the deep learning-based integration model has been evaluated for its disease classification capability, where an accuracy of 95.1% has been obtained.
Affiliation(s)
- Dibyendu Bikash Seal
  - A. K. Choudhury School of Information Technology, University of Calcutta, JD-2, Sector III, Salt Lake City, Kolkata 700106, India
- Vivek Das
  - Novo Nordisk Research Center Seattle, Inc., 530 Fairview Ave N # 5000, Seattle, WA 98109, United States
- Saptarsi Goswami
  - Bangabasi Morning College, 35 Rajkumar Chakraborty Sarani, Scott Ln, Kolkata 700009, India
- Rajat K De
  - Machine Intelligence Unit, Indian Statistical Institute, 203 Barrackpore Trunk Road, Kolkata 700108, India.
24
Hulot A, Chiquet J, Jaffrézic F, Rigaill G. Fast tree aggregation for consensus hierarchical clustering. BMC Bioinformatics 2020; 21:120. [PMID: 32197576 PMCID: PMC7085155 DOI: 10.1186/s12859-020-3453-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 12/03/2019] [Accepted: 03/11/2020] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND: In unsupervised learning and clustering, data integration from different sources and types is a difficult question discussed in several research areas. For instance in omics analysis, dozens of clustering methods have been developed in the past decade. When a single source of data is at play, hierarchical clustering (HC) is extremely popular, as a tree structure is highly interpretable and arguably more informative than just a partition of the data. However, blindly applying HC to multiple sources of data raises computational and interpretation issues. RESULTS: We propose mergeTrees, a method that aggregates a set of trees with the same leaves to create a consensus tree. In our consensus tree, a cluster at height h contains the individuals that are in the same cluster for all the trees at height h. The method is exact and proven to be [Formula: see text], n being the individuals and q being the number of trees to aggregate. Our implementation is extremely effective on simulations, allowing us to process many large trees at a time. We also rely on mergeTrees to perform the cluster analysis of two real omics data sets, introducing a spectral variant as an efficient and robust by-product. CONCLUSIONS: Our tree aggregation method can be used in conjunction with hierarchical clustering to perform efficient cluster analysis. This approach was found to be robust to the absence of clustering information in some of the data sets as well as an increased variability within true clusters. The method is implemented in R/C++ and available as an R package named mergeTrees, which makes it easy to integrate in existing or new pipelines in several research areas.
Affiliation(s)
- Audrey Hulot
  - Université Paris-Saclay, INRAE, AgroParisTech, GABI, Jouy-en-Josas, 78350 France
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA-Paris, Paris, 75005 France
  - Université Paris-Saclay, UVSQ, Inserm, Infection et inflammation, Montigny-Le-Bretonneux, 78180 France
- Julien Chiquet
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA-Paris, Paris, 75005 France
- Florence Jaffrézic
  - Université Paris-Saclay, INRAE, AgroParisTech, GABI, Jouy-en-Josas, 78350 France
- Guillem Rigaill
  - Université Paris-Saclay, CNRS, INRAE, Univ Evry, Institute of Plant Sciences Paris-Saclay (IPS2), Orsay, 91405 France
  - Université de Paris, CNRS, INRAE, Institute of Plant Sciences Paris-Saclay (IPS2), Orsay, 91405 France
  - Université Paris-Saclay, CNRS, Univ Evry, Laboratoire de Mathématiques et Modélisation d’Evry, Evry, 91037 France
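The consensus rule stated in the Hulot et al. abstract (two individuals share a consensus cluster at height h only if they share a cluster in every tree at height h) can be illustrated with a minimal Python sketch. This is a naive illustration of that rule at a single height, not the mergeTrees algorithm or its R/C++ implementation. The function name `consensus_partition` and the example label vectors are hypothetical, and each tree is assumed to have already been cut at the same height h into a vector of cluster labels.

```python
from collections import defaultdict


def consensus_partition(partitions):
    """Intersect several partitions of the same individuals.

    partitions: list of label sequences, one per tree, all of equal length.
    Two individuals end up in the same consensus cluster only if they carry
    the same label in every input partition.
    """
    if not partitions:
        raise ValueError("need at least one partition")
    n = len(partitions[0])
    if any(len(p) != n for p in partitions):
        raise ValueError("all partitions must cover the same individuals")

    groups = defaultdict(list)
    for i in range(n):
        # The tuple of per-tree labels identifies the consensus cluster.
        key = tuple(p[i] for p in partitions)
        groups[key].append(i)
    # Order clusters by their smallest member for a stable result.
    return sorted(groups.values(), key=lambda members: members[0])


# Hypothetical cuts of two trees at the same height h (five individuals).
tree_a = [0, 0, 0, 1, 1]
tree_b = [0, 0, 1, 1, 1]
clusters = consensus_partition([tree_a, tree_b])
print(clusters)
```

Individual 2 is grouped with individuals 0 and 1 in the first tree but with 3 and 4 in the second, so under the intersection rule it forms its own consensus cluster.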
25
Kang M, Gao J. Integration of Multi-omics Data for Expression Quantitative Trait Loci (eQTL) Analysis and eQTL Epistasis. Methods Mol Biol 2020; 2082:157-171. [PMID: 31849014 DOI: 10.1007/978-1-0716-0026-9_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Expression quantitative trait loci (eQTL) mapping studies identify genetic loci that regulate gene expression. eQTL mapping studies can capture gene regulatory interactions and provide insight into the genetic mechanism of biological systems. Recently, the integration of multi-omics data, such as single-nucleotide polymorphisms (SNPs), copy number variations (CNVs), DNA methylation, and gene expression, plays an important role in elucidating complex biological systems, since biological systems involve a sequence of complex interactions between various biological processes. This chapter introduces multi-omics data that have been used in many eQTL studies and integrative methodologies that incorporate multi-omics data for eQTL studies. Furthermore, we describe a statistical approach that can detect nonlinear causal relationships between eQTLs, called eQTL epistasis, and its importance.
Affiliation(s)
- Mingon Kang
  - Department of Computer Science, University of Nevada, Las Vegas, Las Vegas, NV, USA
- Jean Gao
  - Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX, USA.
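To make the idea of eQTL mapping in the Kang and Gao abstract concrete, the sketch below shows the simplest single-locus test one might run: an ordinary least-squares regression of a gene's expression on genotype dosage at one SNP. It assumes an additive coding (0, 1 or 2 copies of the alternate allele). The function name `eqtl_slope` and the toy data are hypothetical, and this is not the integrative or epistasis methodology the chapter describes.

```python
def eqtl_slope(genotypes, expression):
    """Least-squares slope of expression on genotype dosage (0/1/2).

    A slope far from zero suggests the locus is associated with the
    expression level of the gene (a candidate eQTL).
    """
    n = len(genotypes)
    if n != len(expression) or n < 2:
        raise ValueError("need two equally long samples of size >= 2")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    # Sum of squares of the genotype and its cross-product with expression.
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("genotype does not vary across samples")
    sxy = sum((g - mean_g) * (e - mean_e)
              for g, e in zip(genotypes, expression))
    return sxy / sxx


# Toy data: expression rises by about one unit per alternate allele.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope = eqtl_slope(genotypes, expression)
print(round(slope, 3))
```

In practice such a test would be repeated over many SNP-gene pairs with covariates and multiple-testing correction, which this sketch leaves out.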
26
Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res 2018; 46:10546-10562. [PMID: 30295871 PMCID: PMC6237755 DOI: 10.1093/nar/gky889] [Citation(s) in RCA: 259] [Impact Index Per Article: 43.2] [Received: 07/18/2018] [Accepted: 09/20/2018] [Indexed: 12/18/2022] Open
Abstract
Recent high throughput experimental methods have been used to collect large biomedical omics datasets. Clustering of single omic datasets has proven invaluable for biological and medical research. The decreasing cost and development of additional high throughput methods now enable measurement of multi-omic data. Clustering multi-omic data has the potential to reveal further systems-level insights, but raises computational and biological challenges. Here, we review algorithms for multi-omics clustering, and discuss key issues in applying these algorithms. Our review covers methods developed specifically for omic data as well as generic multi-view methods developed in the machine learning community for joint clustering of multiple data types. In addition, using cancer data from TCGA, we perform an extensive benchmark spanning ten different cancer types, providing the first systematic comparison of leading multi-omics and multi-view clustering algorithms. The results highlight key issues regarding the use of single- versus multi-omics, the choice of clustering strategy, the power of generic multi-view methods and the use of approximated p-values for gauging solution quality. Due to the growing use of multi-omics data, we expect these issues to be important for future progress in the field.
Affiliation(s)
- Nimrod Rappoport
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Ron Shamir
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
27
Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A Selective Review of Multi-Level Omics Data Integration Using Variable Selection. High Throughput 2019; 8:E4. [PMID: 30669303 PMCID: PMC6473252 DOI: 10.3390/ht8010004] [Citation(s) in RCA: 122] [Impact Index Per Article: 20.3] [Received: 11/22/2018] [Revised: 12/24/2018] [Accepted: 01/10/2019] [Indexed: 01/02/2023] Open
Abstract
High-throughput technologies have been used to generate a large amount of omics data. In the past, single-level analysis has been extensively conducted where the omics measurements at different levels, including mRNA, microRNA, CNV and DNA methylation, are analyzed separately. As the molecular complexity of disease etiology exists at all different levels, integrative analysis offers an effective way to borrow strength across multi-level omics data and can be more powerful than single level analysis. In this article, we focus on reviewing existing multi-omics integration studies by paying special attention to variable selection methods. We first summarize published reviews on integrating multi-level omics data. Next, after a brief overview on variable selection methods, we review existing supervised, semi-supervised and unsupervised integrative analyses within parallel and hierarchical integration studies, respectively. The strength and limitations of the methods are discussed in detail. No existing integration method can dominate the rest. The computation aspects are also investigated. The review concludes with possible limitations and future directions for multi-level omics data integration.
Affiliation(s)
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Jie Ren
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Xiaoxi Li
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Yu Jiang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN 38152, USA.
- Shuangge Ma
  - Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06510, USA.
28
Balluff B, Buck A, Martin‐Lorenzo M, Dewez F, Langer R, McDonnell LA, Walch A, Heeren RM. Integrative Clustering in Mass Spectrometry Imaging for Enhanced Patient Stratification. Proteomics Clin Appl 2019; 13:e1800137. [PMID: 30580496 PMCID: PMC6590511 DOI: 10.1002/prca.201800137] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 08/22/2018] [Revised: 11/28/2018] [Indexed: 12/04/2022]
Abstract
SCOPE: In biomedical research, mass spectrometry imaging (MSI) can obtain spatially-resolved molecular information from tissue sections. Especially matrix-assisted laser desorption/ionization (MALDI) MSI offers, depending on the type of matrix, the detection of a broad variety of molecules ranging from metabolites to proteins, thereby facilitating the collection of multilevel molecular data. Lately, integrative clustering techniques have been developed that make use of the complementary information of multilevel molecular data in order to better stratify patient cohorts, but which have not yet been applied in the field of MSI. MATERIALS AND METHODS: In this study, the potential of integrative clustering is investigated for multilevel molecular MSI data to subdivide cancer patients into different prognostic groups. Metabolomic and peptidomic data are obtained by MALDI-MSI from a tissue microarray containing material of 46 esophageal cancer patients. The integrative clustering methods Similarity Network Fusion, iCluster, and moCluster are applied and compared to non-integrated clustering. CONCLUSION: The results show that the combination of multilevel molecular data increases the capability of integrative algorithms to detect patient subgroups with different clinical outcome, compared to the single level or concatenated data. This underlines the potential of multilevel molecular data from the same subject using MSI for subsequent integrative clustering.
Affiliation(s)
- Benjamin Balluff
  - Maastricht MultiModal Molecular Imaging institute (M4I), Maastricht University, 6229 ER Maastricht, The Netherlands
- Achim Buck
  - Research Unit Analytical Pathology, Helmholtz Zentrum München, 85764 Oberschleißheim, Germany
- Marta Martin‐Lorenzo
  - Maastricht MultiModal Molecular Imaging institute (M4I), Maastricht University, 6229 ER Maastricht, The Netherlands
- Frédéric Dewez
  - Maastricht MultiModal Molecular Imaging institute (M4I), Maastricht University, 6229 ER Maastricht, The Netherlands
- Rupert Langer
  - Institute of Pathology, University of Bern, CH‐3008 Bern, Switzerland
- Axel Walch
  - Research Unit Analytical Pathology, Helmholtz Zentrum München, 85764 Oberschleißheim, Germany
- Ron M.A. Heeren
  - Maastricht MultiModal Molecular Imaging institute (M4I), Maastricht University, 6229 ER Maastricht, The Netherlands
29
Chiu AM, Mitra M, Boymoushakian L, Coller HA. Integrative analysis of the inter-tumoral heterogeneity of triple-negative breast cancer. Sci Rep 2018; 8:11807. [PMID: 30087365 PMCID: PMC6081411 DOI: 10.1038/s41598-018-29992-5] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Received: 03/06/2018] [Accepted: 07/18/2018] [Indexed: 02/07/2023] Open
Abstract
Triple-negative breast cancers (TNBC) lack estrogen and progesterone receptors and HER2 amplification, and are resistant to therapies that target these receptors. Tumors from TNBC patients are heterogeneous based on genetic variations, tumor histology, and clinical outcomes. We used high throughput genomic data for TNBC patients (n = 137) from TCGA to characterize inter-tumor heterogeneity. Similarity network fusion (SNF)-based integrative clustering combining gene expression, miRNA expression, and copy number variation, revealed three distinct patient clusters. Integrating multiple types of data resulted in more distinct clusters than analyses with a single datatype. Whereas most TNBCs are classified by PAM50 as basal subtype, one of the clusters was enriched in the non-basal PAM50 subtypes, exhibited more aggressive clinical features and had a distinctive signature of oncogenic mutations, miRNAs and expressed genes. Our analyses provide a new classification scheme for TNBC based on multiple omics datasets and provide insight into molecular features that underlie TNBC heterogeneity.
Affiliation(s)
- Alec M Chiu
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, USA
- Mithun Mitra
  - Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, USA; Department of Biological Chemistry, David Geffen School of Medicine, University of California, Los Angeles, USA
- Lari Boymoushakian
  - Department of Computer Science, University of California, Los Angeles, USA
- Hilary A Coller
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, USA; Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, USA; Department of Biological Chemistry, David Geffen School of Medicine, University of California, Los Angeles, USA.
30
Misra BB, Langefeld CD, Olivier M, Cox LA. Integrated Omics: Tools, Advances, and Future Approaches. J Mol Endocrinol 2018; 62:JME-18-0055. [PMID: 30006342 DOI: 10.1530/jme-18-0055] [Citation(s) in RCA: 249] [Impact Index Per Article: 35.6] [Received: 02/24/2018] [Revised: 07/02/2018] [Accepted: 07/12/2018] [Indexed: 12/13/2022]
Abstract
With the rapid adoption of high-throughput omic approaches to analyze biological samples such as genomics, transcriptomics, proteomics, and metabolomics, each analysis can generate tera- to peta-byte sized data files on a daily basis. These data file sizes, together with differences in nomenclature among these data types, make the integration of these multi-dimensional omics data into biologically meaningful context challenging. Variously named as integrated omics, multi-omics, poly-omics, trans-omics, pan-omics, or shortened to just 'omics', the challenges include differences in data cleaning, normalization, biomolecule identification, data dimensionality reduction, biological contextualization, statistical validation, data storage and handling, sharing, and data archiving. The ultimate goal is towards the holistic realization of a 'systems biology' understanding of the biological question in hand. Commonly used approaches in these efforts are currently limited by the 3 i's - integration, interpretation, and insights. Post integration, these very large datasets aim to yield unprecedented views of cellular systems at exquisite resolution for transformative insights into processes, events, and diseases through various computational and informatics frameworks. With the continued reduction in costs and processing time for sample analyses, and increasing types of omics datasets generated such as glycomics, lipidomics, microbiomics, and phenomics, an increasing number of scientists in this interdisciplinary domain of bioinformatics face these challenges. We discuss recent approaches, existing tools, and potential caveats in the integration of omics datasets for development of standardized analytical pipelines that could be adopted by the global omics research community.
Affiliation(s)
- Biswapriya B Misra
  - B Misra, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States
- Carl D Langefeld
  - C Langefeld, Biostatistical Sciences, Wake Forest University School of Medicine, Winston-Salem, United States
- Michael Olivier
  - M Olivier, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States
- Laura A Cox
  - L Cox, Internal Medicine, Wake Forest University School of Medicine, Winston-Salem, United States
31
Ruggles KV, Krug K, Wang X, Clauser KR, Wang J, Payne SH, Fenyö D, Zhang B, Mani DR. Methods, Tools and Current Perspectives in Proteogenomics. Mol Cell Proteomics 2017; 16:959-981. [PMID: 28456751 DOI: 10.1074/mcp.mr117.000024] [Citation(s) in RCA: 95] [Impact Index Per Article: 11.9] [Received: 04/13/2017] [Indexed: 12/20/2022] Open
Abstract
With combined technological advancements in high-throughput next-generation sequencing and deep mass spectrometry-based proteomics, proteogenomics, i.e. the integrative analysis of proteomic and genomic data, has emerged as a new research field. Early efforts in the field were focused on improving protein identification using sample-specific genomic and transcriptomic sequencing data. More recently, integrative analysis of quantitative measurements from genomic and proteomic studies have identified novel insights into gene expression regulation, cell signaling, and disease. Many methods and tools have been developed or adapted to enable an array of integrative proteogenomic approaches and in this article, we systematically classify published methods and tools into four major categories, (1) Sequence-centric proteogenomics; (2) Analysis of proteogenomic relationships; (3) Integrative modeling of proteogenomic data; and (4) Data sharing and visualization. We provide a comprehensive review of methods and available tools in each category and highlight their typical applications.
Affiliation(s)
- Kelly V Ruggles
  - Department of Medicine, New York University School of Medicine, New York, New York 10016
- Karsten Krug
  - The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142
- Xiaojing Wang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
- Karl R Clauser
  - The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142
- Jing Wang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
- Samuel H Payne
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354
- David Fenyö
  - Department of Biochemistry and Molecular Pharmacology, New York University School of Medicine, New York, New York 10016; Institute for Systems Genetics, New York University School of Medicine, New York, New York 10016
- Bing Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
- D R Mani
  - The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142