1
Zhu B, Zhang Z, Leung SY, Fan X. NetMIM: network-based multi-omics integration with block missingness for biomarker selection and disease outcome prediction. Brief Bioinform 2024; 25:bbae454. [PMID: 39288230 PMCID: PMC11407451 DOI: 10.1093/bib/bbae454] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/16/2023] [Revised: 07/24/2024] [Accepted: 08/30/2024] [Indexed: 09/19/2024]
Abstract
Compared with analyzing omics data from a single platform, an integrative analysis of multi-omics data provides a more comprehensive understanding of the regulatory relationships among biological features associated with complex diseases. However, most existing frameworks for integrative analysis overlook two crucial aspects of multi-omics data. First, they neglect the known dependencies among biological features that are recorded in highly credible biological databases. Second, to handle block missingness, most existing integrative frameworks simply remove the subjects without full omics data, which reduces statistical power. To overcome these issues, we propose a network-based integrative Bayesian framework for biomarker selection and disease outcome prediction based on multi-omics data. Our framework uses a Dirac spike-and-slab variable selection prior to identify a small subset of biomarkers. The incorporation of gene pathway information improves the interpretability of feature selection. Furthermore, following the strategy of the FBM (short for "full Bayesian model with missingness") model, in which missing omics data are augmented via a mechanistic model, our framework handles block missingness in multi-omics data via a data augmentation approach. The real-data application illustrates that our approach, which incorporates existing gene pathway information and includes subjects without DNA methylation data, yields more interpretable feature selection results and more accurate predictions.
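As a rough illustration of the Dirac spike-and-slab prior named in this abstract, the sketch below draws coefficients that are exactly zero unless an inclusion indicator switches them on. It is a generic textbook form, not the NetMIM implementation: the function name, the independent Bernoulli indicators, and the values of `inclusion_prob` and `slab_sd` are assumptions made for the example, and the pathway network that the abstract says is incorporated is not modelled here.

```python
import random


def sample_spike_and_slab(n_features, inclusion_prob, slab_sd, seed=0):
    """Draw coefficients from a generic Dirac spike-and-slab prior.

    Each coefficient is exactly 0.0 (the Dirac spike) unless its inclusion
    indicator is 1, in which case it is drawn from a N(0, slab_sd^2) slab.
    Indicators are independent Bernoulli draws (an illustrative assumption).
    """
    rng = random.Random(seed)
    indicators = [1 if rng.random() < inclusion_prob else 0
                  for _ in range(n_features)]
    coefficients = [rng.gauss(0.0, slab_sd) if g else 0.0
                    for g in indicators]
    return indicators, coefficients


# Hypothetical settings: 1000 candidate features, about 5% expected to be selected.
indicators, coefficients = sample_spike_and_slab(
    n_features=1000, inclusion_prob=0.05, slab_sd=2.0
)
selected = [j for j, g in enumerate(indicators) if g == 1]
print(f"{len(selected)} of {len(indicators)} features have nonzero prior draws")
```

Because the spike is a point mass at zero, unselected features drop out of the model entirely, which is what makes the selected subset small and easy to read.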
Affiliation(s)
- Bencong Zhu
  - Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
- Zhen Zhang
  - Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
- Suet Yi Leung
  - Department of Pathology, School of Clinical Medicine, LKS Faculty of Medicine, The University of Hong Kong, Queen Mary Hospital, Hong Kong SAR, China
- Xiaodan Fan
  - Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
2
Courbariaux M, De Santiago K, Dalmasso C, Danjou F, Bekadar S, Corvol JC, Martinez M, Szafranski M, Ambroise C. A Sparse Mixture-of-Experts Model With Screening of Genetic Associations to Guide Disease Subtyping. Front Genet 2022; 13:859462. [PMID: 35734430 PMCID: PMC9207464 DOI: 10.3389/fgene.2022.859462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Accepted: 04/21/2022] [Indexed: 11/27/2022]
Abstract
Motivation: Identifying new genetic associations in non-Mendelian complex diseases is an increasingly difficult challenge. These diseases sometimes appear to have a significant component of heritability requiring explanation, and this missing heritability may be due to the existence of subtypes involving different genetic factors. Taking genetic information into account in clinical trials might potentially have a role in guiding the process of subtyping a complex disease. Most methods dealing with multiple sources of information rely on data transformation, and in disease subtyping, the two main strategies used are 1) the clustering of clinical data followed by posterior genetic analysis and 2) the concomitant clustering of clinical and genetic variables. Both of these strategies have limitations that we propose to address. Contribution: This work proposes an original method for disease subtyping on the basis of both longitudinal clinical variables and high-dimensional genetic markers via a sparse mixture-of-regressions model. The added value of our approach lies in its interpretability in relation to two aspects. First, our model links both clinical and genetic data with regard to their initial nature (i.e., without transformation) and does not require post-processing where the original information is accessed a second time to interpret the subtypes. Second, it can address large-scale problems because of a variable selection step that is used to discard genetic variables that may not be relevant for subtyping. Results: The proposed method was validated on simulations. A dataset from a cohort of Parkinson's disease patients was also analyzed. Several subtypes of the disease and genetic variants that potentially have a role in this typology were identified. Software availability: The R code for the proposed method, named DiSuGen, and a tutorial are available for download (see the references).
Affiliation(s)
- Marie Courbariaux
  - Université Paris-Saclay, CNRS, Université d’Évry, Laboratoire de Mathématiques et Modélisation d’Évry, Évry-Courcouronnes, France
- Kylliann De Santiago
  - Université Paris-Saclay, CNRS, Université d’Évry, Laboratoire de Mathématiques et Modélisation d’Évry, Évry-Courcouronnes, France
- Cyril Dalmasso
  - Université Paris-Saclay, CNRS, Université d’Évry, Laboratoire de Mathématiques et Modélisation d’Évry, Évry-Courcouronnes, France
- Fabrice Danjou
  - Sorbonne Université, Paris Brain Institute–ICM, Inserm, CNRS, Assistance Publique Hôpitaux de Paris, Pitié-Salpêtrière Hospital, Department of Neurology, Paris, France
- Samir Bekadar
  - Sorbonne Université, Paris Brain Institute–ICM, Inserm, CNRS, Assistance Publique Hôpitaux de Paris, Pitié-Salpêtrière Hospital, Department of Neurology, Paris, France
- Jean-Christophe Corvol
  - Sorbonne Université, Paris Brain Institute–ICM, Inserm, CNRS, Assistance Publique Hôpitaux de Paris, Pitié-Salpêtrière Hospital, Department of Neurology, Paris, France
- Maria Martinez
  - Institut de Recherche en Santé Digestive, Inserm, CHU Purpan, Toulouse, France
- Marie Szafranski
  - Université Paris-Saclay, CNRS, Université d’Évry, Laboratoire de Mathématiques et Modélisation d’Évry, Évry-Courcouronnes, France
  - ENSIIE, Évry-Courcouronnes, France
- Christophe Ambroise
  - Université Paris-Saclay, CNRS, Université d’Évry, Laboratoire de Mathématiques et Modélisation d’Évry, Évry-Courcouronnes, France
3
Ma Y, Sun Z, Zeng P, Zhang W, Lin Z. JSNMF enables effective and accurate integrative analysis of single-cell multiomics data. Brief Bioinform 2022; 23:6563185. [PMID: 35380624 DOI: 10.1093/bib/bbac105] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/12/2021] [Revised: 02/25/2022] [Accepted: 03/02/2022] [Indexed: 01/18/2023]
Abstract
Single-cell multiomics technologies provide an unprecedented opportunity to study cellular heterogeneity from different layers of transcriptional regulation. However, the datasets generated from these technologies tend to have high levels of noise, making data analysis challenging. Here, we propose jointly semi-orthogonal nonnegative matrix factorization (JSNMF), which is a versatile toolkit for the integrative analysis of transcriptomic and epigenomic data profiled from the same cell. JSNMF enables data visualization and clustering of the cells and also facilitates downstream analysis, including the characterization of markers and functional pathway enrichment analysis. The core of JSNMF is an unsupervised method based on jointly semi-orthogonal nonnegative matrix factorization. It assumes different latent variables for the two molecular modalities and integrates the information of transcriptomic and epigenomic data with consensus graph fusion, which better tackles the distinct characteristics and levels of noise across different molecular modalities in single-cell multiomics data. We applied JSNMF to single-cell multiomics datasets from different tissues and different technologies. The results demonstrate the superior performance of JSNMF in clustering and data visualization of the cells. JSNMF also allows joint analysis of multiple single-cell multiomics experiments and single-cell multiomics data with more than two modalities profiled on the same cell. JSNMF also provides rich biological insight into the markers, cell-type-specific region-gene associations and the functions of the identified cell subpopulations.
Affiliation(s)
- Yuanyuan Ma
  - School of Computer & Information Engineering, Anyang Normal University, Anyang Henan, China
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
- Zexuan Sun
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
- Pengcheng Zeng
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
- Wenyu Zhang
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
- Zhixiang Lin
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
4
Demirel HC, Arici MK, Tuncbag N. Computational approaches leveraging integrated connections of multi-omic data toward clinical applications. Mol Omics 2021; 18:7-18. [PMID: 34734935 DOI: 10.1039/d1mo00158b] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 01/18/2023]
Abstract
In line with the advances in high-throughput technologies, multiple omic datasets have accumulated to study biological systems and diseases coherently. No single omics data type is capable of fully representing cellular activity. The complexity of the biological processes arises from the interactions between omic entities such as genes, proteins, and metabolites. Therefore, multi-omic data integration is crucial but challenging. The impact of the molecular alterations in multi-omic data is not local to the neighborhood of the altered gene or protein; rather, the impact diffuses through the network and changes the functionality of multiple signaling pathways and the regulation of gene expression. Additionally, multi-omic data are high-dimensional and contain background noise. Several integrative approaches have been developed to accurately interpret multi-omic datasets, including machine learning, network-based methods, and their combination. In this review, we survey the most recent integrative approaches and tools with a focus on network-based methods. We then discuss these approaches according to their specific applications, from disease-network and biomarker identification to patient stratification, drug discovery, and repurposing.
Affiliation(s)
- Habibe Cansu Demirel
  - Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey
- Muslum Kaan Arici
  - Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey
  - Foot and Mouth Diseases Institute, Ministry of Agriculture and Forestry, Ankara, 06044, Turkey
- Nurcan Tuncbag
  - Chemical and Biological Engineering, College of Engineering, Koc University, Istanbul, 34450, Turkey
  - School of Medicine, Koc University, Istanbul, 34450, Turkey
  - Koc University Research Center for Translational Medicine (KUTTAM), Istanbul, Turkey
5
Liñares-Blanco J, Pazos A, Fernandez-Lozano C. Machine learning analysis of TCGA cancer data. PeerJ Comput Sci 2021; 7:e584. [PMID: 34322589 PMCID: PMC8293929 DOI: 10.7717/peerj-cs.584] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 03/03/2021] [Accepted: 05/17/2021] [Indexed: 06/13/2023]
Abstract
In recent years, machine learning (ML) researchers have shifted their focus towards biological problems that are difficult to analyse with standard approaches. Large initiatives such as The Cancer Genome Atlas (TCGA) have allowed the use of omic data for the training of these algorithms. In order to study the state of the art, this review covers the main works that have used ML with TCGA data. Firstly, the principal discoveries made by the TCGA consortium are presented. Once these bases have been established, we begin with the main objective of this study, the identification and discussion of those works that have used the TCGA data for the training of different ML approaches. After a review of more than 100 different papers, it has been possible to make a classification according to the following three pillars: the type of tumour, the type of algorithm and the predicted biological problem. One of the conclusions drawn in this work shows a high density of studies based on two major algorithms: Random Forest and Support Vector Machines. We also observe the rise in the use of deep artificial neural networks. It is worth emphasizing the increase in integrative models of multi-omic data analysis. The different biological conditions are a consequence of molecular homeostasis, driven by protein-coding regions, regulatory elements and the surrounding environment. It is notable that a large number of works make use of gene expression data, which has been found to be the data type preferred by researchers when training the different models. The biological problems addressed have been classified into five types: prognosis prediction, tumour subtypes, microsatellite instability (MSI), immunological aspects and certain pathways of interest. A clear trend was detected in the prediction of these conditions according to the type of tumour. This is why a greater number of works have focused on the BRCA cohort, while specific works for survival, for example, were centred on the GBM cohort, due to its large number of events. Throughout this review, it will be possible to go in depth into the works and the methodologies used to study TCGA cancer data. Finally, it is intended that this work will serve as a basis for future research in this field of study.
Affiliation(s)
- Jose Liñares-Blanco
  - CITIC-Research Center of Information and Communication Technologies, University of A Coruna, A Coruña, Spain
  - Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruna, A Coruña, Spain
- Alejandro Pazos
  - CITIC-Research Center of Information and Communication Technologies, University of A Coruna, A Coruña, Spain
  - Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruna, A Coruña, Spain
  - Grupo de Redes de Neuronas Artificiales y Sistemas Adaptativos. Imagen Médica y Diagnóstico Radiológico (RNASA-IMEDIR). Complexo Hospitalario Universitario de A Coruña (CHUAC), SERGAS, Universidade da Coruña, Instituto de Investigación Biomédica de A Coruña (INIBIC), A Coruña, Spain
- Carlos Fernandez-Lozano
  - CITIC-Research Center of Information and Communication Technologies, University of A Coruna, A Coruña, Spain
  - Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruna, A Coruña, Spain
  - Grupo de Redes de Neuronas Artificiales y Sistemas Adaptativos. Imagen Médica y Diagnóstico Radiológico (RNASA-IMEDIR). Complexo Hospitalario Universitario de A Coruña (CHUAC), SERGAS, Universidade da Coruña, Instituto de Investigación Biomédica de A Coruña (INIBIC), A Coruña, Spain
6
Wu M, Yi H, Ma S. Vertical integration methods for gene expression data analysis. Brief Bioinform 2021; 22:bbaa169. [PMID: 32793970 PMCID: PMC8138889 DOI: 10.1093/bib/bbaa169] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/24/2020] [Revised: 06/18/2020] [Accepted: 07/04/2020] [Indexed: 12/12/2022]
Abstract
Gene expression data have played an essential role in many biomedical studies. When the number of genes is large and sample size is limited, there is a 'lack of information' problem, leading to low-quality findings. To tackle this problem, both horizontal and vertical data integrations have been developed, where vertical integration methods collectively analyze data on gene expressions as well as their regulators (such as mutations, DNA methylation and miRNAs). In this article, we conduct a selective review of vertical data integration methods for gene expression data. The reviewed methods cover both marginal and joint analysis and supervised and unsupervised analysis. The main goal is to provide a sketch of the vertical data integration paradigm without digging into too many technical details. We also briefly discuss potential pitfalls, directions for future developments and application notes.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics
- Huangdi Yi
  - Department of Biostatistics at Yale University
- Shuangge Ma
  - Department of Biostatistics at Yale University
7
Liu S, Zhang Y, Shang X, Zhang Z. ProTICS reveals prognostic impact of tumor infiltrating immune cells in different molecular subtypes. Brief Bioinform 2021; 22:6271999. [PMID: 33963834 DOI: 10.1093/bib/bbab164] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 02/02/2021] [Revised: 04/05/2021] [Accepted: 04/02/2021] [Indexed: 12/15/2022]
Abstract
Different subtypes of the same cancer often show distinct genomic signatures and require targeted treatments. The differences at the cellular and molecular levels of the tumor microenvironment in different cancer subtypes have significant effects on tumor pathogenesis and prognostic outcomes. Although there has been significant research on the prognostic association of tumor infiltrating lymphocytes in selected histological subtypes, few investigations have systematically reported the prognostic impacts of immune cells in molecular subtypes, as quantified by machine learning approaches on multi-omics datasets. This paper describes a new computational framework, ProTICS, to quantify the differences in the proportion of immune cells in the tumor microenvironment and estimate their prognostic effects in different subtypes. First, we stratified patients into molecular subtypes based on gene expression and methylation profiles by applying a nonnegative tensor factorization technique. Then we quantified the proportion of cell types in each specimen using an mRNA-based deconvolution method. For tumors in each subtype, we estimated the prognostic effects of immune cell types by applying Cox proportional hazard regression. At the molecular level, we also predicted the prognosis of signature genes for each subtype. Finally, we benchmarked the performance of ProTICS on three TCGA datasets and another independent METABRIC dataset. ProTICS successfully stratified tumors into different molecular subtypes manifested by distinct overall survival. Furthermore, the different immune cell types showed distinct prognostic patterns with respect to molecular subtypes. This study provides new insights into the prognostic association between immune cells and molecular subtypes, showing the utility of immune cells as potential prognostic markers. Availability: R code is available at https://github.com/liu-shuhui/ProTICS.
Affiliation(s)
- Shuhui Liu
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710129, Shaanxi, China
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, M5S 3E1, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, M5S 2E4, ON, Canada
- Yupei Zhang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710129, Shaanxi, China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710129, Shaanxi, China
- Zhaolei Zhang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, M5S 3E1, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, M5S 2E4, ON, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, M5S 1A8, ON, Canada
8
Nazarov PV, Kreis S. Integrative approaches for analysis of mRNA and microRNA high-throughput data. Comput Struct Biotechnol J 2021; 19:1154-1162. [PMID: 33680358 PMCID: PMC7895676 DOI: 10.1016/j.csbj.2021.01.029] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 11/20/2020] [Revised: 01/19/2021] [Accepted: 01/20/2021] [Indexed: 12/11/2022]
Abstract
Review of tools and databases linking miRNAs and their mRNA targetomes. Databases show little overlap in miRNA targetome predictions, suggesting strong contextual effects. Deconvolution and deep learning are promising new approaches to improve miRNA targetome predictions.
Advanced sequencing technologies such as RNASeq provide the means for production of massive amounts of data, including transcriptome-wide expression levels of coding RNAs (mRNAs) and non-coding RNAs such as miRNAs, lncRNAs, piRNAs and many other RNA species. In silico analysis of datasets representing only one RNA species is well established, and a variety of tools and pipelines are available. However, attaining a more systematic view of how different players come together to regulate the expression of a gene or a group of genes requires a more intricate approach to data analysis. To fully understand complex transcriptional networks, datasets representing different RNA species need to be integrated. In this review, we focus on miRNAs as key post-transcriptional regulators, summarizing current computational approaches for miRNA:target gene prediction as well as new data-driven methods to tackle the problem of comprehensively and accurately dissecting miRNome-targetome interactions.
Key Words
- CCA, canonical correlation analysis
- CDS, coding sequence
- CLASH, cross-linking, ligation and sequencing of hybrids
- CLIP, cross-linking immunoprecipitation
- CNN, convolutional neural network
- Data integration
- GO, gene ontology
- ICA, independent component analysis
- Matrix factorization
- NGS, next-generation sequencing
- NMF, non-negative matrix factorization
- PCA, principal component analysis
- RNASeq, high-throughput RNA sequencing
- TDMD, target RNA-directed miRNA degradation
- TF, transcription factors
- Target prediction
- Transcriptomics
- circRNA, circular RNA
- lncRNA, long non-coding RNA
- mRNA, messenger RNA
- miRNA, microRNA
- microRNA
Affiliation(s)
- Petr V Nazarov
  - Multiomics Data Science Research Group, Department of Oncology & Quantitative Biology Unit, Luxembourg Institute of Health (LIH), Strassen L-1445, Luxembourg
- Stephanie Kreis
  - Signal Transduction Group, Department of Life Sciences and Medicine, University of Luxembourg, Belvaux L-4367, Luxembourg
9
Abstract
In recent biomedical studies, multidimensional profiling, which collects proteomics as well as other types of omics data on the same subjects, is becoming increasingly popular. Proteomics, transcriptomics, genomics, epigenomics, and other types of data contain overlapping as well as independent information, which suggests the possibility of integrating multiple types of data to generate more reliable findings/models with better classification/prediction performance. In this chapter, a selective review is conducted on recent data integration techniques for both unsupervised and supervised analysis. The main objective is to provide the "big picture" of data integration that involves proteomics data and discuss the "intuition" behind the recently developed approaches without invoking too many mathematical details. Potential pitfalls and possible directions for future developments are also discussed.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Yu Jiang
  - School of Public Health, University of Memphis, Memphis, TN, USA
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, USA
10
Afrin K, Iquebal AS, Karimi M, Souris A, Lee SY, Mallick BK. Directionally dependent multi-view clustering using copula model. PLoS One 2020; 15:e0238996. [PMID: 33095785 PMCID: PMC7584221 DOI: 10.1371/journal.pone.0238996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2020] [Accepted: 08/27/2020] [Indexed: 11/23/2022]
Abstract
Recent developments in high-throughput methods have resulted in the collection of high-dimensional data types from multiple sources and technologies that measure distinct yet complementary information. Integrated clustering of such multiple data types, or multi-view clustering, is critical for revealing pathological insights. However, multi-view clustering is challenging due to the complex dependence structure between multiple data types, including directional dependency. Specifically, genomics data types have pre-specified directional dependencies known as the central dogma, which describes the process of information flow from DNA to messenger RNA (mRNA) and then from mRNA to protein. Most of the existing multi-view clustering approaches assume an independent structure or pair-wise (non-directional) dependence between data types, thereby ignoring their directional relationship. Motivated by this, we propose a biology-inspired Bayesian integrated multi-view clustering model that uses an asymmetric copula to accommodate the directional dependencies between the data types. Via extensive simulation experiments, we demonstrate the negative impact of ignoring directional dependency on clustering performance. We also present an application of our model to a real-world dataset of breast cancer tumor samples collected from The Cancer Genome Atlas program and provide comparative results.
Affiliation(s)
- Kahkashan Afrin
  - Department of Industrial & Systems Engineering, Texas A&M University, College Station, TX, United States of America
- Ashif S. Iquebal
  - Department of Industrial & Systems Engineering, Texas A&M University, College Station, TX, United States of America
- Mostafa Karimi
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, United States of America
- Allyson Souris
  - Department of Statistics, Texas A&M University, College Station, TX, United States of America
- Se Yoon Lee
  - Department of Statistics, Texas A&M University, College Station, TX, United States of America
- Bani K. Mallick
  - Department of Statistics, Texas A&M University, College Station, TX, United States of America
11
Gao X, Lee S, Li G, Jung S. Covariate-driven factorization by thresholding for multiblock data. Biometrics 2020; 77:1011-1023. [PMID: 32799349 DOI: 10.1111/biom.13352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2019] [Revised: 08/03/2020] [Accepted: 08/07/2020] [Indexed: 11/30/2022]
Abstract
Multiblock data, where multiple groups of variables from different sources are observed for a common set of subjects, are routinely collected in many areas of science. Methods for joint factorization of such multiblock data are being developed to explore the potentially joint variation structure of the data. While most of the existing work focuses on separating joint components, which are shared across all data blocks, from individual components, which are relevant only to a single data block, we propose to model and estimate partially joint components across some, but not all, data blocks. If covariates, with potential multiblock structures, are available, then the components are further modeled to be driven by the covariate information. To estimate such a covariate-driven, block-structured factor model, we propose an iterative algorithm based on thresholding, by transforming the problem of signal segmentation into a grouped variable selection problem. The proposed factorization provides accurate estimation of individual and (partially) joint structures in multiblock data, as confirmed by simulation studies. In the analysis of a real multiblock genomic dataset from the Cancer Genome Atlas project, we demonstrate that the estimated block structures provide straightforward interpretation and facilitate subsequent analyses.
Affiliation(s)
- Xing Gao
  - Department of Statistics, University of Pittsburgh, Pittsburgh, Pennsylvania
- Sungwon Lee
  - Food and Drug Administration, White Oak, Maryland
- Gen Li
  - Department of Biostatistics, Columbia University, New York, New York
- Sungkyu Jung
  - Department of Statistics, Seoul National University, Seoul, Korea
12
Pucher BM, Zeleznik OA, Thallinger GG. Comparison and evaluation of integrative methods for the analysis of multilevel omics data: a study based on simulated and experimental cancer data. Brief Bioinform 2020; 20:671-681. [PMID: 29688321 DOI: 10.1093/bib/bby027] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 12/05/2017] [Revised: 03/02/2018] [Indexed: 12/12/2022]
Abstract
Integrative analysis aims to identify the driving factors of a biological process by the joint exploration of data from multiple cellular levels. The volume of omics data produced is constantly increasing, as is the collection of tools for its analysis. Comparative studies assessing performance and the biological value of results, however, are rare but in great demand. We present a comprehensive comparison of three integrative analysis approaches, sparse canonical correlation analysis (sCCA), non-negative matrix factorization (NMF) and the logic data mining method MicroArray Logic Analyzer (MALA), by applying them to simulated and experimental omics data. We find that sCCA and NMF are able to identify differential features in simulated data, while the logic data mining method, MALA, falls short. Applied to experimental data, we show that MALA performs best in terms of sample classification accuracy, and in general, the classification power of prioritized feature sets is high (97.1-99.5% accuracy). The proportion of features identified by at least one of the other methods, however, is approximately 60% for sCCA and NMF and nearly 30% for MALA, and the proportion of features jointly identified by all methods is only around 16%. Similarly, the congruence on functional levels (Gene Ontology, Reactome) is low. Furthermore, the agreement of identified feature sets with curated gene signatures relevant to the investigated disease is modest. We discuss possible reasons for the moderate overlap of identified feature sets with each other and with curated cancer signatures. The R code to create simulated data, results and figures is provided at https://github.com/ThallingerLab/IamComparison.
Affiliation(s)
- Bettina M Pucher
  - Institute of Computational Biotechnology, Graz University of Technology, Petersgasse 14, 8010 Graz, Austria; Omics Center Graz, BioTechMed-Graz, Stiftingtalstrasse 24, 8010 Graz, Austria
- Oana A Zeleznik
  - Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Ave, Boston MA 02115, USA
- Gerhard G Thallinger
  - Institute of Computational Biotechnology, Graz University of Technology, Petersgasse 14, 8010 Graz, Austria; Omics Center Graz, BioTechMed-Graz, Stiftingtalstrasse 24, 8010 Graz, Austria
|
13
|
Integrative Deep Learning for Identifying Differentially Expressed (DE) Biomarkers. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2019; 2019:8418760. [PMID: 31915462 PMCID: PMC6935456 DOI: 10.1155/2019/8418760] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 04/05/2019] [Revised: 06/19/2019] [Accepted: 08/04/2019] [Indexed: 11/17/2022]
Abstract
As large amounts of genetic data accumulate, effective analytical methods and meaningful interpretation are required. Recently, various machine learning methods have emerged to process genetic data. In addition, machine learning analysis tools using statistical models have been proposed. In this study, we propose adding an integrated layer to the deep learning structure, which would enable the effective analysis of genetic data and the discovery of significant biomarkers of diseases. We conducted a simulation study in order to compare the proposed method with metalogistic regression and meta-SVM methods. The objective function with lasso penalty is used for parameter estimation, and the Youden J index is used for model comparison. The simulation results indicate that the proposed method is more robust to the variance of the data than metalogistic regression and meta-SVM methods. We also analyzed real data (TCGA breast cancer data). Based on the results of gene set enrichment analysis, we found that the TCGA multi-omics data involve significantly enriched pathways that contain information related to breast cancer. Therefore, it is expected that the proposed method will be helpful in discovering biomarkers.
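The Youden J index mentioned in this abstract is defined as J = sensitivity + specificity - 1. The following is a minimal Python sketch of that definition, assuming binary labels coded as 0/1; the function name `youden_j` and the toy labels are illustrative assumptions and are not taken from the cited paper.

```python
def youden_j(y_true, y_pred):
    """Return Youden's J = sensitivity + specificity - 1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Hypothetical toy labels: one of two positives is missed, no false positives.
j_example = youden_j([1, 1, 0, 0], [1, 0, 0, 0])
print(j_example)
```

J ranges from -1 to 1; a classifier that performs no better than chance scores 0, and a perfect classifier scores 1.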
|
14
|
Fang Z, Ma T, Tang G, Zhu L, Yan Q, Wang T, Celedón JC, Chen W, Tseng GC. Bayesian integrative model for multi-omics data with missingness. Bioinformatics 2019; 34:3801-3808. [PMID: 30184058 DOI: 10.1093/bioinformatics/bty775] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 01/15/2018] [Accepted: 08/31/2018] [Indexed: 01/14/2023] Open
Abstract
Motivation: Integrative analysis of multi-omics data from different high-throughput experimental platforms provides valuable insight into regulatory mechanisms associated with complex diseases, and gains statistical power to detect markers that are otherwise overlooked by single-platform omics analysis. In practice, a significant portion of samples may not be measured completely due to insufficient tissues or restricted budget (e.g. gene expression profiles are measured but not methylation). Current multi-omics integrative methods require complete data. A common practice is to ignore samples with any missing platform and perform complete case analysis, which leads to substantial loss of statistical power. Methods: In this article, inspired by the popular Integrative Bayesian Analysis of Genomics data (iBAG), we propose a full Bayesian model that allows incorporation of samples with missing omics data. Results: Simulation results show improvement of the new full Bayesian approach in terms of outcome prediction accuracy and feature selection performance when sample size is limited and proportion of missingness is large. When sample size is large or the proportion of missingness is low, incorporating samples with missingness may introduce extra inference uncertainty and generate worse prediction and feature selection performance. To determine whether and how to incorporate samples with missingness, we propose a self-learning cross-validation (CV) decision scheme. Simulations and a real application on a child asthma dataset demonstrate superior performance of the CV decision scheme when various types of missing mechanisms are evaluated. Availability and implementation: Freely available on GitHub at https://github.com/CHPGenetics/FBM. Supplementary information: Supplementary data are available at Bioinformatics online.
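The complete case analysis that this abstract contrasts with can be illustrated with a small Python sketch. It only shows how samples with a missing omics block are discarded; it is not the full Bayesian model, and the helper `complete_cases`, the block names, and the toy cohort are hypothetical.

```python
def complete_cases(samples):
    """Keep only samples in which every omics block is observed (not None)."""
    return [s for s in samples if all(block is not None for block in s.values())]


# Hypothetical toy cohort: methylation is missing as a whole block for two subjects.
cohort = [
    {"expression": [1.2, 0.4], "methylation": [0.8, 0.1]},
    {"expression": [0.9, 1.1], "methylation": None},
    {"expression": [0.3, 0.7], "methylation": None},
    {"expression": [1.5, 0.2], "methylation": [0.6, 0.3]},
]

kept = complete_cases(cohort)
retained_fraction = len(kept) / len(cohort)
print(f"retained {len(kept)} of {len(cohort)} samples ({retained_fraction:.0%})")
```

In this toy cohort half of the subjects are dropped even though their expression data are fully observed, which is the loss of power that motivates models able to use incomplete samples.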
Affiliation(s)
- Zhou Fang
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA
- Tianzhou Ma
  - Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD, USA
- Gong Tang
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA
- Li Zhu
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA
- Qi Yan
  - Division of Pediatric Pulmonology, Allergy and Immunology, Children's Hospital of Pittsburgh of UPMC, Pittsburgh, USA
- Ting Wang
  - Division of Pediatric Pulmonology, Allergy and Immunology, Children's Hospital of Pittsburgh of UPMC, Pittsburgh, USA
- Juan C Celedón
  - Division of Pediatric Pulmonology, Allergy and Immunology, Children's Hospital of Pittsburgh of UPMC, Pittsburgh, USA
- Wei Chen
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA; Division of Pediatric Pulmonology, Allergy and Immunology, Children's Hospital of Pittsburgh of UPMC, Pittsburgh, USA
- George C Tseng
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA
|
15
|
Duan R, Gao L, Xu H, Song K, Hu Y, Wang H, Dong Y, Zhang C, Jia S. CEPICS: A Comparison and Evaluation Platform for Integration Methods in Cancer Subtyping. Front Genet 2019; 10:966. [PMID: 31649733 PMCID: PMC6792302 DOI: 10.3389/fgene.2019.00966] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/27/2019] [Accepted: 09/10/2019] [Indexed: 11/17/2022] Open
Abstract
Cancer subtypes can improve our understanding of cancer, and suggest more precise treatment for patients. Multi-omics molecular data can characterize cancers at different levels. Up to now, many computational methods that integrate multi-omics data for cancer subtyping have been proposed. However, there are no consistent criteria to evaluate the integration methods due to the lack of gold standards (e.g., the number of subtypes in a specific cancer). Since comprehensive evaluation and comparison between different methods serves as a useful tool or guideline for users to select an optimal method for their own purpose, we develop a scalable platform, CEPICS, for comprehensively evaluating and comparing multi-omics data integration methods in cancer subtyping. Given a user-specified maximum number of subtypes, k-max, CEPICS provides (1) cancer subtyping results using up to five built-in state-of-the-art integration methods under the number of subtypes from two to k-max, (2) a report including the evaluation of each user-selected method and comparisons across them using clustering performance metrics and clinical survival analysis, and (3) an overall analysis of subtyping results by different methods representing a robust cancer subtype prediction for samples. Furthermore, users can upload subtyping results of their own methods to compare with the built-in methods. CEPICS is implemented as an R package and is freely available at https://github.com/GaoLabXDU/CEPICS.
Affiliation(s)
- Ran Duan
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Lin Gao
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Han Xu
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Kuo Song
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Yuxuan Hu
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Hongda Wang
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Yongqiang Dong
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Chenxing Zhang
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Songwei Jia
  - School of Computer Science and Technology, Xidian University, Xi'an, China
|
16
|
Majellano EC, Clark VL, Winter NA, Gibson PG, McDonald VM. Approaches to the assessment of severe asthma: barriers and strategies. J Asthma Allergy 2019; 12:235-251. [PMID: 31692528 PMCID: PMC6712210 DOI: 10.2147/jaa.s178927] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 06/04/2019] [Accepted: 07/25/2019] [Indexed: 12/12/2022] Open
Abstract
Asthma is a chronic condition with great variability. It is characterized by intermittent episodes of wheeze, cough, chest tightness and dyspnea, accompanied by variable airflow limitation, airway inflammation and airway hyper-responsiveness. Asthma severity varies between individuals and may change over time. Stratification of asthma severity is an integral part of asthma management, linking appropriate treatment to establish asthma control. Precision assessment of severe asthma is crucial for monitoring the health of people with this disease. The literature suggests multiple factors that impede the assessment of severe asthma; these can be grouped into health care professional, patient and organizational related barriers. These barriers do not exist in isolation but interact and influence one another. Recognition of these barriers is necessary to promote precision in the assessment and management of severe asthma in the era of targeted therapy. In this review, we discuss the current knowledge of the barriers that impede assessment in severe asthma and recommend potential strategies for overcoming these barriers. We highlight the relevance of multidimensional assessment as an ideal approach to the assessment and management of severe asthma.
Affiliation(s)
- Eleanor C Majellano
  - Faculty of Health and Medicine, National Health and Medical Research Council Centre for Research Excellence in Severe Asthma and the Priority Research Centre for Healthy Lungs, The University of Newcastle, Newcastle, NSW, Australia
  - Faculty of Health and Medicine, School of Nursing and Midwifery, The University of Newcastle, Newcastle, NSW, Australia
- Vanessa L Clark
  - Faculty of Health and Medicine, National Health and Medical Research Council Centre for Research Excellence in Severe Asthma and the Priority Research Centre for Healthy Lungs, The University of Newcastle, Newcastle, NSW, Australia
  - Faculty of Health and Medicine, School of Nursing and Midwifery, The University of Newcastle, Newcastle, NSW, Australia
- Natasha A Winter
  - Faculty of Health and Medicine, National Health and Medical Research Council Centre for Research Excellence in Severe Asthma and the Priority Research Centre for Healthy Lungs, The University of Newcastle, Newcastle, NSW, Australia
  - Faculty of Health and Medicine, School of Medicine and Public Health, The University of Newcastle, Newcastle, NSW, Australia
- Peter G Gibson
  - Faculty of Health and Medicine, National Health and Medical Research Council Centre for Research Excellence in Severe Asthma and the Priority Research Centre for Healthy Lungs, The University of Newcastle, Newcastle, NSW, Australia
  - Department of Respiratory and Sleep Medicine, John Hunter Hospital, Hunter Medical Research Institute, Newcastle, NSW, Australia
- Vanessa M McDonald
  - Faculty of Health and Medicine, National Health and Medical Research Council Centre for Research Excellence in Severe Asthma and the Priority Research Centre for Healthy Lungs, The University of Newcastle, Newcastle, NSW, Australia
  - Faculty of Health and Medicine, School of Nursing and Midwifery, The University of Newcastle, Newcastle, NSW, Australia
  - Department of Respiratory and Sleep Medicine, John Hunter Hospital, Hunter Medical Research Institute, Newcastle, NSW, Australia
|
17
|
Kim H, Kim Y, Han B, Jang JY, Kim Y. Clinically Applicable Deep Learning Algorithm Using Quantitative Proteomic Data. J Proteome Res 2019; 18:3195-3202. [DOI: 10.1021/acs.jproteome.9b00268] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 12/21/2022]
|
18
|
Mirza B, Wang W, Wang J, Choi H, Chung NC, Ping P. Machine Learning and Integrative Analysis of Biomedical Big Data. Genes (Basel) 2019; 10:E87. [PMID: 30696086 PMCID: PMC6410075 DOI: 10.3390/genes10020087] [Citation(s) in RCA: 182] [Impact Index Per Article: 30.3] [Received: 12/02/2018] [Revised: 01/08/2019] [Accepted: 01/21/2019] [Indexed: 12/11/2022] Open
Abstract
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Affiliation(s)
- Bilal Mirza
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Wei Wang
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Jie Wang
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Howard Choi
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Neo Christopher Chung
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland.
- Peipei Ping
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
  - Department of Medicine (Cardiology), University of California Los Angeles, Los Angeles, CA 90095, USA.
|
19
|
Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A Selective Review of Multi-Level Omics Data Integration Using Variable Selection. High Throughput 2019; 8:E4. [PMID: 30669303 PMCID: PMC6473252 DOI: 10.3390/ht8010004] [Citation(s) in RCA: 127] [Impact Index Per Article: 21.2] [Received: 11/22/2018] [Revised: 12/24/2018] [Accepted: 01/10/2019] [Indexed: 01/02/2023] Open
Abstract
High-throughput technologies have been used to generate a large amount of omics data. In the past, single-level analysis has been extensively conducted where the omics measurements at different levels, including mRNA, microRNA, CNV and DNA methylation, are analyzed separately. As the molecular complexity of disease etiology exists at all different levels, integrative analysis offers an effective way to borrow strength across multi-level omics data and can be more powerful than single-level analysis. In this article, we focus on reviewing existing multi-omics integration studies by paying special attention to variable selection methods. We first summarize published reviews on integrating multi-level omics data. Next, after a brief overview of variable selection methods, we review existing supervised, semi-supervised and unsupervised integrative analyses within parallel and hierarchical integration studies, respectively. The strengths and limitations of the methods are discussed in detail. No existing integration method dominates the rest. The computational aspects are also investigated. The review concludes with possible limitations and future directions for multi-level omics data integration.
Affiliation(s)
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Jie Ren
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Xiaoxi Li
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Yu Jiang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN 38152, USA.
- Shuangge Ma
  - Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06510, USA.
|
20
|
Min EJ, Chang C, Long Q. Generalized Bayesian Factor Analysis for Integrative Clustering with Applications to Multi-Omics Data. PROCEEDINGS OF THE ... INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS. IEEE INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS 2018; 2018:109-119. [PMID: 31106307 PMCID: PMC6521881 DOI: 10.1109/dsaa.2018.00021] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 12/13/2022]
Abstract
Integrative clustering is a clustering approach for multiple datasets, which provide different views of a common group of subjects. It enables analyzing multi-omics data jointly to, for example, identify the subtypes of diseases, cells, and so on, capturing the complex underlying biological processes more precisely. On the other hand, there has been a great deal of interest in incorporating prior structural knowledge on the features into statistical analyses over the past decade. The knowledge on the gene regulatory network (pathways) can potentially be incorporated into many genomic studies. In this paper, we propose a novel integrative clustering method which can incorporate prior graph knowledge. We first develop a generalized Bayesian factor analysis (GBFA) framework, a sparse Bayesian factor analysis which can take into account the graph information. Our GBFA framework employs the spike and slab lasso (SSL) prior to impose sparsity on the factor loadings and the Markov random field (MRF) prior to encourage smoothing over the adjacent factor loadings, which establishes a unified shrinkage adaptive to the loading size and the graph structure. Then, we use the framework to extend iCluster+, a factor analysis based integrative clustering approach. A novel variational EM algorithm is proposed to efficiently compute the MAP estimator for the factor loadings. Extensive simulation studies and the application to the NCI60 cell line dataset demonstrate that the proposed method is superior and delivers more biologically meaningful outcomes.
Affiliation(s)
- Eun Jeong Min
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, USA
- Changgee Chang
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, USA
- Qi Long
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, USA
|
21
|
Sidhaye VK, Nishida K, Martinez FJ. Precision medicine in COPD: where are we and where do we need to go? Eur Respir Rev 2018; 27:180022. [PMID: 30068688 PMCID: PMC6156790 DOI: 10.1183/16000617.0022-2018] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Received: 03/15/2018] [Accepted: 06/18/2018] [Indexed: 12/15/2022] Open
Abstract
Chronic obstructive pulmonary disease (COPD) was the fourth leading cause of death worldwide in 2015. Current treatments for patients ease discomfort and help decrease disease progression; however, none improve lung function or change mortality. COPD is heterogeneous in its molecular and clinical presentation, making it difficult to understand disease aetiology and define robust therapeutic strategies. Given the complexity of the disease we propose a precision medicine approach to understanding and better treating COPD. It is possible that multiOMICs can be used as a tool to integrate data from multiple fields. Moreover, analysis of electronic medical records could aid in the treatment of patients and in the predictions of outcomes. The Precision Medicine Initiative created in 2015 has made precision medicine approaches to treat disease a reality; one of these diseases being COPD.
Affiliation(s)
- Venkataramana K. Sidhaye
  - Division of Pulmonary and Critical Care Medicine, Dept of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - Dept of Environmental Health and Engineering, Johns Hopkins School of Public Health, Baltimore, MD, USA
- Kristine Nishida
  - Division of Pulmonary and Critical Care Medicine, Dept of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Fernando J. Martinez
  - Division of Pulmonary and Critical Care Medicine, Dept of Medicine, University of Michigan Health System, Ann Arbor, MI, USA
|
22
|
Ma T, Song C, Tseng GC. Discussant paper on ‘Statistical contributions to bioinformatics: Design, modelling, structure learning and integration’. STAT MODEL 2017. [DOI: 10.1177/1471082x17705992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Affiliation(s)
- Tianzhou Ma
  - Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, USA
- Chi Song
  - Division of Biostatistics, College of Public Health, Ohio State University, Columbus, OH, USA
- George C. Tseng
  - Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, USA
|
23
|
Node-Structured Integrative Gaussian Graphical Model Guided by Pathway Information. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2017; 2017:8520480. [PMID: 28487748 PMCID: PMC5405575 DOI: 10.1155/2017/8520480] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/31/2016] [Revised: 02/20/2017] [Accepted: 03/06/2017] [Indexed: 12/23/2022]
Abstract
To date, many biological pathways related to cancer have been extensively applied thanks to the outputs of burgeoning biomedical research. This leads to a new technical challenge of exploring and validating biological pathways that can characterize transcriptomic mechanisms across different disease subtypes. To accommodate multiple studies, the joint Gaussian graphical model was previously proposed to incorporate nonzero edge effects. However, this model inevitably depends on post hoc analysis to confirm biological significance. To circumvent this drawback, we attempt not only to combine transcriptomic data but also to embed pathway information, as well-ascertained biological evidence, into the model. To this end, we propose a novel statistical framework for fitting a joint Gaussian graphical model simultaneously with informative pathways consistently expressed across multiple studies. In theory, structured nodes can be prespecified with multiple genes. The optimization rule employs the structured input-output lasso model in order to estimate a sparse precision matrix constructed by simultaneous effects of multiple studies and structured nodes. With an application to breast cancer data sets, we found that the proposed model is superior in efficiently capturing structures of biological evidence (e.g., pathways). An R software package, nsiGGM, is publicly available at the author's webpage.
|
24
|
Kim S, Jhong JH, Lee J, Koo JY. Meta-analytic support vector machine for integrating multiple omics data. BioData Min 2017; 10:2. [PMID: 28149325 PMCID: PMC5270233 DOI: 10.1186/s13040-017-0126-8] [Citation(s) in RCA: 99] [Impact Index Per Article: 12.4] [Received: 08/01/2016] [Accepted: 01/11/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Of late, high-throughput microarray and sequencing data have been extensively used to monitor biomarkers and biological processes related to many diseases. Under these circumstances, the support vector machine (SVM) has been widely and successfully used for gene selection in many applications. Despite the considerable benefits of SVMs, analysis of a single small- or mid-sized data set inevitably runs into the problem of low reproducibility and statistical power. To address this problem, we propose a meta-analytic support vector machine (Meta-SVM) that can accommodate multiple omics data, making it possible to detect consensus genes associated with diseases across studies. RESULTS: Experimental studies show that the Meta-SVM is superior to the existing meta-analysis method in detecting true signal genes. In real data applications, diverse omics data of breast cancer (TCGA) and mRNA expression data of lung disease (idiopathic pulmonary fibrosis; IPF) were applied. As a result, we identified gene sets consistently associated with the diseases across studies. In particular, the ascertained gene set of TCGA omics data was found to be significantly enriched in the ABC transporters pathways, well known as critical for the breast cancer mechanism. CONCLUSION: The Meta-SVM effectively achieves the purpose of meta-analysis by jointly leveraging multiple omics data, and facilitates identifying potential biomarkers and elucidating the disease process.
Affiliation(s)
- SungHwan Kim
  - Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea; Department of Statistics, Keimyung University, Dalseoku, Daegu, 42601 South Korea
- Jae-Hwan Jhong
  - Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea
- JungJun Lee
  - Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea
- Ja-Yong Koo
  - Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea
|