1
Li W, Ballard J, Zhao Y, Long Q. Knowledge-guided learning methods for integrative analysis of multi-omics data. Comput Struct Biotechnol J 2024; 23:1945-1950. [PMID: 38736693 PMCID: PMC11087912 DOI: 10.1016/j.csbj.2024.04.053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Revised: 04/17/2024] [Accepted: 04/18/2024] [Indexed: 05/14/2024]
Abstract
Integrative analysis of multi-omics data has the potential to yield valuable and comprehensive insights into the molecular mechanisms underlying complex diseases such as cancer and Alzheimer's disease. However, a number of analytical challenges complicate multi-omics data integration. For instance, -omics data are usually high-dimensional, and sample sizes in multi-omics studies tend to be modest. Furthermore, when genes in an important pathway have relatively weak signal, it can be difficult to detect them individually. There is a growing body of literature on knowledge-guided learning methods that can address these challenges by incorporating biological knowledge such as functional genomics and functional proteomics into multi-omics data analysis. These methods have been shown to outperform their counterparts that do not utilize biological knowledge in tasks including prediction, feature selection, clustering, and dimension reduction. In this review, we survey recently developed methods and applications of knowledge-guided multi-omics data integration methods and discuss future research directions.
Affiliation(s)
- Wenrui Li
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, 19104, PA, USA
- Jenna Ballard
  - Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, 19104, PA, USA
- Yize Zhao
  - Department of Biostatistics, School of Public Health, Yale University, 60 College Street, New Haven, 06510, CT, USA
- Qi Long
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, 19104, PA, USA
2
Mishra AK, Mahmud I, Lorenzi PL, Jenq RR, Wargo JA, Ajami NJ, Peterson CB. TARO: tree-aggregated factor regression for microbiome data integration. bioRxiv 2023:2023.10.17.562792. [PMID: 37904958 PMCID: PMC10614880 DOI: 10.1101/2023.10.17.562792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
Motivation Although the human microbiome plays a key role in health and disease, the biological mechanisms underlying the interaction between the microbiome and its host are incompletely understood. Integration with other molecular profiling data offers an opportunity to characterize the role of the microbiome and elucidate therapeutic targets. However, this remains challenging due to the high dimensionality, compositionality, and rare features found in microbiome profiling data. These challenges necessitate the use of methods that can achieve structured sparsity in learning cross-platform association patterns. Results We propose Tree-Aggregated factor RegressiOn (TARO) for the integration of microbiome and metabolomic data. We leverage information on the phylogenetic tree structure to flexibly aggregate rare features. We demonstrate through simulation studies that TARO accurately recovers a low-rank coefficient matrix and identifies relevant features. We applied TARO to microbiome and metabolomic profiles gathered from subjects being screened for colorectal cancer to understand how gut microorganisms shape intestinal metabolite abundances. Availability and implementation The R package TARO implementing the proposed methods is available online at https://github.com/amishra-stats/taro-package .
3
Downing T, Angelopoulos N. A primer on correlation-based dimension reduction methods for multi-omics analysis. J R Soc Interface 2023; 20:20230344. [PMID: 37817584 PMCID: PMC10565429 DOI: 10.1098/rsif.2023.0344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Accepted: 09/19/2023] [Indexed: 10/12/2023]
Abstract
The continuing advances of omic technologies mean that it is now more feasible to measure the numerous features collectively reflecting the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we focus on correlation-based dimension reduction approaches for single omic datasets, followed by methods for pairs of omics datasets, before detailing further techniques for three or more omic datasets. We also briefly detail network methods that apply when three or more omic datasets are available and that complement correlation-oriented tools. To aid readers new to this area, these are all linked to relevant R packages that can implement these procedures. Finally, we discuss scenarios of experimental design and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate emerging methods for multi-omics and integrate diverse omic datasets appropriately. This raises the opportunity of implementing population multi-omics with large sample sizes as omics technologies and our understanding improve.
Affiliation(s)
- Tim Downing
  - Pirbright Institute, Pirbright, Surrey, UK
  - Department of Biotechnology, Dublin City University, Dublin, Ireland
4
Valberg SJ, Velez-Irizarry D, Williams ZJ, Henry ML, Iglewski H, Herrick K, Fenger C. Enriched Pathways of Calcium Regulation, Cellular/Oxidative Stress, Inflammation, and Cell Proliferation Characterize Gluteal Muscle of Standardbred Horses between Episodes of Recurrent Exertional Rhabdomyolysis. Genes (Basel) 2022; 13:1853. [PMID: 36292738 PMCID: PMC9601720 DOI: 10.3390/genes13101853] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2022] [Revised: 10/06/2022] [Accepted: 10/07/2022] [Indexed: 11/04/2022]
Abstract
Certain Standardbred racehorses develop recurrent exertional rhabdomyolysis (RER-STD) for unknown reasons. We compared gluteal muscle histopathology and gene/protein expression between Standardbreds with a history of, but not currently experiencing, rhabdomyolysis (N = 9), and race-trained controls (N = 7). Eight RER-STD had a few mature fibers with small internalized myonuclei, one of nine had histologic evidence of regeneration, and none of nine had evidence of degeneration. However, RER-STD versus controls had 791/13,531 differentially expressed genes (DEG). The top three gene ontology (GO) enriched pathways for upregulated DEG (N = 433) were inflammation/immune response (62 GO terms), cell proliferation (31 GO terms), and hypoxia/oxidative stress (31 GO terms). Calcium ion regulation (39 GO terms), purine nucleotide metabolism (32 GO terms), and electron transport (29 GO terms) were the top three enriched GO pathways for down-regulated DEG (N = 305). DEG regulated RYR1 and sarcoplasmic reticulum calcium stores. Differentially expressed proteins (DEP ↑N = 50, ↓N = 12) involved the sarcomere (24% of DEP), electron transport (23%), metabolism (20%), inflammation (6%), cell/oxidative stress (7%), and other (17%). DEP included ↑superoxide dismutase, ↑catalase, and DEP/DEG included several cysteine-based antioxidants. In conclusion, gluteal muscle of RER-susceptible Standardbreds is characterized by perturbation of pathways for calcium regulation, cellular/oxidative stress, inflammation, and cellular regeneration weeks after an episode of rhabdomyolysis that could represent therapeutic targets.
Affiliation(s)
- Stephanie J. Valberg
  - Mary Anne McPhail Equine Performance Center, Department of Large Animal Clinical Sciences, College of Veterinary Medicine, Michigan State University, East Lansing, MI 48824, USA
- Deborah Velez-Irizarry
  - Mary Anne McPhail Equine Performance Center, Department of Large Animal Clinical Sciences, College of Veterinary Medicine, Michigan State University, East Lansing, MI 48824, USA
- Zoë J. Williams
  - Mary Anne McPhail Equine Performance Center, Department of Large Animal Clinical Sciences, College of Veterinary Medicine, Michigan State University, East Lansing, MI 48824, USA
- Marisa L. Henry
  - Mary Anne McPhail Equine Performance Center, Department of Large Animal Clinical Sciences, College of Veterinary Medicine, Michigan State University, East Lansing, MI 48824, USA
- Hailey Iglewski
  - Mary Anne McPhail Equine Performance Center, Department of Large Animal Clinical Sciences, College of Veterinary Medicine, Michigan State University, East Lansing, MI 48824, USA
- Keely Herrick
  - Mary Anne McPhail Equine Performance Center, Department of Large Animal Clinical Sciences, College of Veterinary Medicine, Michigan State University, East Lansing, MI 48824, USA
- Clara Fenger
  - Equine Integrated Medicine, PLC, Lexington, KY 40324, USA
5
Brault C, Lazerges J, Doligez A, Thomas M, Ecarnot M, Roumet P, Bertrand Y, Berger G, Pons T, François P, Le Cunff L, This P, Segura V. Interest of phenomic prediction as an alternative to genomic prediction in grapevine. Plant Methods 2022; 18:108. [PMID: 36064570 PMCID: PMC9442960 DOI: 10.1186/s13007-022-00940-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/21/2022] [Accepted: 07/24/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND Phenomic prediction has been defined as an alternative to genomic prediction by using spectra instead of molecular markers. A reflectance spectrum provides information on the biochemical composition within a tissue, itself being under genetic determinism. Thus, a relationship matrix built from spectra could potentially capture genetic signal. This new methodology has been mainly applied in several annual crop species but little is known so far about its value in perennial species. Besides, phenomic prediction has only been tested for a restricted set of traits, mainly related to yield or phenology. This study aims at applying phenomic prediction for the first time in grapevine, using spectra collected on two tissues and over two consecutive years, on two populations and for 15 traits related to berry composition, phenology, morphology and vigour. A major novelty of this study was to collect spectra and phenotypes several years apart from each other. First, we characterized the genetic signal in spectra and under which condition it could be maximized, then phenomic predictive ability was compared to genomic predictive ability. RESULTS For the first time, we showed that the similarity between spectra and genomic relationship matrices was stable across tissues or years, but variable across populations, with co-inertia around 0.3 and 0.6 for diversity panel and half-diallel populations, respectively. Applying a mixed model on spectra data increased phenomic predictive ability, while using spectra collected on wood or leaves from one year or another had less impact. Differences between populations were also observed for predictive ability of phenomic prediction, with an average of 0.27 for the diversity panel and 0.35 for the half-diallel. For both populations, a significant positive correlation was found across traits between predictive ability of genomic and phenomic predictions.
CONCLUSION NIRS is a new low-cost alternative to genotyping for predicting complex traits in perennial species such as grapevine. Having spectra and phenotypes from different years allowed us to exclude genotype-by-environment interactions and confirms that phenomic prediction can rely only on genetics.
Affiliation(s)
- Charlotte Brault
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
  - UMT Geno-Vigne®, IFV, INRAE, Institut Agro Montpellier, 34398, Montpellier, France
  - Institut Français de la vigne et du vin, 34398, Montpellier, France
- Juliette Lazerges
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
  - UMT Geno-Vigne®, IFV, INRAE, Institut Agro Montpellier, 34398, Montpellier, France
- Agnès Doligez
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
  - UMT Geno-Vigne®, IFV, INRAE, Institut Agro Montpellier, 34398, Montpellier, France
- Miguel Thomas
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
  - UMT Geno-Vigne®, IFV, INRAE, Institut Agro Montpellier, 34398, Montpellier, France
- Martin Ecarnot
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
- Pierre Roumet
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
- Yves Bertrand
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
  - UMT Geno-Vigne®, IFV, INRAE, Institut Agro Montpellier, 34398, Montpellier, France
- Gilles Berger
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
  - UMT Geno-Vigne®, IFV, INRAE, Institut Agro Montpellier, 34398, Montpellier, France
- Thierry Pons
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
  - UMT Geno-Vigne®, IFV, INRAE, Institut Agro Montpellier, 34398, Montpellier, France
- Pierre François
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
  - UMT Geno-Vigne®, IFV, INRAE, Institut Agro Montpellier, 34398, Montpellier, France
- Loïc Le Cunff
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
  - UMT Geno-Vigne®, IFV, INRAE, Institut Agro Montpellier, 34398, Montpellier, France
  - Institut Français de la vigne et du vin, 34398, Montpellier, France
- Patrice This
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France
  - UMT Geno-Vigne®, IFV, INRAE, Institut Agro Montpellier, 34398, Montpellier, France
- Vincent Segura
  - UMR AGAP Institut, Univ Montpellier, CIRAD, INRAE, Institut Agro Montpellier, Montpellier, 34398, France.
  - UMT Geno-Vigne®, IFV, INRAE, Institut Agro Montpellier, 34398, Montpellier, France.
6
Speller J, Staerk C, Mayr A. Robust statistical boosting with quantile-based adaptive loss functions. Int J Biostat 2022:ijb-2021-0127. [PMID: 35950232 DOI: 10.1515/ijb-2021-0127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2021] [Accepted: 06/20/2022] [Indexed: 11/15/2022]
Abstract
We combine robust loss functions with statistical boosting algorithms in an adaptive way to perform variable selection and predictive modelling for potentially high-dimensional biomedical data. To achieve robustness against outliers in the outcome variable (vertical outliers), we consider different composite robust loss functions together with base-learners for linear regression. For composite loss functions, such as the Huber loss and the Bisquare loss, a threshold parameter has to be specified that controls the robustness. In the context of boosting algorithms, we propose an approach that adapts the threshold parameter of composite robust losses in each iteration to the current sizes of residuals, based on a fixed quantile level. We compared the performance of our approach to classical M-regression, boosting with standard loss functions or the lasso regarding prediction accuracy and variable selection in different simulated settings: the adaptive Huber and Bisquare losses led to a better performance when the outcome contained outliers or was affected by specific types of corruption. For non-corrupted data, our approach yielded a similar performance to boosting with the efficient L2 loss or the lasso. Also in the analysis of skewed KRT19 protein expression data based on gene expression measurements from human cancer cell lines (NCI-60 cell line panel), boosting with the new adaptive loss functions performed favourably compared to standard loss functions or competing robust approaches regarding prediction accuracy and resulted in very sparse models.
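The quantile-based threshold idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default quantile level `tau=0.9`, the step length, and the use of component-wise least-squares base-learners are all assumptions made for illustration. In each iteration the Huber threshold is reset to the `tau`-quantile of the current absolute residuals before the negative gradient is computed.

```python
import numpy as np

def adaptive_huber_negative_gradient(y, fitted, tau=0.9):
    """Negative gradient of the Huber loss, with the threshold set to the
    tau-quantile of the current absolute residuals (hypothetical helper)."""
    residuals = y - fitted
    delta = np.quantile(np.abs(residuals), tau)
    # Residuals inside the threshold pass through; larger ones are clipped.
    grad = np.where(np.abs(residuals) <= delta,
                    residuals,
                    delta * np.sign(residuals))
    return grad, delta

def boost_linear(X, y, n_iter=200, step=0.1, tau=0.9):
    """Component-wise gradient boosting with simple linear base-learners
    (an assumed setup, chosen to keep the sketch short)."""
    n, p = X.shape
    intercept = np.median(y)              # robust starting value
    coef = np.zeros(p)
    fitted = np.full(n, intercept)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        u, _ = adaptive_huber_negative_gradient(y, fitted, tau)
        betas = X.T @ u / col_ss          # least-squares fit of u on each column
        rss = ((u[:, None] - X * betas) ** 2).sum(axis=0)
        j = int(np.argmin(rss))           # best-fitting single predictor
        coef[j] += step * betas[j]
        fitted += step * betas[j] * X[:, j]
    return intercept, coef
```

Because only one coefficient is updated per iteration, stopping early leaves many coefficients at exactly zero, which is how boosting of this kind performs variable selection.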
Affiliation(s)
- Jan Speller
  - Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany
- Christian Staerk
  - Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany
- Andreas Mayr
  - Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany
7
Wang J, Safo SE. Deep IDA: A Deep Learning Method for Integrative Discriminant Analysis of Multi-View Data with Feature Ranking-An Application to COVID-19 severity. arXiv 2021:arXiv:2111.09964v2. [PMID: 34815984 PMCID: PMC8609900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Revised: 11/24/2021] [Indexed: 12/27/2022]
Abstract
COVID-19 severity is due to complications from SARS-CoV-2 but the clinical course of the infection varies for individuals, emphasizing the need to better understand the disease at the molecular level. We use clinical and multiple molecular data (or views) obtained from patients with and without COVID-19 who were (or were not) admitted to the intensive care unit to shed light on COVID-19 severity. Methods for jointly associating the views and separating the COVID-19 groups (i.e., one-step methods) have focused on linear relationships. The relationships between the views and COVID-19 patient groups, however, are too complex to be understood solely by linear methods. Existing nonlinear one-step methods cannot be used to identify signatures to aid in our understanding of the complexity of the disease. We propose Deep IDA (Integrative Discriminant Analysis) to address analytical challenges in our problem of interest. Deep IDA learns nonlinear projections of two or more views that maximally associate the views and separate the classes in each view, and permits feature ranking for interpretable findings. Our applications demonstrate that Deep IDA has competitive classification rates compared to other state-of-the-art methods and is able to identify molecular signatures that facilitate an understanding of COVID-19 severity.
Affiliation(s)
- Jiuzhou Wang
  - Division of Biostatistics, University of Minnesota, MN
- Sandra E Safo
  - Division of Biostatistics, University of Minnesota, MN
8
Integrated proteomic and transcriptomic profiling identifies aberrant gene and protein expression in the sarcomere, mitochondrial complex I, and the extracellular matrix in Warmblood horses with myofibrillar myopathy. BMC Genomics 2021; 22:438. [PMID: 34112090 PMCID: PMC8194174 DOI: 10.1186/s12864-021-07758-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/01/2021] [Accepted: 05/26/2021] [Indexed: 02/06/2023]
Abstract
Background Myofibrillar myopathy in humans causes protein aggregation, degeneration, and weakness of skeletal muscle. In horses, myofibrillar myopathy is a late-onset disease of unknown origin characterized by poor performance, atrophy, myofibrillar disarray, and desmin aggregation in skeletal muscle. This study evaluated molecular and ultrastructural signatures of myofibrillar myopathy in Warmblood horses through gluteal muscle tandem-mass-tag quantitative proteomics (5 affected, 4 control), mRNA-sequencing (8 affected, 8 control), amalgamated gene ontology analyses, and immunofluorescent and electron microscopy. Results We identified 93/1533 proteins and 47/27,690 genes that were significantly differentially expressed. The top significantly differentially expressed protein CSRP3 and three other differentially expressed proteins, including PDLIM3, SYNPO2, and SYNPOL2, are integrally involved in Z-disc signaling, gene transcription and subsequently sarcomere integrity. Through immunofluorescent staining, both desmin aggregates and CSRP3 were localized to type 2A fibers. The highest differentially expressed gene CHAC1, whose protein product degrades glutathione, is associated with oxidative stress and apoptosis. Amalgamated transcriptomic and proteomic gene ontology analyses identified 3 enriched cellular locations: the sarcomere (Z-disc & I-band), mitochondrial complex I and the extracellular matrix, which corresponded to ultrastructural Z-disc disruption and mitochondrial cristae alterations found with electron microscopy. Conclusions A combined proteomic and transcriptomic analysis highlighted three enriched cellular locations that correspond with MFM ultrastructural pathology in Warmblood horses. Aberrant Z-disc mechano-signaling, impaired Z-disc stability, decreased mitochondrial complex I expression, and a pro-oxidative cellular environment are hypothesized to contribute to the development of myofibrillar myopathy in Warmblood horses. These molecular signatures may provide further insight into diagnostic biomarkers, treatments, and the underlying pathophysiology of MFM. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07758-0.
9
TSCCA: A tensor sparse CCA method for detecting microRNA-gene patterns from multiple cancers. PLoS Comput Biol 2021; 17:e1009044. [PMID: 34061840 PMCID: PMC8195367 DOI: 10.1371/journal.pcbi.1009044] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 07/05/2020] [Revised: 06/11/2021] [Accepted: 05/05/2021] [Indexed: 12/22/2022]
Abstract
Existing studies have demonstrated that dysregulation of microRNAs (miRNAs or miRs) is involved in the initiation and progression of cancer. Many efforts have been devoted to identifying microRNAs as potential biomarkers for cancer diagnosis, prognosis and therapeutic targets. With the rapid development of miRNA sequencing technology, a vast amount of miRNA expression data for multiple cancers has been collected. These invaluable data repositories provide new paradigms to explore the relationship between miRNAs and cancer. Thus, there is an urgent need to explore the complex cancer-related miRNA-gene patterns by integrating multi-omics data in a pan-cancer paradigm. In this study, we present a tensor sparse canonical correlation analysis (TSCCA) method for identifying cancer-related miRNA-gene modules across multiple cancers. TSCCA is able to overcome the drawbacks of existing solutions and capture both the cancer-shared and specific miRNA-gene co-expressed modules with better biological interpretations. We comprehensively evaluate the performance of TSCCA using a set of simulated data and matched miRNA/gene expression data across 33 cancer types from the TCGA database. We uncover several dysfunctional miRNA-gene modules with important biological functions and statistical significance. These modules can advance our understanding of miRNA regulatory mechanisms of cancer and provide insights into miRNA-based treatments for cancer.
MicroRNAs (miRNAs) are a class of small non-coding RNAs. Previous studies have revealed that miRNA-gene regulatory modules play key roles in the occurrence and development of cancer. However, little has been done to discover miRNA-gene regulatory modules from a pan-cancer view. Thus, new methods are urgently needed to explore the complex cancer-related miRNA-gene patterns by integrating multi-omics data of multiple cancers. To build the connections between miRNA-gene regulatory modules across different cancer types, we propose a tensor sparse canonical correlation analysis (TSCCA) method. Our specific contributions are two-fold: (1) We propose a sparse statistical learning model TSCCA and an efficient block-coordinate descent algorithm to solve it. (2) We apply TSCCA to a multi-omics data set of 33 cancer types from TCGA and identify some cancer-related miRNA-gene modules with important biological functions and statistical significance.
10
Wu M, Yi H, Ma S. Vertical integration methods for gene expression data analysis. Brief Bioinform 2021; 22:bbaa169. [PMID: 32793970 PMCID: PMC8138889 DOI: 10.1093/bib/bbaa169] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/24/2020] [Revised: 06/18/2020] [Accepted: 07/04/2020] [Indexed: 12/12/2022]
Abstract
Gene expression data have played an essential role in many biomedical studies. When the number of genes is large and sample size is limited, there is a 'lack of information' problem, leading to low-quality findings. To tackle this problem, both horizontal and vertical data integrations have been developed, where vertical integration methods collectively analyze data on gene expressions as well as their regulators (such as mutations, DNA methylation and miRNAs). In this article, we conduct a selective review of vertical data integration methods for gene expression data. The reviewed methods cover both marginal and joint analysis and supervised and unsupervised analysis. The main goal is to provide a sketch of the vertical data integration paradigm without digging into too many technical details. We also briefly discuss potential pitfalls, directions for future developments and application notes.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics
- Huangdi Yi
  - Department of Biostatistics at Yale University
- Shuangge Ma
  - Department of Biostatistics at Yale University
11
Safo SE, Min EJ, Haine L. Sparse linear discriminant analysis for multiview structured data. Biometrics 2021; 78:612-623. [PMID: 33739448 DOI: 10.1111/biom.13458] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/05/2020] [Revised: 02/15/2021] [Accepted: 03/04/2021] [Indexed: 11/28/2022]
Abstract
Classification methods that leverage the strengths of data from multiple sources (multiview data) simultaneously have enormous potential to yield more powerful findings than two-step methods: association followed by classification. We propose two methods, sparse integrative discriminant analysis (SIDA), and SIDA with incorporation of network information (SIDANet), for joint association and classification studies. The methods consider the overall association between multiview data, and the separation within each view in choosing discriminant vectors that are associated and optimally separate subjects into different classes. SIDANet is among the first methods to incorporate prior structural information in joint association and classification studies. It uses the normalized Laplacian of a graph to smooth coefficients of predictor variables, thus encouraging selection of predictors that are connected. We demonstrate the effectiveness of our methods on a set of synthetic datasets and explore their use in identifying potential nontraditional risk factors that discriminate healthy patients at low versus high risk for developing atherosclerotic cardiovascular disease in 10 years. Our findings underscore the benefit of joint association and classification methods if the goal is to correlate multiview data and to perform classification.
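To make the Laplacian smoothing idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the SIDANet penalty itself: the function names are hypothetical, and the quadratic form shown is a common graph-smoothness penalty that may differ from the exact form used in the paper. For a symmetric adjacency matrix A with degrees d, the normalized Laplacian is L = I - D^{-1/2} A D^{-1/2}, and the quadratic form b'Lb equals the sum over edges of w_ij (b_i/sqrt(d_i) - b_j/sqrt(d_j))^2, so it is small when connected predictors have similar degree-scaled coefficients.

```python
import numpy as np

def normalized_laplacian(adj):
    """Normalized Laplacian I - D^{-1/2} A D^{-1/2} of a symmetric adjacency
    matrix; rows and columns of isolated nodes are set to zero."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    connected = deg > 0
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[connected] = 1.0 / np.sqrt(deg[connected])
    return np.diag(connected.astype(float)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]

def smoothness_penalty(beta, adj):
    """Quadratic form beta' L beta (hypothetical helper for illustration)."""
    beta = np.asarray(beta, dtype=float)
    return float(beta @ normalized_laplacian(adj) @ beta)

# Path graph 1 - 2 - 3 with node degrees (1, 2, 1).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
smooth = np.array([1.0, np.sqrt(2.0), 1.0])    # equal after degree scaling
rough = np.array([1.0, -np.sqrt(2.0), 1.0])    # sign flips across both edges
print(smoothness_penalty(smooth, adj), smoothness_penalty(rough, adj))
```

The first coefficient vector incurs essentially no penalty and the second a large one, which is the sense in which such a penalty favors selecting connected predictors together.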
Affiliation(s)
- Sandra E Safo
  - Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
- Eun Jeong Min
  - Department of Medical Life Sciences, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
- Lillian Haine
  - Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
12
Abstract
In recent biomedical studies, multidimensional profiling, which collects proteomics as well as other types of omics data on the same subjects, is becoming increasingly popular. Proteomics, transcriptomics, genomics, epigenomics, and other types of data contain overlapping as well as independent information, which suggests the possibility of integrating multiple types of data to generate more reliable findings/models with better classification/prediction performance. In this chapter, a selective review is conducted on recent data integration techniques for both unsupervised and supervised analysis. The main objective is to provide the "big picture" of data integration that involves proteomics data and discuss the "intuition" beneath the recently developed approaches without invoking too many mathematical details. Potential pitfalls and possible directions for future developments are also discussed.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Yu Jiang
  - School of Public Health, University of Memphis, Memphis, TN, USA
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, USA.
13
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Prog Mol Biol Transl Sci 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach. Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
- Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States.
|
14
|
Min EJ, Long Q. Sparse multiple co-Inertia analysis with application to integrative analysis of multi-Omics data. BMC Bioinformatics 2020; 21:141. [PMID: 32293260 PMCID: PMC7157996 DOI: 10.1186/s12859-020-3455-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 10/10/2019] [Accepted: 03/13/2020] [Indexed: 01/28/2023] Open
Abstract
Background: Multiple co-inertia analysis (mCIA) is a multivariate analysis method that can assess relationships and trends in multiple datasets. Recently it has been used for integrative analysis of multiple high-dimensional -omics datasets. However, its estimated loading vectors are non-sparse, which presents challenges for identifying important features and interpreting analysis results. We propose two new mCIA methods: 1) a sparse mCIA method that produces sparse loading estimates and 2) a structured sparse mCIA method that further enables incorporation of structural information among variables such as those from functional genomics. Results: Our extensive simulation studies demonstrate the superior performance of the sparse mCIA and structured sparse mCIA methods compared to the existing mCIA in terms of feature selection and estimation accuracy. Application to the integrative analysis of transcriptomics data and proteomics data from a cancer study identified biomarkers that the literature suggests are related to cancer. Conclusion: The proposed sparse mCIA achieves simultaneous model estimation and feature selection and yields analysis results that are more interpretable than those of the existing mCIA. Furthermore, the proposed structured sparse mCIA can effectively incorporate prior network information among genes, resulting in improved feature selection and enhanced interpretability.
Affiliation(s)
- Eun Jeong Min
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, 423 Guardian Dr, Philadelphia, 19104, USA
- Qi Long
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, 423 Guardian Dr, Philadelphia, 19104, USA.
|
15
|
Jiang D, Armour CR, Hu C, Mei M, Tian C, Sharpton TJ, Jiang Y. Microbiome Multi-Omics Network Analysis: Statistical Considerations, Limitations, and Opportunities. Front Genet 2019; 10:995. [PMID: 31781153 PMCID: PMC6857202 DOI: 10.3389/fgene.2019.00995] [Citation(s) in RCA: 83] [Impact Index Per Article: 16.6] [Received: 02/13/2019] [Accepted: 09/18/2019] [Indexed: 12/21/2022] Open
Abstract
The advent of large-scale microbiome studies affords newfound analytical opportunities to understand how these communities of microbes operate and relate to their environment. However, the analytical methodology needed to model microbiome data and integrate them with other data constructs remains nascent. This emergent analytical toolset frequently ports over techniques developed in other multi-omics investigations, especially the growing array of statistical and computational techniques for integrating and representing data through networks. While network analysis has emerged as a powerful approach to modeling microbiome data, oftentimes by integrating these data with other types of omics data to discern their functional linkages, it is not always evident if the statistical details of the approach being applied are consistent with the assumptions of microbiome data or how they impact data interpretation. In this review, we overview some of the most important network methods for integrative analysis, with an emphasis on methods that have been applied or have great potential to be applied to the analysis of multi-omics integration of microbiome data. We compare advantages and disadvantages of various statistical tools, assess their applicability to microbiome data, and discuss their biological interpretability. We also highlight on-going statistical challenges and opportunities for integrative network analysis of microbiome data.
Affiliation(s)
- Duo Jiang
- Department of Statistics, Oregon State University, Corvallis, OR, United States
- Courtney R Armour
- Department of Microbiology, Oregon State University, Corvallis, OR, United States
- Chenxiao Hu
- Department of Statistics, Oregon State University, Corvallis, OR, United States
- Meng Mei
- Department of Statistics, Oregon State University, Corvallis, OR, United States
- Chuan Tian
- Department of Statistics, Oregon State University, Corvallis, OR, United States
- Thomas J Sharpton
- Department of Statistics, Oregon State University, Corvallis, OR, United States
- Department of Microbiology, Oregon State University, Corvallis, OR, United States
- Yuan Jiang
- Department of Statistics, Oregon State University, Corvallis, OR, United States
|
16
|
Piazzese D, Bonanno A, Bongiorno D, Falco F, Indelicato S, Milisenda G, Vazzana I, Cammarata M. Co-inertia multivariate approach for the evaluation of anthropogenic impact on two commercial fish along Tyrrhenian coasts. Ecotoxicology and Environmental Safety 2019; 182:109435. [PMID: 31326728 DOI: 10.1016/j.ecoenv.2019.109435] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/15/2019] [Revised: 07/05/2019] [Accepted: 07/08/2019] [Indexed: 06/10/2023]
Abstract
Aliphatic hydrocarbon levels were determined by the GC/MS technique in fish livers of Engraulis encrasicolus (Ee) and Trachurus trachurus (Tt), collected from an area of the Mediterranean Sea called GSA 10, which is located in the Tyrrhenian Sea between the Campania coast and the North Sicily coast. The aim was to evaluate their potential use as specific bioindicators towards this class of contaminants. Both Tt and Ee are considered to be pollution monitoring bioindicators, due to their dominance in marine communities and economic fishing interest. Ee showed a higher tendency to bioaccumulate TAHs, due to the lower quantity of fatty acids in liver tissues with respect to Tt. The area under study has been characterised a) chemically, with the acquisition of temperature, oxygen and salinity profiles along the water column, and b) ecologically, with the determination of amino acid contents in fish eyes, in order to gain information on the adaptation to environmental changes. Moreover, specific activities of two hydrolytic enzymes, alkaline phosphatase and peroxidase, in fish epidermal mucus, together with lactate in blood plasma and cortisol levels, have been investigated for the first time, in order to obtain insights into the effects of hydrocarbons on animal welfare. A multiple co-inertia analysis was also applied to chemical and environmental parameters, in order to explore any possible correlation between different variables. The multivariate approach showed a clear spatial distribution between environmental and chemical variables in Ee, whilst there was an absence of a spatial trend in Tt. Moreover, the chemometric analysis showed a very high correlation between amino acid profiles and environmental variables for both species, confirming that these profiles could be used as ecological welfare indices for short-term environmental variations.
Affiliation(s)
- Daniela Piazzese
- Dipartimento di Scienze della Terra e del Mare, Università degli Studi di Palermo, Via Archirafi 26, 90123, Palermo, Italy.
- Angelo Bonanno
- Istituto per lo studio degli impatti Antropici e Sostenibilità in ambiente marino (IAS-CNR) Consiglio Nazionale delle Ricerche Via del Mare, 3, Torretta Granitola - Campobello di Mazara, 91021, TP, Italy
- David Bongiorno
- Dipartimento di Scienze e Tecnologie Biologiche Chimiche e Farmaceutiche, Viale delle Scienze, Ed. 16, 90128, Palermo, Italy
- Francesca Falco
- Istituto per le Risorse Biologiche e le Biotecnologie Marine (IRBIM), sede di Mazara del Vallo, via Luigi Vaccara, 65 Mazara del Vallo, TP, Italy
- Serena Indelicato
- Dipartimento di Scienze e Tecnologie Biologiche Chimiche e Farmaceutiche, Viale delle Scienze, Ed. 16, 90128, Palermo, Italy
- Giacomo Milisenda
- Stazione Zoologica "Anton Dohrn" - Centro interdipartimentale Sicilia, Lungomare Cristoforo Colombo (ex complesso Roosvelt) I, Palermo, 90142, Italy
- Irene Vazzana
- Istituto Zooprofilattico della Sicilia, Via G. Marinuzzi 3, 90129, Palermo, Italy
- Matteo Cammarata
- Dipartimento di Scienze della Terra e del Mare, Università degli Studi di Palermo, Via Archirafi 26, 90123, Palermo, Italy
|
17
|
Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A Selective Review of Multi-Level Omics Data Integration Using Variable Selection. High Throughput 2019; 8:E4. [PMID: 30669303 PMCID: PMC6473252 DOI: 10.3390/ht8010004] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Received: 11/22/2018] [Revised: 12/24/2018] [Accepted: 01/10/2019] [Indexed: 01/02/2023] Open
Abstract
High-throughput technologies have been used to generate a large amount of omics data. In the past, single-level analysis has been extensively conducted, where the omics measurements at different levels, including mRNA, microRNA, CNV and DNA methylation, are analyzed separately. As the molecular complexity of disease etiology exists at all different levels, integrative analysis offers an effective way to borrow strength across multi-level omics data and can be more powerful than single-level analysis. In this article, we focus on reviewing existing multi-omics integration studies, paying special attention to variable selection methods. We first summarize published reviews on integrating multi-level omics data. Next, after a brief overview of variable selection methods, we review existing supervised, semi-supervised and unsupervised integrative analyses within parallel and hierarchical integration studies, respectively. The strengths and limitations of the methods are discussed in detail. No existing integration method can dominate the rest. The computational aspects are also investigated. The review concludes with possible limitations and future directions for multi-level omics data integration.
Affiliation(s)
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Jie Ren
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Xiaoxi Li
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN 38152, USA.
- Shuangge Ma
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06510, USA.
|