1
Tuly SR, Ranjbari S, Murat EA, Arslanturk S. From Silos to Synthesis: A comprehensive review of domain adaptation strategies for multi-source data integration in healthcare. Comput Biol Med 2025; 191:110108. [PMID: 40209575 DOI: 10.1016/j.compbiomed.2025.110108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2024] [Revised: 03/27/2025] [Accepted: 03/27/2025] [Indexed: 04/12/2025]
Abstract
BACKGROUND The integration of data from diverse sources is not only crucial for addressing data scarcity in health informatics but also enables the use of complementary information from multiple datasets. However, the isolated nature of data collected from disparate sources (referred to as 'Silos') presents significant challenges in multi-source data integration due to inherent heterogeneity and differences in data structures, formats, and standards. Domain adaptation emerges as a key framework to transition from 'Silos' to 'Synthesis' by measuring and mitigating such discrepancies, enabling uniform representation and harmonization of multi-source data. METHODS This study explores different approaches to healthcare data integration, highlighting the challenges associated with each type and discussing both general-purpose and healthcare-specific adaptation methods. We examine key research challenges and evaluate leading domain adaptation approaches, demonstrating their effectiveness and limitations in advancing healthcare data integration. RESULTS The findings highlight the potential of domain adaptation methods to significantly improve healthcare data integration while laying a foundation for future research. CONCLUSION Current research often lacks a comprehensive analysis of how domain adaptation can effectively address the challenges associated with integrating multi-source and multi-modal healthcare datasets. This study serves as a valuable resource for healthcare professionals and researchers, providing guidance on leveraging domain adaptation techniques to mitigate domain discrepancies in healthcare data integration.
Affiliation(s)
- Shelia Rahman Tuly
  - Department of Computer Science, Wayne State University, 5057 Woodward Ave, Detroit, 48201, MI, USA.
- Sima Ranjbari
  - Department of Computer Science, Wayne State University, 5057 Woodward Ave, Detroit, 48201, MI, USA.
- Ekrem Alper Murat
  - Department of Industrial and Systems Engineering, Wayne State University, 4th Street, Detroit, 48201, MI, USA.
- Suzan Arslanturk
  - Department of Computer Science, Wayne State University, 5057 Woodward Ave, Detroit, 48201, MI, USA.
2
Zhang C, Liu L, Zhang Y, Li M, Fang S, Kang Q, Chen A, Xu X, Zhang Y, Li Y. spatiAlign: an unsupervised contrastive learning model for data integration of spatially resolved transcriptomics. Gigascience 2024; 13:giae042. [PMID: 39028588 PMCID: PMC11258913 DOI: 10.1093/gigascience/giae042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2023] [Revised: 01/08/2024] [Accepted: 06/21/2024] [Indexed: 07/21/2024]
Abstract
BACKGROUND Integrative analysis of spatially resolved transcriptomics datasets empowers a deeper understanding of complex biological systems. However, integrating multiple tissue sections presents challenges for batch effect removal, particularly when the sections are measured by various technologies or collected at different times. FINDINGS We propose spatiAlign, an unsupervised contrastive learning model that employs the expression of all measured genes and the spatial location of cells, to integrate multiple tissue sections. It enables the joint downstream analysis of multiple datasets not only in low-dimensional embeddings but also in the reconstructed full expression space. CONCLUSIONS In benchmarking analysis, spatiAlign outperforms state-of-the-art methods in learning joint and discriminative representations for tissue sections, each potentially characterized by complex batch effects or distinct biological characteristics. Furthermore, we demonstrate the benefits of spatiAlign for the integrative analysis of time-series brain sections, including spatial clustering, differential expression analysis, and particularly trajectory inference that requires a corrected gene expression matrix.
Affiliation(s)
- Lin Liu
  - BGI Research, Shenzhen 518083, China
- Mei Li
  - BGI Research, Shenzhen 518083, China
- Shuangsang Fang
  - BGI Research, Shenzhen 518083, China
  - BGI Research, Beijing 102601, China
- Ao Chen
  - BGI Research, Shenzhen 518083, China
  - BGI Research, Chongqing 401329, China
- Xun Xu
  - BGI Research, Wuhan 430074, China
- Yong Zhang
  - BGI Research, Shenzhen 518083, China
  - BGI Research, Wuhan 430074, China
  - Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China
- Yuxiang Li
  - BGI Research, Shenzhen 518083, China
  - BGI Research, Wuhan 430074, China
  - Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China
3
Adamer MF, Brüningk SC, Tejada-Arranz A, Estermann F, Basler M, Borgwardt K. reComBat: batch-effect removal in large-scale multi-source gene-expression data integration. Bioinformatics Advances 2022; 2:vbac071. [PMID: 36699372 PMCID: PMC9710604 DOI: 10.1093/bioadv/vbac071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2022] [Revised: 09/01/2022] [Accepted: 09/26/2022] [Indexed: 01/28/2023]
Abstract
Motivation: With the steadily increasing abundance of omics data produced all over the world under vastly different experimental conditions and residing in public databases, a crucial step in many data-driven bioinformatics applications is data integration. The challenge of batch-effect removal for entire databases lies in the large number of batches and biological variation, which can result in design matrix singularity. This problem cannot currently be solved satisfactorily by any common batch-correction algorithm. Results: We present reComBat, a regularized version of the empirical Bayes method, to overcome this limitation and benchmark it against popular approaches for the harmonization of public gene-expression data (both microarray and bulk RNA-seq) of the human opportunistic pathogen Pseudomonas aeruginosa. Batch effects are successfully mitigated while biologically meaningful gene-expression variation is retained. reComBat fills the gap in batch-correction approaches applicable to large-scale, public omics databases and opens up new avenues for data-driven analysis of complex biological processes beyond the scope of a single study. Availability and implementation: The code is available at https://github.com/BorgwardtLab/reComBat; all data and evaluation code can be found at https://github.com/BorgwardtLab/batchCorrectionPublicData. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Marek Basler
  - Biozentrum, University of Basel, Basel 4056, Switzerland
- Karsten Borgwardt
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel 4058, Switzerland
  - Swiss Institute for Bioinformatics (SIB), Lausanne 1015, Switzerland
4
Zhang X, Ye Z, Chen J, Qiao F. AMDBNorm: an approach based on distribution adjustment to eliminate batch effects of gene expression data. Brief Bioinform 2021; 23:6485011. [PMID: 34958674 DOI: 10.1093/bib/bbab528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2021] [Revised: 10/16/2021] [Accepted: 11/14/2021] [Indexed: 11/14/2022]
Abstract
Batch effects explain a large part of the noise when merging gene expression data. Removing irrelevant variations introduced by batch effects plays an important role in gene expression studies. To obtain reliable differential analysis results, it is necessary to remove the variation caused by technical conditions between different batches while preserving biological variation. Usually, merging data directly with batch effects leads to a sharp rise in false positives. Although some methods of batch correction have been developed, they have some drawbacks. In this study, we develop a new algorithm, adjustment mean distribution-based normalization (AMDBNorm), which is based on a probability distribution to correct batch effects while preserving biological variation. AMDBNorm solves the defects of the existing batch correction methods. We compared several popular methods of batch correction with AMDBNorm using two real gene expression datasets with batch effects and analyzed the results of batch correction from the visual and quantitative perspectives. To ensure the biological variation was well protected, the effects of the batch correction methods were verified by hierarchical cluster analysis. The results showed that the AMDBNorm algorithm could remove batch effects of gene expression data effectively and retain more biological variation than other methods. Our approach provides the researchers with reliable data support in the study of differential gene expression analysis and prognostic biomarker selection.
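To make the idea of removing technical between-batch variation concrete, the following is a minimal Python sketch of the simplest baseline: shifting each batch so that its per-gene mean matches the overall per-gene mean. This is not AMDBNorm, whose distribution-based adjustment is not specified in the abstract; the function name `center_batches` and the toy matrix are hypothetical and serve only as an illustration.

```python
from statistics import mean

def center_batches(expr, batches):
    """Per-batch mean-centering baseline (an assumption, not AMDBNorm).

    expr:    list of samples, each a list of gene expression values
    batches: list of batch labels, one per sample
    Returns a new matrix in which every batch has the same per-gene mean.
    """
    n_genes = len(expr[0])
    # Overall mean of each gene across all samples.
    grand = [mean(row[g] for row in expr) for g in range(n_genes)]
    corrected = [list(row) for row in expr]
    for b in set(batches):
        idx = [i for i, label in enumerate(batches) if label == b]
        for g in range(n_genes):
            # Shift this batch so its gene mean equals the overall mean.
            shift = grand[g] - mean(expr[i][g] for i in idx)
            for i in idx:
                corrected[i][g] += shift
    return corrected

# Hypothetical toy data: 4 samples x 2 genes; batch "B" is offset by +10.
expr = [[1.0, 2.0], [3.0, 4.0], [11.0, 12.0], [13.0, 14.0]]
batches = ["A", "A", "B", "B"]
corrected = center_batches(expr, batches)
```

Within-batch differences between samples are left untouched, but any biological signal that is confounded with batch membership is removed along with the technical offset. That trade-off is the reason more careful methods aim to preserve biological variation while correcting batch effects.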
Affiliation(s)
- Xu Zhang
  - School of Mathematics and Statistics, Southwest University, China
- Jing Chen
  - School of Science, Southwest University of Science and Technology, China
5
Abstract
Carrying out large multicenter studies is a key step towards a faster transfer of the radiomics approach to the clinical setting. This requires large-scale radiomics data analysis, hence the need for integrating radiomic features extracted from images acquired in different centers. This is challenging because radiomic features exhibit variable sensitivity to differences in scanner model, acquisition protocols and reconstruction settings, which is similar to the so-called 'batch effects' in genomics studies. In this review we discuss existing methods that perform data integration with the aim of reducing the unwanted variation associated with batch effects. We also discuss the potential future role of deep learning methods in providing solutions for multicenter radiomic studies.
Affiliation(s)
- R Da-Ano
  - LaTiM, INSERM, UMR 1101, Univ Brest, Brest, France
- D Visvikis
  - LaTiM, INSERM, UMR 1101, Univ Brest, Brest, France
  - equally contributed
- M Hatt
  - LaTiM, INSERM, UMR 1101, Univ Brest, Brest, France
  - equally contributed
6
A Novel Statistical Method to Diagnose, Quantify and Correct Batch Effects in Genomic Studies. Sci Rep 2017; 7:10849. [PMID: 28883548 PMCID: PMC5589920 DOI: 10.1038/s41598-017-11110-6] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 01/20/2017] [Accepted: 08/18/2017] [Indexed: 01/01/2023]
Abstract
Genome projects now generate large-scale data, often produced at various time points by different laboratories using multiple platforms. This increases the potential for batch effects. Several batch evaluation methods currently exist, such as principal component analysis (PCA; mostly based on visual inspection), but they sometimes fail to reveal all of the underlying batch effects. These methods also carry the risk of unintentionally correcting biologically interesting factors that have been attributed to batch effects. Here we propose a novel statistical method, finding batch effect (findBATCH), to evaluate batch effects based on probabilistic principal component and covariates analysis (PPCCA). The same framework also provides a new approach to batch correction, correcting batch effect (correctBATCH), which we have shown to be a better approach than traditional PCA-based correction. We demonstrate the utility of these methods using two different examples (breast and colorectal cancers) by merging gene expression data from different studies after diagnosing and correcting for batch effects while retaining the biological effects. These methods, along with conventional visual inspection-based PCA, are available as part of an R package, exploring batch effect (exploBATCH; https://github.com/syspremed/exploBATCH).
7
Lazar C, Meganck S, Taminau J, Steenhoff D, Coletta A, Molter C, Weiss-Solís DY, Duque R, Bersini H, Nowé A. Batch effect removal methods for microarray gene expression data integration: a survey. Brief Bioinform 2012; 14:469-90. [PMID: 22851511 DOI: 10.1093/bib/bbs037] [Citation(s) in RCA: 216] [Impact Index Per Article: 16.6] [Indexed: 12/31/2022]
Abstract
Genomic data integration is a key step towards large-scale genomic data analysis. This process is very challenging because of the diverse sources of information resulting from genomics experiments. In this work, we review methods designed to combine genomic data recorded from microarray gene expression (MAGE) experiments. It has been acknowledged that the main source of variation between different MAGE datasets is the so-called 'batch effects'. The methods reviewed here perform data integration by removing (or, more precisely, attempting to remove) the unwanted variation associated with batch effects. They are presented in a unified framework together with a wide range of evaluation tools, which are mandatory in assessing the efficiency and the quality of the data integration process. We provide a systematic description of the MAGE data integration methodology together with some basic recommendations to help users choose the appropriate tools to integrate MAGE data for large-scale analysis, and to evaluate those tools from different perspectives in order to quantify their efficiency. All genomic data used in this study for illustration purposes were retrieved from InSilicoDB http://insilico.ulb.ac.be.
Affiliation(s)
- Cosmin Lazar
  - Como, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium.
8
Chang H, Rha SY, Jeung HC, Jung JJ, Kim TS, Kwon HJ, Kim BS, Chung HC. Identification of genes related to a synergistic effect of taxane and suberoylanilide hydroxamic acid combination treatment in gastric cancer cells. J Cancer Res Clin Oncol 2010; 136:1901-13. [PMID: 20217129 DOI: 10.1007/s00432-010-0849-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Received: 12/14/2009] [Accepted: 01/25/2010] [Indexed: 10/19/2022]
Abstract
PURPOSE We evaluated the cytotoxic effects of combining suberoylanilide hydroxamic acid (SAHA), a histone deacetylase inhibitor, with taxanes in human gastric cancer cell lines and assessed the pre-treatment difference of gene expression to identify genes that could potentially mediate the cytotoxic response. METHODS Gastric cancer cell lines were treated with SAHA and paclitaxel or docetaxel, and the synergistic interaction between the drugs was evaluated in vitro using the combination index (CI) method. We performed significance analysis of microarray (SAM) to identify chemosensitivity-related genes in gastric cancer cell lines that were concomitantly treated with SAHA and taxane. We generated a correlation matrix between gene expression and CI values to identify genes whose expression correlated with a combined effect of taxanes and SAHA. RESULTS Combination treatment with taxane and SAHA had a synergistic cytotoxic effect against taxane-resistant gastric cancer cells. We identified 49 chemosensitivity-related genes via SAM analysis. Among them, nine common genes (SLIT2, REEP2, EFEMP2, CDC42SE1, FSD1, POU1F1, ZNF79, ETNK1, and DOCK5) were extracted from the subsequent correlation matrix analysis. CONCLUSIONS The combination of taxane and SAHA could be efficacious for the treatment of gastric cancer. The genes that were related to the synergistic response to taxane and SAHA could serve as surrogate biomarkers to predict the therapeutic response in gastric cancer patients.
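The abstract names the combination index (CI) method and a correlation between gene expression and CI values without giving formulas. As a rough illustration, the Python sketch below assumes the widely used Chou-Talalay form, CI = d1/Dx1 + d2/Dx2, where d1 and d2 are the doses used in combination to reach a given effect and Dx1 and Dx2 are the doses of each drug alone that reach the same effect (CI < 1 suggests synergy, CI = 1 additivity, CI > 1 antagonism). A plain Pearson correlation stands in for the correlation analysis. All function names and numbers are hypothetical and are not taken from the study.

```python
from math import sqrt

def combination_index(d1, d2, dx1, dx2):
    """Chou-Talalay combination index (assumed form, see lead-in)."""
    return d1 / dx1 + d2 / dx2

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical doses: the combination needs a quarter of each single-agent dose.
ci = combination_index(d1=2.0, d2=1.0, dx1=8.0, dx2=4.0)

# Hypothetical CI values for four cell lines and one gene's pre-treatment
# expression in the same lines; a negative r means higher expression
# goes with lower CI (stronger synergy).
ci_values = [0.5, 0.75, 1.0, 1.25]
expression = [8.0, 6.0, 4.0, 2.0]
r = pearson(expression, ci_values)
```

Repeating the correlation for every gene would give one coefficient per gene, which could then be ranked or thresholded; the criteria the authors actually applied are not stated in the abstract.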
Affiliation(s)
- Hyun Chang
  - Division of Medical Oncology, Department of Internal Medicine, Yonsei Cancer Center, Yonsei University College of Medicine, #134 Shinchon-Dong, Seodaemoon-gu, Seoul 120-752, Korea