1
Feldman S, Ner-Gaon H, Treister E, Shay T. Comparison and development of cross-study normalization methods for inter-species transcriptional analysis. PLoS One 2024; 19:e0307997. [PMID: 39255285 PMCID: PMC11386461 DOI: 10.1371/journal.pone.0307997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Accepted: 07/16/2024] [Indexed: 09/12/2024] Open
Abstract
Performing joint analysis of gene expression datasets from different experiments can present challenges brought on by multiple factors: differences in equipment, protocols, climate, etc. "Cross-study normalization" is a general term for transformations aimed at eliminating such effects, thus making datasets more comparable. However, joint analysis of datasets from different species is rarely done, and there are no dedicated normalization methods for such inter-species analysis. To test the usefulness of cross-study normalization methods for inter-species analysis, we first applied three cross-study normalization methods, EB, DWD and XPN, to RNA sequencing datasets from different species. We then developed a new approach to evaluate the performance of cross-study normalization in eliminating experimental effects while maintaining the biologically significant differences between species and conditions. Our results indicate that all normalization methods performed relatively well in the cross-species setting. We found XPN to be better at reducing experimental differences, and EB to be better at preserving biological differences. Still, according to our in-silico experiments, none of these methods makes it possible to enforce the preservation of biological differences during normalization. In addition to the study above, we propose a new dedicated cross-study and cross-species normalization method. Our aim is to address the shortcoming mentioned above: in the normalization process, we wish to reduce the experimental differences while preserving the biological differences. We term our method CSN and base it on the performance evaluation criteria mentioned above. When the same experiments were repeated, the CSN method obtained a better and more balanced conservation of biological differences within the datasets compared to existing methods.
To summarize, we demonstrate the usefulness of cross-study normalization methods in the inter-species settings, and suggest a dedicated cross-study cross-species normalization method that will hopefully open the way to the development of improved normalization methods for the inter-species settings.
Affiliation(s)
- Sofya Feldman
  - Dept of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Hadas Ner-Gaon
  - Dept of Life Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Eran Treister
  - Dept of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Tal Shay
  - Dept of Life Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
2
Terranova N, Renard D, Shahin MH, Menon S, Cao Y, Hop CECA, Hayes S, Madrasi K, Stodtmann S, Tensfeldt T, Vaddady P, Ellinwood N, Lu J. Artificial Intelligence for Quantitative Modeling in Drug Discovery and Development: An Innovation and Quality Consortium Perspective on Use Cases and Best Practices. Clin Pharmacol Ther 2024; 115:658-672. [PMID: 37716910 DOI: 10.1002/cpt.3053] [Citation(s) in RCA: 21] [Impact Index Per Article: 21.0] [Received: 06/28/2023] [Accepted: 09/11/2023] [Indexed: 09/18/2023]
Abstract
Recent breakthroughs in artificial intelligence (AI) and machine learning (ML) have ushered in a new era of possibilities across various scientific domains. One area where these advancements hold significant promise is model-informed drug discovery and development (MID3). To foster wider adoption of these advanced algorithms, the Innovation and Quality (IQ) Consortium initiated the AI/ML working group in 2021 with the aim of promoting their acceptance among the broader scientific community as well as by regulatory agencies. By drawing insights from workshops organized by the working group and attended by key stakeholders across the biopharma industry, academia, and regulatory agencies, this white paper provides a perspective from the IQ Consortium. The range of applications covered in this white paper encompasses the following thematic topics: (i) AI/ML-enabled Analytics for Pharmacometrics and Quantitative Systems Pharmacology (QSP) Workflows; (ii) Explainable Artificial Intelligence and its Applications in Disease Progression Modeling; (iii) Natural Language Processing (NLP) in Quantitative Pharmacology Modeling; and (iv) AI/ML Utilization in Drug Discovery. Additionally, the paper offers a set of best practices to ensure an effective and responsible use of AI, including considering the context of use, ensuring the explainability and generalizability of models, and keeping a human in the loop. We believe that embracing the transformative power of AI in quantitative modeling while adopting a set of good practices can unlock new opportunities for innovation, increase efficiency, and ultimately bring benefits to patients.
Affiliation(s)
- Nadia Terranova
  - Quantitative Pharmacology, Merck KGaA, Lausanne, Switzerland
- Didier Renard
  - Full Development Pharmacometrics, Novartis Pharma AG, Basel, Switzerland
- Sujatha Menon
  - Clinical Pharmacology, Pfizer Inc., Groton, Connecticut, USA
- Youfang Cao
  - Clinical Pharmacology and Translational Medicine, Eisai Inc., Nutley, New Jersey, USA
- Sean Hayes
  - Quantitative Pharmacology & Pharmacometrics, Merck & Co. Inc., Rahway, New Jersey, USA
- Kumpal Madrasi
  - Modeling & Simulation, Sanofi, Bridgewater, New Jersey, USA
- Sven Stodtmann
  - Pharmacometrics, AbbVie Deutschland GmbH & Co. KG, Ludwigshafen, Germany
- Pavan Vaddady
  - Quantitative Clinical Pharmacology, Daiichi Sankyo, Inc., Basking Ridge, New Jersey, USA
- James Lu
  - Clinical Pharmacology, Genentech Inc., South San Francisco, California, USA
3
Borisov N, Tkachev V, Simonov A, Sorokin M, Kim E, Kuzmin D, Karademir-Yilmaz B, Buzdin A. Uniformly shaped harmonization combines human transcriptomic data from different platforms while retaining their biological properties and differential gene expression patterns. Front Mol Biosci 2023; 10:1237129. [PMID: 37745690 PMCID: PMC10511763 DOI: 10.3389/fmolb.2023.1237129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Accepted: 08/28/2023] [Indexed: 09/26/2023] Open
Abstract
Introduction: Co-normalization of RNA profiles obtained using different experimental platforms and protocols opens an avenue for comprehensive comparison of relevant features, such as differentially expressed genes associated with disease. Currently, most bioinformatic tools enable normalization in a flexible format that depends on the individual datasets under analysis. Thus, the output data of such normalizations will be poorly compatible with each other. Recently we proposed a new approach to gene expression data normalization termed Shambhala, which returns harmonized data in a uniform shape, where every expression profile is transformed into a pre-defined universal format. We previously showed that following shambhalization of human RNA profiles, overall tissue-specific clustering features are strongly retained while platform-specific clustering is dramatically reduced. Methods: Here, we tested Shambhala performance in retention of fold-change gene expression features and other functional characteristics of gene clusters such as pathway activation levels and predicted cancer drug activity scores. Results: Using 6,793 cancer and 11,135 normal tissue gene expression profiles from the literature and experimental datasets, we applied twelve performance criteria for different versions of Shambhala and other methods of transcriptomic harmonization with flexible output data format. Such criteria dealt with the biological type classifiers, hierarchical clustering, correlation/regression properties, stability of drug efficiency scores, and data quality for using machine learning classifiers. Discussion: The Shambhala-2 harmonizer demonstrated the best results, with correlation and linear regression coefficients close to 1 for the comparison of training vs validation datasets and less than half the instability in the calculation of drug efficiency scores compared to other methods.
Affiliation(s)
- Nicolas Borisov
  - Omicsway Corp, Walnut, CA, United States
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
- Alexander Simonov
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
  - Oncobox Ltd., Moscow, Russia
- Maxim Sorokin
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
  - Oncobox Ltd., Moscow, Russia
  - World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, Moscow, Russia
- Ella Kim
  - Clinic for Neurosurgery, Laboratory of Experimental Neurooncology, Johannes Gutenberg University Medical Centre, Mainz, Germany
- Denis Kuzmin
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
- Betul Karademir-Yilmaz
  - Department of Biochemistry, School of Medicine/Genetic and Metabolic Diseases Research and Investigation Center (GEMHAM) Marmara University, Istanbul, Türkiye
- Anton Buzdin
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
  - World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, Moscow, Russia
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia
  - PathoBiology Group, European Organization for Research and Treatment of Cancer (EORTC), Brussels, Belgium
4
Ai N, Yang Z, Yuan H, Ouyang D, Miao R, Ji Y, Liang Y. A distributed sparse logistic regression with $L_{1/2}$ regularization for microarray biomarker discovery in cancer classification. Soft Comput 2022. [DOI: 10.1007/s00500-022-07551-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022]
5
Borisov N, Buzdin A. Transcriptomic Harmonization as the Way for Suppressing Cross-Platform Bias and Batch Effect. Biomedicines 2022; 10:2318. [PMID: 36140419 PMCID: PMC9496268 DOI: 10.3390/biomedicines10092318] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/22/2022] [Revised: 09/14/2022] [Accepted: 09/16/2022] [Indexed: 11/16/2022] Open
Abstract
(1) Background: The emergence of methods interrogating gene expression at high throughput gave birth to quantitative transcriptomics, but also raised the question of how to compare expression profiles obtained using different equipment and protocols and/or in different series of experiments. Addressing this issue is challenging, because all of the above variables can dramatically influence gene expression signals and, therefore, cause a plethora of peculiar features in the transcriptomic profiles. Millions of transcriptomic profiles have been obtained and deposited in public databases, but their usefulness is strongly limited by these inter-comparison issues; (2) Methods: Dozens of methods and software packages that can be generally classified as either flexible or predefined format harmonizers have been proposed, but none has to date become the gold standard for unification of this type of Big Data; (3) Results: However, recent developments show that platform/protocol/batch bias can be efficiently reduced, and not only for comparisons of limited transcriptomic datasets. Instruments have been proposed for transforming gene expression profiles into a universal, uniformly shaped format that can support multiple inter-comparisons at reasonable computational cost. This forms a basis for universal indexing of all or most types of RNA sequencing and microarray hybridization profiles; (4) Conclusions: In this paper, we attempted to overview the landscape of modern approaches and methods in transcriptomic harmonization and focused on the practical aspects of their application.
Affiliation(s)
- Nicolas Borisov
  - World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, 119435 Moscow, Russia
  - Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
- Anton Buzdin
  - World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, 119435 Moscow, Russia
  - Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, 117997 Moscow, Russia
  - PathoBiology Group, European Organization for Research and Treatment of Cancer (EORTC), 1200 Brussels, Belgium
6
Zhang Y, Sun H, Mandava A, Aevermann BD, Kollmann TR, Scheuermann RH, Qiu X, Qian Y. FastMix: a versatile data integration pipeline for cell type-specific biomarker inference. Bioinformatics 2022; 38:4735-4744. [PMID: 36018232 PMCID: PMC9801972 DOI: 10.1093/bioinformatics/btac585] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2022] [Revised: 08/18/2022] [Accepted: 08/25/2022] [Indexed: 01/07/2023] Open
Abstract
MOTIVATION: Flow cytometry (FCM) and transcription profiling are two widely used assays in translational immunology research. However, there is no data integration pipeline for analyzing these two types of assays together with experiment variables for biomarker inference. Current FCM data analysis mainly relies on subjective manual gating analysis, which is difficult to integrate directly with other automated computational methods. Existing deconvolutional analysis of bulk transcriptomics relies on predefined marker genes in the transcriptomics data, which are unavailable for novel cell types, and does not utilize the FCM data that provide canonical phenotypic definitions of the cell types. RESULTS: We developed a novel analytics pipeline, FastMix, for computational immunology, which integrates flow cytometry, bulk transcriptomics and clinical covariates for identifying cell type-specific gene expression signatures and biomarker genes. FastMix addresses the 'large p, small n' problem in the gene expression and flow cytometry integration analysis via a linear mixed effects model (LMER) for both cross-sectional and longitudinal studies. Its novel moment-based estimator not only reduces bias in parameter estimation but also is more efficient than iterative optimization. The FastMix pipeline also includes a cutting-edge flow cytometry data analysis method, DAFi, for identifying cell populations of interest and their characteristics. Simulation studies showed that FastMix produced smaller type I/II errors than competing methods. Validation using real data of two vaccine studies showed that FastMix identified a set of signature genes consistent with independent single-cell RNA-seq analysis, producing additional interesting findings. AVAILABILITY AND IMPLEMENTATION: Source code of FastMix is publicly available at https://github.com/terrysun0302/FastMix. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Aishwarya Mandava
  - Department of Informatics, J. Craig Venter Institute, La Jolla, CA 92037, USA
- Brian D Aevermann
  - Department of Informatics, J. Craig Venter Institute, La Jolla, CA 92037, USA
- Tobias R Kollmann
  - Systems Vaccinology, Telethon Kids Institute, Perth Children’s Hospital, University of Western Australia, Nedlands, WA 6009, Australia
- Richard H Scheuermann
  - Department of Informatics, J. Craig Venter Institute, La Jolla, CA 92037, USA
  - Department of Pathology, University of California, San Diego, La Jolla, CA 92093, USA
- Xing Qiu
  - To whom correspondence should be addressed.
- Yu Qian
  - To whom correspondence should be addressed.
7
Junet V, Farrés J, Mas JM, Daura X. CuBlock: a cross-platform normalization method for gene-expression microarrays. Bioinformatics 2021; 37:2365-2373. [PMID: 33609102 PMCID: PMC8388031 DOI: 10.1093/bioinformatics/btab105] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/14/2020] [Revised: 02/04/2021] [Accepted: 02/16/2021] [Indexed: 12/28/2022] Open
Abstract
Motivation: Cross-(multi)platform normalization of gene-expression microarray data remains an unresolved issue. Despite the existence of several algorithms, they are either constrained by the need to normalize all samples of all platforms together, compromising scalability and reuse, by adherence to the platforms of a specific provider, or simply by poor performance. In addition, many of the methods presented in the literature have not been specifically tested against multi-platform data and/or other methods applicable in this context. Thus, we set out to develop a normalization algorithm appropriate for gene-expression studies based on multiple, potentially large microarray sets collected along multiple platforms and at different times, applicable in systematic studies aimed at extracting knowledge from the wealth of microarray data available in public repositories; for example, for the extraction of Real-World Data to complement data from Randomized Controlled Trials. Our main focus or criterion for performance was on the capacity of the algorithm to properly separate samples from different biological groups. Results: We present CuBlock, an algorithm addressing this objective, together with a strategy to validate cross-platform normalization methods. To validate the algorithm and benchmark it against existing methods, we used two distinct datasets, one specifically generated for testing and standardization purposes and one from an actual experimental study. Using these datasets, we benchmarked CuBlock against ComBat (Johnson et al., 2007), UPC (Piccolo et al., 2013), YuGene (Lê Cao et al., 2014), DBNorm (Meng et al., 2017), Shambhala (Borisov et al., 2019) and a simple log2 transform as reference. We note that many other popular normalization methods are not applicable in this context. CuBlock was the only algorithm in this group that could always and clearly differentiate the underlying biological groups after mixing the data, from up to six different platforms in this study. Availability and implementation: CuBlock can be downloaded from https://www.mathworks.com/matlabcentral/fileexchange/77882-cublock. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Valentin Junet
  - Anaxomics Biotech SL, Barcelona, 08008, Spain
  - Institute of Biotechnology and Biomedicine, Universitat Autònoma de Barcelona, 08193, Spain
- José M Mas
  - Anaxomics Biotech SL, Barcelona, 08008, Spain
- Xavier Daura
  - Institute of Biotechnology and Biomedicine, Universitat Autònoma de Barcelona, 08193, Spain
  - Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, 08010, Spain
8
Lung PY, Zhong D, Pang X, Li Y, Zhang J. Maximizing the reusability of gene expression data by predicting missing metadata. PLoS Comput Biol 2020; 16:e1007450. [PMID: 33156882 PMCID: PMC7673503 DOI: 10.1371/journal.pcbi.1007450] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/28/2019] [Revised: 11/18/2020] [Accepted: 10/09/2020] [Indexed: 11/18/2022] Open
Abstract
Reusability is part of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
Affiliation(s)
- Pei-Yau Lung
  - Department of Statistics, Florida State University, Tallahassee, United States of America
- Dongrui Zhong
  - Department of Statistics, Florida State University, Tallahassee, United States of America
- Xiaodong Pang
  - Insilicom LLC, Tallahassee, United States of America
- Yan Li
  - Department of Breast Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, China
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Tallahassee, United States of America
9
Jiang A, Hilton LK, Tang J, Rushton CK, Grande BM, Scott DW, Morin RD. PRPS-ST: A protocol-agnostic self-training method for gene expression-based classification of blood cancers. Blood Cancer Discov 2020; 1:244-257. [PMID: 33392514 DOI: 10.1158/2643-3230.bcd-20-0076] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/27/2022] Open
Abstract
Gene expression classifiers are gaining increasing popularity for stratifying tumors into subgroups with distinct biological features. A fundamental limitation shared by current classifiers is the requirement for comparable training and testing data sets. Here, we describe a self-training implementation of our probability ratio-based classification prediction score method (PRPS-ST), which facilitates the porting of existing classification models to other gene expression data sets. In comparison to gold standards, we demonstrate favorable performance of PRPS-ST in gene expression-based classification of DLBCL and B-ALL using a diverse variety of gene expression data types and pre-processing methods, including in classifications with a high degree of class imbalance. Tumors classified by our method were significantly enriched for prototypical genetic features of their respective subgroups. Interestingly, this included cases that were unclassifiable by established methods, implying the potential enhanced sensitivity of PRPS-ST.
Affiliation(s)
- Aixiang Jiang
  - Department of Pathology & Laboratory Medicine, University of British Columbia, Vancouver, BC, Canada
  - BC Cancer Centre for Lymphoid Cancer, Vancouver, BC, Canada
  - Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada
- Laura K Hilton
  - BC Cancer Centre for Lymphoid Cancer, Vancouver, BC, Canada
  - Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
- Jeffrey Tang
  - Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada
- Christopher K Rushton
  - Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada
- Bruno M Grande
  - Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada
- David W Scott
  - BC Cancer Centre for Lymphoid Cancer, Vancouver, BC, Canada
  - Department of Medicine, University of British Columbia, Vancouver, BC, Canada
- Ryan D Morin
  - Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada