1. Song L, Reese JG, Platt MA, Lewis C, Eardley-Brunt ASJ, Sun B, Ansorge O, Vallance C. Advancing atmospheric solids analysis probe mass spectrometry applications: a multifaceted approach to optimising clinical data set generation. Analyst 2025. [PMID: 40372210 PMCID: PMC12080459 DOI: 10.1039/d5an00166h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2025] [Accepted: 05/06/2025] [Indexed: 05/16/2025]
Abstract
The use of rapid mass spectrometry techniques, such as atmospheric-solids-analysis-probe mass spectrometry (ASAP-MS), in the analysis of metabolite patterns in clinical samples holds significant promise for developing new diagnostic tools and enabling rapid disease screening. The rapid measurement times, ease of use, and relatively low cost of ASAP-MS make it an appealing option for use in clinical settings. However, despite the potential of such approaches, a number of important experimental considerations are often overlooked. As well as instrument-specific choices and settings, these include the treatment of background noise and/or contaminant peaks in the mass spectra, and the influence of consumables, different users, and batch effects more generally. The present study assesses the impact of these various factors on measurement accuracy and reproducibility, using human brain and cerebrospinal fluid samples as examples. Based on our results, we make a series of recommendations relating to optimisation of measurement and cleaning protocols, consumable selection, and batch effect detection and correction, in order to optimise the reliability and reproducibility of ASAP-MS measurements in clinical settings.
Affiliation(s)
- Liwen Song
  - Department of Chemistry, University of Oxford, Chemistry Research Laboratory, 12 Mansfield Rd, Oxford OX1 3TA, UK.
- Jasmine G Reese
  - Academic Unit of Neuropathology, Nuffield Department of Clinical Neurosciences, University of Oxford, John Radcliffe Hospital, Oxford OX3 9DU, UK
- Michael A Platt
  - Department of Chemistry, University of Oxford, Chemistry Research Laboratory, 12 Mansfield Rd, Oxford OX1 3TA, UK.
- Claire Lewis
  - Academic Unit of Neuropathology, Nuffield Department of Clinical Neurosciences, University of Oxford, John Radcliffe Hospital, Oxford OX3 9DU, UK
- Annabel S J Eardley-Brunt
  - Department of Chemistry, University of Oxford, Chemistry Research Laboratory, 12 Mansfield Rd, Oxford OX1 3TA, UK.
- Bo Sun
  - Academic Unit of Neuropathology, Nuffield Department of Clinical Neurosciences, University of Oxford, John Radcliffe Hospital, Oxford OX3 9DU, UK
- Olaf Ansorge
  - Academic Unit of Neuropathology, Nuffield Department of Clinical Neurosciences, University of Oxford, John Radcliffe Hospital, Oxford OX3 9DU, UK
- Claire Vallance
  - Department of Chemistry, University of Oxford, Chemistry Research Laboratory, 12 Mansfield Rd, Oxford OX1 3TA, UK.
2. Yang R, Celino-Brady FT, Dunleavy JEM, Vigh-Conrad KA, Atkins GR, Hvasta RL, Pombar CRX, Yatsenko AN, Orwig KE, O'Bryan MK, Lima AC, Conrad DF. SATINN v2: automated image analysis for mouse testis histology with multi-laboratory data integration. Biol Reprod 2025; 112:996-1014. [PMID: 39961022 DOI: 10.1093/biolre/ioaf033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2024] [Revised: 11/08/2024] [Accepted: 02/16/2025] [Indexed: 03/21/2025] Open
Abstract
Analysis of testis histology is fundamental to the study of male fertility, but it is a slow task with a high skill threshold. Here, we describe new neural network models for the automated classification of cell types and tubule stages from whole-slide brightfield images of mouse testis. The cell type classifier recognizes 14 cell types, including multiple steps of meiosis I prophase, with an external validation accuracy of 96%. The tubule stage classifier distinguishes all 12 canonical tubule stages with an external validation accuracy of 63%, which increases to 96% when allowing for ±1 stage tolerance. We addressed the generalizability of SATINN through extensive training diversification and testing on external (non-training population) wildtype and mutant datasets. This allowed us to use SATINN to successfully process data generated in multiple laboratories. We used SATINN to analyze testis images from eight different mutant lines, generated in three different labs with a range of tissue processing protocols. Finally, we show that it is possible to use SATINN output to cluster histology images in latent space, which, when applied to the eight mutant lines, reveals known relationships in their pathology. This work represents significant progress towards a tool for robust, automated testis histopathology that can be used by multiple labs.
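To illustrate what an accuracy "allowing for ±1 stage tolerance" can mean, the following is a minimal Python sketch and not the authors' evaluation code. The function name `tolerant_accuracy` is hypothetical, stages are encoded as integers 1 to 12, and the sketch assumes that stages are treated cyclically (stage 12 adjacent to stage 1), which the abstract does not state.

```python
def tolerant_accuracy(true_stages, predicted_stages, tolerance=1, n_stages=12):
    """Fraction of predictions within `tolerance` stages of the true stage.

    Assumption (not from the abstract): stages form a cycle, so with
    n_stages=12 a prediction of stage 1 is one step away from stage 12.
    """
    if len(true_stages) != len(predicted_stages):
        raise ValueError("true and predicted stage lists must have equal length")
    if not true_stages:
        raise ValueError("at least one labelled tubule is required")

    hits = 0
    for true_stage, predicted_stage in zip(true_stages, predicted_stages):
        linear_gap = abs(true_stage - predicted_stage) % n_stages
        # Distance around the cycle is the shorter of the two directions.
        cyclic_gap = min(linear_gap, n_stages - linear_gap)
        if cyclic_gap <= tolerance:
            hits += 1
    return hits / len(true_stages)


# Illustrative labels only; these are not data from the study.
true_labels = [1, 5, 12, 7]
predicted_labels = [2, 5, 1, 10]
exact = tolerant_accuracy(true_labels, predicted_labels, tolerance=0)
within_one = tolerant_accuracy(true_labels, predicted_labels, tolerance=1)
```

With `tolerance=0` the function reduces to ordinary exact-match accuracy, so the same routine can report both figures quoted in the abstract.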
Affiliation(s)
- Ran Yang
  - Division of Genetics, Oregon National Primate Research Center, Oregon Health and Science University, Portland, OR, United States
- Fritzie T Celino-Brady
  - Division of Genetics, Oregon National Primate Research Center, Oregon Health and Science University, Portland, OR, United States
- Jessica E M Dunleavy
  - School of Biosciences and Bio21 Molecular Science and Biotechnology Institute, Faculty of Science, The University of Melbourne, Melbourne, VIC, Australia
- Katinka A Vigh-Conrad
  - Division of Genetics, Oregon National Primate Research Center, Oregon Health and Science University, Portland, OR, United States
- Georgia R Atkins
  - Department of Obstetrics, Gynecology and Reproductive Sciences, Magee-Womens Research Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States
  - Molecular Genetics and Developmental Biology Graduate Program, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States
- Rachel L Hvasta
  - Department of Obstetrics, Gynecology and Reproductive Sciences, Magee-Womens Research Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States
- Christopher R X Pombar
  - Department of Obstetrics, Gynecology and Reproductive Sciences, Magee-Womens Research Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States
- Alexander N Yatsenko
  - Department of Obstetrics, Gynecology and Reproductive Sciences, Magee-Womens Research Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States
- Kyle E Orwig
  - Department of Obstetrics, Gynecology and Reproductive Sciences, Magee-Womens Research Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States
- Moira K O'Bryan
  - School of Biosciences and Bio21 Molecular Science and Biotechnology Institute, Faculty of Science, The University of Melbourne, Melbourne, VIC, Australia
- Ana C Lima
  - Division of Genetics, Oregon National Primate Research Center, Oregon Health and Science University, Portland, OR, United States
- Donald F Conrad
  - Division of Genetics, Oregon National Primate Research Center, Oregon Health and Science University, Portland, OR, United States
3. Lvovs D, Creason AL, Levine SS, Noble M, Mahurkar A, White O, Fertig EJ. Balancing ethical data sharing and open science for reproducible research in biomedical data science. Cell Rep Med 2025; 6:102080. [PMID: 40239625 PMCID: PMC12047515 DOI: 10.1016/j.xcrm.2025.102080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2025] [Revised: 03/19/2025] [Accepted: 03/19/2025] [Indexed: 04/18/2025]
Abstract
Analyses of large-scale health data in biomedical data science can help uncover new treatments and deepen our understanding of disease and fundamental biology. Here we examine the balance between ethical and responsible data sharing and open science practices that are essential for reproducible research in biomedical data science.
Affiliation(s)
- Dmitrijs Lvovs
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA; Department of Medicine, Division of Hematology/Oncology, University of Maryland School of Medicine, Baltimore, MD, USA
- Allison L Creason
  - Knight Cancer Institute, Oregon Health Science University, Portland, OR, USA; Biomedical Engineering Department, Oregon Health & Science University, Portland, OR, USA
- Stuart S Levine
  - BioMicro Center, Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Anup Mahurkar
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA; University of Maryland - Institute for Health Computing, Bethesda, MD, USA
- Owen White
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Elana J Fertig
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA; Department of Medicine, Division of Hematology/Oncology, University of Maryland School of Medicine, Baltimore, MD, USA; University of Maryland - Institute for Health Computing, Bethesda, MD, USA; Greenebaum Comprehensive Cancer Center, University of Maryland School of Medicine, Baltimore, MD, USA.
4. Yu Y, Mai Y, Zheng Y, Shi L. Assessing and mitigating batch effects in large-scale omics studies. Genome Biol 2024; 25:254. [PMID: 39363244 PMCID: PMC11447944 DOI: 10.1186/s13059-024-03401-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 07/11/2023] [Accepted: 09/23/2024] [Indexed: 10/05/2024] Open
Abstract
Batch effects in omics data are notoriously common technical variations unrelated to study objectives, and may result in misleading outcomes if uncorrected, or hinder biomedical discovery if over-corrected. Assessing and mitigating batch effects is crucial for ensuring the reliability and reproducibility of omics data and minimizing the impact of technical variations on biological interpretation. In this review, we highlight the profound negative impact of batch effects and the urgent need to address this challenging problem in large-scale omics studies. We summarize potential sources of batch effects, current progress in evaluating and correcting them, and consortium efforts aiming to tackle them.
Affiliation(s)
- Ying Yu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China.
- Yuanbang Mai
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
- Yuanting Zheng
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China.
- Leming Shi
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China.
  - Cancer Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.
  - International Human Phenome Institutes (Shanghai), Shanghai, China.
5. Hui HWH, Kong W, Goh WWB. Thinking points for effective batch correction on biomedical data. Brief Bioinform 2024; 25:bbae515. [PMID: 39397427 PMCID: PMC11471903 DOI: 10.1093/bib/bbae515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2024] [Revised: 09/11/2024] [Accepted: 10/01/2024] [Indexed: 10/15/2024] Open
Abstract
Batch effects introduce significant variability into high-dimensional data, complicating accurate analysis and leading to potentially misleading conclusions if not adequately addressed. Despite technological and algorithmic advancements in biomedical research, effectively managing batch effects remains a complex challenge requiring comprehensive considerations. This paper underscores the necessity of a flexible and holistic approach for selecting batch effect correction algorithms (BECAs), advocating for proper BECA evaluations and consideration of artificial intelligence-based strategies. We also discuss key challenges in batch effect correction, including the importance of uncovering hidden batch factors and understanding the impact of design imbalance, missing values, and aggressive correction. Our aim is to provide researchers with a robust framework for managing batch effects effectively and enhancing the reliability of high-dimensional data analyses.
Affiliation(s)
- Harvard Wai Hann Hui
  - Lee Kong Chian School of Medicine, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
- Weijia Kong
  - Lee Kong Chian School of Medicine, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, 59 Nanyang Drive, Singapore 636921, Singapore
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
  - Center for Biomedical Informatics, Nanyang Technological University, 59 Nanyang Dr, Singapore 636921, Singapore
  - Center of AI in Medicine, Nanyang Technological University, 59 Nanyang Dr, Singapore 636921, Singapore
  - Division of Neurology, Department of Brain Sciences, Faculty of Medicine, Imperial College London, Burlington Danes, The Hammersmith Hospital, Du Cane Road, London W12 0NN, United Kingdom
6. Goh WWB, Kabir MN, Yoo S, Wong L. Ten quick tips for ensuring machine learning model validity. PLoS Comput Biol 2024; 20:e1012402. [PMID: 39298376 DOI: 10.1371/journal.pcbi.1012402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/21/2024] Open
Abstract
Artificial Intelligence (AI) and Machine Learning (ML) models are increasingly deployed on biomedical and health data to shed light on biological mechanisms, predict disease outcomes, and support clinical decision-making. However, ensuring model validity is challenging. The 10 quick tips described here discuss useful practices on how to check AI/ML models from two perspectives: the user and the developer.
Affiliation(s)
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
  - Center for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
  - Center of AI in Medicine, Nanyang Technological University, Singapore, Singapore
  - Division of Neurology, Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom
- Mohammad Neamul Kabir
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
  - Center for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
- Sehwan Yoo
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
  - Center for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
- Limsoon Wong
  - School of Computing, National University of Singapore, Singapore, Singapore
  - Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
7. Goldstein Y, Cohen OT, Wald O, Bavli D, Kaplan T, Benny O. Particle uptake in cancer cells can predict malignancy and drug resistance using machine learning. Sci Adv 2024; 10:eadj4370. [PMID: 38809990 PMCID: PMC11314625 DOI: 10.1126/sciadv.adj4370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 04/23/2024] [Indexed: 05/31/2024]
Abstract
Tumor heterogeneity is a primary factor that contributes to treatment failure. Predictive tools, capable of classifying cancer cells based on their functions, may substantially enhance therapy and extend patient life span. The connection between cell biomechanics and cancer cell functions is used here to classify cells through mechanical measurements, via particle uptake. Machine learning (ML) was used to classify cells based on single-cell patterns of uptake of particles with diverse sizes. Three pairs of human cancer cell subpopulations, varied in their level of drug resistance or malignancy, were studied. Cells were allowed to interact with fluorescently labeled polystyrene particles ranging in size from 0.04 to 3.36 μm and analyzed for their uptake patterns using flow cytometry. ML algorithms accurately classified cancer cell subtypes with accuracy rates exceeding 95%. The uptake data were especially advantageous for morphologically similar cell subpopulations. Moreover, the uptake data were found to serve as a form of "normalization" that could reduce variation in repeated experiments.
Affiliation(s)
- Yoel Goldstein
  - Institute for Drug Research, The School of Pharmacy, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Ora T. Cohen
  - Institute for Drug Research, The School of Pharmacy, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Ori Wald
  - Department of Cardiothoracic Surgery, Hadassah Medical Center, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem, Israel
- Danny Bavli
  - Department of Stem Cell and Regenerative Biology, Harvard Stem Cell Institute, Harvard University, Cambridge, MA, USA
- Tommy Kaplan
  - School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
  - Department of Developmental Biology and Cancer Research, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Ofra Benny
  - Institute for Drug Research, The School of Pharmacy, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
8. Zhou R, Ng SK, Sung JJY, Goh WWB, Wong SH. Data pre-processing for analyzing microbiome data - A mini review. Comput Struct Biotechnol J 2023; 21:4804-4815. [PMID: 37841330 PMCID: PMC10569954 DOI: 10.1016/j.csbj.2023.10.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 10/01/2023] [Accepted: 10/01/2023] [Indexed: 10/17/2023] Open
Abstract
The human microbiome is an emerging research frontier due to its profound impacts on health. High-throughput microbiome sequencing enables studying microbial communities but suffers from analytical challenges. In particular, the lack of dedicated preprocessing methods to improve data quality impedes effective minimization of biases prior to downstream analysis. This review aims to address this gap by providing a comprehensive overview of preprocessing techniques relevant to microbiome research. We outline a typical workflow for microbiome data analysis. Preprocessing methods discussed include quality filtering, batch effect correction, imputation of missing values, normalization, and data transformation. We highlight strengths and limitations of each technique to serve as a practical guide for researchers and identify areas needing further methodological development. Establishing robust, standardized preprocessing will be essential for drawing valid biological conclusions from microbiome studies.
Affiliation(s)
- Ruwen Zhou
  - Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Road, 308232, Singapore
- Siu Kin Ng
  - Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Road, 308232, Singapore
- Joseph Jao Yiu Sung
  - Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Road, 308232, Singapore
  - Department of Gastroenterology and Hepatology, Tan Tock Seng Hospital, National Healthcare Group, 11 Jalan Tan Tock Seng, 308433, Singapore
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Road, 308232, Singapore
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore
  - Center for Biomedical Informatics, Nanyang Technological University, 59 Nanyang Drive, 636921, Singapore
- Sunny Hei Wong
  - Lee Kong Chian School of Medicine, Nanyang Technological University, 11 Mandalay Road, 308232, Singapore
  - Department of Gastroenterology and Hepatology, Tan Tock Seng Hospital, National Healthcare Group, 11 Jalan Tan Tock Seng, 308433, Singapore
9. Yu Y, Zhang N, Mai Y, Ren L, Chen Q, Cao Z, Chen Q, Liu Y, Hou W, Yang J, Hong H, Xu J, Tong W, Dong L, Shi L, Fang X, Zheng Y. Correcting batch effects in large-scale multiomics studies using a reference-material-based ratio method. Genome Biol 2023; 24:201. [PMID: 37674217 PMCID: PMC10483871 DOI: 10.1186/s13059-023-03047-z] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 11/08/2022] [Accepted: 05/18/2023] [Indexed: 09/08/2023] Open
Abstract
BACKGROUND: Batch effects are notoriously common technical variations in multiomics data and may result in misleading outcomes if uncorrected or over-corrected. A plethora of batch-effect correction algorithms have been proposed to facilitate data integration. However, their respective advantages and limitations have not been adequately assessed in terms of omics types, performance metrics, and application scenarios.
RESULTS: As part of the Quartet Project for quality control and data integration of multiomics profiling, we comprehensively assess the performance of seven batch effect correction algorithms based on different performance metrics of clinical relevance, i.e., the accuracy of identifying differentially expressed features, the robustness of predictive models, and the ability to accurately cluster cross-batch samples into their own donors. The ratio-based method, i.e., scaling absolute feature values of study samples relative to those of concurrently profiled reference material(s), is found to be much more effective and broadly applicable than others, especially when batch effects are completely confounded with biological factors of study interest. We further provide practical guidelines for implementing the ratio-based approach in increasingly large-scale multiomics studies.
CONCLUSIONS: Multiomics measurements are prone to batch effects, which can be effectively corrected using ratio-based scaling of the multiomics data. Our study lays the foundation for eliminating batch effects at a ratio scale.
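The ratio-based idea described in this abstract, scaling study-sample feature values relative to a reference material profiled in the same batch, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Quartet Project implementation: the function name `ratio_scale`, the data layout (lists of feature values), and the use of the per-feature mean across reference replicates are all assumptions made here for the example.

```python
from statistics import mean


def ratio_scale(study_samples, reference_replicates):
    """Scale each study sample by the reference material of the same batch.

    study_samples: dict mapping sample id -> list of feature values.
    reference_replicates: list of replicate measurements of the reference
        material from the same batch, each a list of feature values.
    Assumption: the per-feature reference level is the mean over replicates.
    """
    if not reference_replicates:
        raise ValueError("at least one reference replicate is required")
    n_features = len(reference_replicates[0])
    reference_level = [
        mean(replicate[j] for replicate in reference_replicates)
        for j in range(n_features)
    ]
    if any(level == 0 for level in reference_level):
        raise ValueError("reference level of zero makes the ratio undefined")
    return {
        sample_id: [value / ref for value, ref in zip(values, reference_level)]
        for sample_id, values in study_samples.items()
    }


# Toy data: batch 2 measures everything at twice the intensity of batch 1.
batch1 = ratio_scale({"A": [10, 20]}, [[5, 10], [5, 10]])
batch2 = ratio_scale({"A": [20, 40]}, [[10, 20], [10, 20]])
```

In this toy case a purely multiplicative batch effect cancels, because sample and reference are inflated by the same factor within a batch. In practice, ratios are often taken on a log scale, where the division becomes a subtraction.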
Affiliation(s)
- Ying Yu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Naixin Zhang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Yuanbang Mai
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Luyao Ren
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Qiaochu Chen
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Zehui Cao
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Qingwang Chen
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Yaqing Liu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Wanwan Hou
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Jingcheng Yang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
  - Greater Bay Area Institute of Precision Medicine, Guangzhou, Guangdong, China
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- Joshua Xu
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- Weida Tong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- Leming Shi
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.
  - International Human Phenome Institutes, Shanghai, China.
- Xiang Fang
  - National Institute of Metrology, Beijing, China.
- Yuanting Zheng
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.
10. Goh WWB, Hui HWH, Wong L. How missing value imputation is confounded with batch effects and what you can do about it. Drug Discov Today 2023; 28:103661. [PMID: 37301250 DOI: 10.1016/j.drudis.2023.103661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Revised: 05/31/2023] [Accepted: 06/05/2023] [Indexed: 06/12/2023]
Abstract
In data-processing pipelines, upstream steps can influence downstream processes because of their sequential nature. Among these data-processing steps, batch effect (BE) correction (BEC) and missing value imputation (MVI) are crucial for ensuring data suitability for advanced modeling and reducing the likelihood of false discoveries. Although BEC-MVI interactions are not well studied, they are ultimately interdependent. Batch sensitization can improve the quality of MVI. Conversely, accounting for missingness also improves proper BE estimation in BEC. Here, we discuss how BEC and MVI are interconnected and interdependent. We show how batch sensitization can improve any MVI and bring attention to the idea of BE-associated missing values (BEAMs). Finally, we discuss how batch-class imbalance problems can be mitigated by borrowing ideas from machine learning.
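As a rough illustration of why batch awareness matters for imputation, the sketch below fills missing values with the mean of the same batch rather than a global mean. It is a minimal example, not the authors' method: the function name `batchwise_mean_impute` is hypothetical, missing values are represented as `None`, and reading "batch sensitization" as "impute within each batch separately" is an assumption made for this example.

```python
from statistics import mean


def batchwise_mean_impute(values, batches):
    """Replace None entries with the mean of observed values in the same batch.

    values: list of measurements for one feature, with None for missing.
    batches: list of batch labels, aligned with `values`.
    """
    if len(values) != len(batches):
        raise ValueError("values and batches must have equal length")

    # Collect the observed (non-missing) values for each batch.
    observed = {}
    for value, batch in zip(values, batches):
        if value is not None:
            observed.setdefault(batch, []).append(value)

    imputed = []
    for value, batch in zip(values, batches):
        if value is None:
            if batch not in observed:
                raise ValueError(f"batch {batch!r} has no observed values")
            imputed.append(mean(observed[batch]))
        else:
            imputed.append(value)
    return imputed


# Toy data: batch "b" is shifted upward relative to batch "a".
measurements = [1.0, None, 3.0, 10.0, None, 12.0]
batch_labels = ["a", "a", "a", "b", "b", "b"]
filled = batchwise_mean_impute(measurements, batch_labels)
```

A global mean over the four observed values would give 6.5 for both gaps, a value that resembles neither batch; the within-batch means (2.0 and 11.0) keep the imputed values consistent with each batch's own level.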
Affiliation(s)
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore; Center for Biomedical Informatics, Nanyang Technological University, Singapore.
- Harvard Wai Hann Hui
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, Singapore; Department of Pathology, National University of Singapore, Singapore.
11. Qian SH, Shi MW, Wang DY, Fear JM, Chen L, Tu YX, Liu HS, Zhang Y, Zhang SJ, Yu SS, Oliver B, Chen ZX. Integrating massive RNA-seq data to elucidate transcriptome dynamics in Drosophila melanogaster. Brief Bioinform 2023; 24:bbad177. [PMID: 37232385 PMCID: PMC10505420 DOI: 10.1093/bib/bbad177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 04/19/2023] [Accepted: 04/20/2023] [Indexed: 05/27/2023] Open
Abstract
The volume of ribonucleic acid (RNA)-seq data has increased exponentially, providing numerous new insights into various biological processes. However, due to significant practical challenges, such as data heterogeneity, it is still difficult to ensure the quality of these data when integrated. Although some quality control methods have been developed, sample consistency is rarely considered and these methods are susceptible to artificial factors. Here, we developed MassiveQC, an unsupervised machine learning-based approach, to automatically download and filter large-scale high-throughput data. In addition to the read quality used in other tools, MassiveQC also uses the alignment and expression quality as model features. Meanwhile, it is user-friendly since the cutoff is generated from self-reporting and is applicable to multimodal data. To explore its value, we applied MassiveQC to Drosophila RNA-seq data and generated a comprehensive transcriptome atlas across 28 tissues from embryogenesis to adulthood. We systematically characterized fly gene expression dynamics and found that genes with high expression dynamics were likely to be evolutionarily young and expressed at late developmental stages, exhibiting high nonsynonymous substitution rates and low phenotypic severity, and they were involved in simple regulatory programs. We also discovered that human and Drosophila had strong positive correlations in gene expression in orthologous organs, revealing the great potential of the Drosophila system for studying human development and disease.
Affiliation(s)
- Sheng Hu Qian
- Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China
| | - Meng-Wei Shi
- Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China
| | - Dan-Yang Wang
- Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China
| | - Justin M Fear
- Section of Developmental Genomics, National Institute of Diabetes and Kidney and Digestive Diseases, National Institutes of Health, Bethesda, MD 20892, USA
| | - Lu Chen
- Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China
| | - Yi-Xuan Tu
- Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China
| | - Hong-Shan Liu
- Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China
| | - Yuan Zhang
- Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China
| | - Shuai-Jie Zhang
- Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China
| | - Shan-Shan Yu
- Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China
| | - Brian Oliver
- Section of Developmental Genomics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, USA
| | - Zhen-Xia Chen
- Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China
- Section of Developmental Genomics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University, Wuhan 430070, China
- Interdisciplinary Sciences Institute, Huazhong Agricultural University, Wuhan 430070, China
- Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shenzhen 518000, China
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518000, China
| |
|
12
|
Zhao Y, Wang X, Sun T, Shan P, Zhan Z, Zhao Z, Jiang Y, Qu M, Lv Q, Wang Y, Liu P, Chen S. Artificial intelligence-driven electrochemical immunosensing biochips in multi-component detection. Biomicrofluidics 2023; 17:041301. [PMID: 37614678 PMCID: PMC10444200 DOI: 10.1063/5.0160808] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/05/2023] [Accepted: 08/01/2023] [Indexed: 08/25/2023]
Abstract
Electrochemical Immunosensing (EI) combines electrochemical analysis and immunology principles and is characterized by its simplicity, rapid detection, high sensitivity, and specificity. EI has become an important approach in various fields, such as clinical diagnosis, disease prevention and treatment, environmental monitoring, and food safety. However, EI multi-component detection still faces two major bottlenecks: first, the lack of cost-effective and portable detection platforms; second, the difficulty in eliminating batch differences and accurately decoupling signals from multiple analytes. With the gradual maturation of biochip technology, high-throughput analysis and portable detection utilizing the advantages of miniaturized chips, high sensitivity, and low cost have become possible. Meanwhile, Artificial Intelligence (AI) enables accurate decoupling of signals and enhances the sensitivity and specificity of multi-component detection. We believe that by evaluating and analyzing the characteristics, benefits, and linkages of EI, biochip, and AI technologies, we may considerably accelerate the development of EI multi-component detection. Therefore, we propose three specific prospects: first, AI can enhance and optimize the performance of the EI biochips, addressing the issue of multi-component detection for portable platforms. Second, the AI-enhanced EI biochips can be widely applied in home care, medical healthcare, and other areas. Third, the cross-fusion and innovation of EI, biochip, and AI technologies will effectively solve key bottlenecks in biochip detection, promoting interdisciplinary development. However, challenges may arise from AI algorithms that are difficult to explain and limited data access. Nevertheless, we believe that with technological advances and further research, there will be more methods and technologies to overcome these challenges.
Affiliation(s)
- Yuliang Zhao
- School of Control Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066000, Hebei, China
| | - Xiaoai Wang
- School of Control Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066000, Hebei, China
| | - Tingting Sun
- School of Control Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066000, Hebei, China
| | - Peng Shan
- School of Control Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066000, Hebei, China
| | - Zhikun Zhan
- School of Control Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066000, Hebei, China
| | - Zhongpeng Zhao
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences (AMMS), Beijing 100071, China
| | - Yongqiang Jiang
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences (AMMS), Beijing 100071, China
| | - Mingyue Qu
- The PLA Rocket Force Characteristic Medical Center, Beijing 100088, China
| | - Qingyu Lv
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences (AMMS), Beijing 100071, China
| | - Ying Wang
- School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China
| | - Peng Liu
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences (AMMS), Beijing 100071, China
| | - Shaolong Chen
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences (AMMS), Beijing 100071, China
| |
|
13
|
Kong W, Hui HWH, Peng H, Goh WWB. Dealing with missing values in proteomics data. Proteomics 2022; 22:e2200092. [PMID: 36349819 DOI: 10.1002/pmic.202200092] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 07/14/2022] [Revised: 09/15/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022]
Abstract
Proteomics data are often plagued with missingness issues. These missing values (MVs) threaten the integrity of subsequent statistical analyses by reducing statistical power, introducing bias, and failing to represent the true sample. Over the years, several categories of missing value imputation (MVI) methods have been developed and adapted for proteomics data. These MVI methods perform their tasks based on different prior assumptions (e.g., data are normally or independently distributed) and operating principles (e.g., the algorithm is built to address random missingness only), resulting in varying levels of performance even when dealing with the same dataset. Thus, to achieve a satisfactory outcome, a suitable MVI method must be selected. To guide decision making on a suitable MVI method, we provide a decision chart that facilitates strategic considerations on datasets presenting different characteristics. We also draw attention to other issues that can affect proper MVI, such as the presence of confounders (e.g., batch effects), which can influence MVI performance. These, too, should be considered during or before MVI.
Affiliation(s)
- Weijia Kong
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore.,School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| | - Harvard Wai Hann Hui
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore.,School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| | - Hui Peng
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore.,School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| | - Wilson Wen Bin Goh
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore.,School of Biological Sciences, Nanyang Technological University, Singapore, Singapore.,Centre for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
| |
|
14
|
Magazzù G, Zampieri G, Angione C. Clinical stratification improves the diagnostic accuracy of small omics datasets within machine learning and genome-scale metabolic modelling methods. Comput Biol Med 2022; 151:106244. [PMID: 36343407 DOI: 10.1016/j.compbiomed.2022.106244] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/15/2022] [Revised: 10/07/2022] [Accepted: 10/22/2022] [Indexed: 12/27/2022]
Abstract
BACKGROUND: Recently, multi-omic machine learning architectures have been proposed for the early detection of cancer. However, for rare cancers and their associated small datasets, it is still unclear how to use the available multi-omics data to achieve a mechanistic prediction of cancer onset and progression. Hepatoblastoma is the most frequent liver cancer in infancy and childhood, and its incidence has lately been increasing in several developed countries. Even though some studies have been conducted to understand the causes of its onset and discover potential biomarkers, the role of metabolic rewiring has not been investigated in depth so far. METHODS: Here, we propose and implement an interpretable multi-omics pipeline that combines mechanistic knowledge from genome-scale metabolic models with machine learning algorithms, and we use it to characterise the underlying mechanisms controlling hepatoblastoma. RESULTS AND CONCLUSIONS: While the obtained machine learning models generally present a high diagnostic classification accuracy, our results show that the type of omics combinations used as input to the machine learning models strongly affects the detection of important genes, reactions and metabolic pathways linked to hepatoblastoma. Our method also suggests that, in the context of computer-aided diagnosis of cancer, optimal diagnostic accuracy can be achieved by adopting a combination of omics that depends on the patient's clinical characteristics.
Affiliation(s)
- Giuseppe Magazzù
- School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, England, United Kingdom
| | - Guido Zampieri
- School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, England, United Kingdom; Department of Biology, University of Padova, Padova, Italy
| | - Claudio Angione
- School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, England, United Kingdom; Centre for Digital Innovation, Teesside University, Middlesbrough, England, United Kingdom; National Horizons Centre, Teesside University, Darlington, England, United Kingdom.
| |
|
15
|
Phua SX, Lim KP, Goh WWB. Perspectives for better batch effect correction in mass-spectrometry-based proteomics. Comput Struct Biotechnol J 2022; 20:4369-4375. [PMID: 36051874 PMCID: PMC9411064 DOI: 10.1016/j.csbj.2022.08.022] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 04/14/2022] [Revised: 08/09/2022] [Accepted: 08/09/2022] [Indexed: 11/08/2022] Open
Abstract
Mass-spectrometry-based proteomics presents some unique challenges for batch effect correction. Batch effects are technical sources of variation that can confound analysis and are usually non-biological in nature. As proteomic analysis involves several stages of data transformation from spectra to protein, the decision on when, and on what, to apply batch correction is often unclear. Here, we explore several issues pertinent to batch effect correction. The first involves applications of batch effect correction that require prior knowledge of batch factors, and the exploration of data to uncover new or unknown batch factors. The second considers recent literature suggesting that there is no single best batch effect correction algorithm; that is, instead of seeking a best approach, one may ask what a suitable approach is. The third considers issues of batch effect detection. Finally, we look at potential developments for proteomics-specific batch effect correction methods and how to perform better functional evaluations on batch-corrected data.
Affiliation(s)
- Ser-Xian Phua
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- School of Biological Sciences, Nanyang Technological University, Singapore
| | - Kai-Peng Lim
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- School of Biological Sciences, Nanyang Technological University, Singapore
| | - Wilson Wen-Bin Goh
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- School of Biological Sciences, Nanyang Technological University, Singapore
- Center for Biomedical Informatics, Nanyang Technological University, Singapore
| |
|