1. Han H. Challenges of reproducible AI in biomedical data science. BMC Med Genomics 2025; 18:8. [PMID: 39794788 PMCID: PMC11724458 DOI: 10.1186/s12920-024-02072-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/13/2025] [Open access]
Abstract
Artificial intelligence (AI) is revolutionizing biomedical data science at an unprecedented pace, transforming various aspects of the field with remarkable speed and depth. However, a critical issue remains unclear: how reproducible are the AI models and systems employed in biomedical data science? In this study, we examine the challenges of AI reproducibility by analyzing the factors influenced by data, model, and learning complexities, as well as through a game-theoretical perspective. While adherence to reproducibility standards is essential for the long-term advancement of AI, the conflict between following these standards and aligning with researchers' personal goals remains a significant hurdle in achieving AI reproducibility.
Affiliation(s)
- Henry Han
- The Laboratory of Data Science and Artificial Intelligence Innovation, Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, 76798, USA.
2. Yang X, Tang X, Li C, Han H. Singular value thresholding two-stage matrix completion for drug sensitivity discovery. Comput Biol Chem 2024; 110:108071. [PMID: 38718497 DOI: 10.1016/j.compbiolchem.2024.108071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2023] [Revised: 04/06/2024] [Accepted: 04/11/2024] [Indexed: 05/27/2024]
Abstract
Incomplete data presents significant challenges in drug sensitivity analysis, especially in critical areas like oncology, where precision is paramount. Our study introduces an innovative imputation method designed specifically for low-rank matrices, addressing the crucial challenge of data completion in anticancer drug sensitivity testing. Our method unfolds in two main stages: Initially, the singular value thresholding algorithm is employed for preliminary matrix completion, establishing a solid foundation for subsequent steps. Then, the matrix rows are segmented into distinct blocks based on hierarchical clustering of correlation coefficients, applying singular value thresholding to the largest block, which has been proved to possess the largest entropy. This is followed by a refined data restoration process, where the reconstructed largest block is integrated into the initial matrix completion to achieve the final matrix completion. Compared to other methods, our approach not only improves the accuracy of data restoration but also ensures the integrity and reliability of the imputed values, establishing it as a robust tool for future drug sensitivity analysis.
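The first stage described above relies on singular value thresholding (SVT). The following is a minimal sketch of the generic SVT iteration only, not the authors' two-stage method: the clustering and block-refinement stage is omitted, and the function name `svt_complete`, the parameters `tau` and `delta`, and the synthetic rank-1 matrix are illustrative assumptions that do not come from the paper.

```python
import numpy as np


def svt_complete(M, mask, tau, delta, n_iter=1000, tol=1e-4):
    """Generic singular value thresholding for matrix completion.

    M     : matrix with observed entries filled in (anything where mask == 0)
    mask  : array of 0/1 floats, 1 where an entry of M is observed
    tau   : soft-threshold applied to the singular values (assumed value)
    delta : step size; values in (0, 2) are the theoretically safe range
    """
    Y = np.zeros_like(M, dtype=float)
    X = np.zeros_like(M, dtype=float)
    observed_norm = np.linalg.norm(mask * M)
    for _ in range(n_iter):
        # Shrink the singular values of Y by tau (soft thresholding).
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt
        # Residual on the observed entries only.
        residual = mask * (M - X)
        if np.linalg.norm(residual) <= tol * observed_norm:
            break
        Y += delta * residual
    return X


if __name__ == "__main__":
    # Synthetic rank-1 matrix with roughly 70% of entries observed (made-up data).
    rng = np.random.default_rng(0)
    truth = np.outer(rng.normal(size=20), rng.normal(size=20))
    mask = (rng.random(truth.shape) < 0.7).astype(float)
    estimate = svt_complete(truth * mask, mask, tau=100.0, delta=1.5, n_iter=2000)
    hidden = 1.0 - mask
    rel_err = np.linalg.norm(hidden * (truth - estimate)) / np.linalg.norm(hidden * truth)
    print(f"relative error on hidden entries: {rel_err:.3f}")
```

A larger `tau` pushes the solution toward lower rank at the cost of more iterations; in practice both `tau` and `delta` would need tuning for a real drug-sensitivity matrix.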
Affiliation(s)
- Xuemei Yang
- School of Mathematics and Statistics, Xianyang Normal University, Xianyang, 712000, China.
- Xiaoduan Tang
- School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China.
- Chun Li
- College of Elementary Education, Hainan Normal University, Haikou 571158, China; Key Laboratory of Data Science and Intelligence Education of Ministry of Education, Hainan Normal University, Haikou 571158, China.
- Henry Han
- The Laboratory of Data Science and Artificial Intelligence Innovation, Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, 76798, USA.
3. Men K, Li Y, Wang X, Zhang G, Hu J, Gao Y, Han A, Liu W, Han H. Estimate the incubation period of coronavirus 2019 (COVID-19). Comput Biol Med 2023; 158:106794. [PMID: 37044045 PMCID: PMC10062796 DOI: 10.1016/j.compbiomed.2023.106794] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 11/27/2022] [Revised: 02/23/2023] [Accepted: 03/20/2023] [Indexed: 04/14/2023]
Abstract
COVID-19 is an infectious disease that presents unprecedented challenges to society. Accurately estimating the incubation period of the coronavirus is critical for effective prevention and control. However, the exact incubation period remains unclear, as COVID-19 symptoms can appear in as little as 2 days or as long as 14 days or more after exposure. Accurate estimation requires original chain-of-infection data, which may not be fully available from the original outbreak in Wuhan, China. In this study, we estimated the incubation period of COVID-19 by leveraging well-documented and epidemiologically informative chain-of-infection data collected from 10 regions outside the original Wuhan area prior to February 10, 2020. We employed a proposed Monte Carlo simulation approach and nonparametric methods to estimate the incubation period of COVID-19. We also utilized manifold learning and related statistical analysis to uncover incubation relationships between different age and gender groups. Our findings revealed that the incubation period of COVID-19 did not follow general distributions such as lognormal, Weibull, or Gamma. Using the proposed Monte Carlo simulations and nonparametric bootstrap methods, we estimated the mean and median incubation periods as 5.84 days (95% CI, 5.42-6.25 days) and 5.01 days (95% CI, 4.00-6.00 days), respectively. We also found that the incubation periods of the group aged 40 years or older and the group younger than 40 years differed with statistical significance. The former group had a longer incubation period and a larger variance than the latter, suggesting the need for different quarantine times or medical intervention strategies. Our machine-learning results further demonstrated that the two age groups were linearly separable, consistent with previous statistical analyses. Additionally, our results indicated that the incubation period difference between males and females was not statistically significant.
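The abstract reports confidence intervals obtained with nonparametric bootstrap methods. As a rough illustration, a percentile bootstrap for the mean and median can be sketched as below. This is a generic textbook version, not the authors' Monte Carlo procedure; the function name `bootstrap_ci` and the incubation values in the demo are made up for illustration and are not data from the study.

```python
import random
import statistics


def bootstrap_ci(sample, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement n_boot times and recompute the statistic.
    estimates = sorted(
        stat([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return stat(sample), lower, upper


if __name__ == "__main__":
    # Hypothetical incubation periods in days (illustrative only).
    days = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 12, 14]
    for name, stat in (("mean", statistics.mean), ("median", statistics.median)):
        est, lo, hi = bootstrap_ci(days, stat)
        print(f"{name}: {est:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

The percentile method makes no distributional assumption, which matches the abstract's finding that the data did not follow lognormal, Weibull, or Gamma distributions.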
Affiliation(s)
- Ke Men
- Institute for Research on Health Information and Technology, School of Public Health, Xi'an Medical University, Xi'an, Shaanxi, 710021, China
- Yihao Li
- The Gabelli School of Business, Fordham University, Lincoln Center, New York, NY, 10023, USA
- Xia Wang
- The Air Force Military Medical University, Xi'an, Shaanxi, 710032, China
- Guangwei Zhang
- Institute for Research on Health Information and Technology, School of Public Health, Xi'an Medical University, Xi'an, Shaanxi, 710021, China
- Jingjing Hu
- Institute for Research on Health Information and Technology, School of Public Health, Xi'an Medical University, Xi'an, Shaanxi, 710021, China
- Yanyan Gao
- Institute for Research on Health Information and Technology, School of Public Health, Xi'an Medical University, Xi'an, Shaanxi, 710021, China
- Ashley Han
- The Skyline High School, Ann Arbor, MI, 48103, USA
- Wenbin Liu
- Institute of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China.
- Henry Han
- The Laboratory of Data Science and Artificial Intelligence Innovation, Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, 76798, USA.
4. Hornung AL, Hornung CM, Mallow GM, Barajas JN, Espinoza Orías AA, Galbusera F, Wilke HJ, Colman M, Phillips FM, An HS, Samartzis D. Artificial intelligence and spine imaging: limitations, regulatory issues and future direction. Eur Spine J 2022; 31:2007-2021. [PMID: 35084588 DOI: 10.1007/s00586-021-07108-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/22/2021] [Revised: 11/29/2021] [Accepted: 12/30/2021] [Indexed: 01/20/2023]
Abstract
BACKGROUND: As big data and artificial intelligence (AI) in spine care, and medicine as a whole, continue to be at the forefront of research, careful consideration of the quality and techniques utilized is necessary. Predictive modeling, data science, and deep analytics have taken center stage. Within that space, AI and machine learning (ML) approaches toward the use of spine imaging have gathered considerable attention in the past decade. Although several benefits of such applications exist, limitations are also present and need to be considered.
PURPOSE: The following narrative review presents the current status of AI, in particular ML, with special regard to imaging studies, in the field of spinal research.
METHODS: A multi-database assessment of the literature addressing AI as it relates to imaging of the spine was conducted up to September 1, 2021. Articles written in English were selected and critically assessed.
RESULTS: Overall, the review discussed the limitations, data quality and applications of ML models in the context of spine imaging. In particular, we addressed the data quality and ML algorithms in spine imaging research by describing preliminary results from a widely accessible imaging algorithm that is currently available for spine specialists to reference for information on the severity of spine disease and degeneration, which ultimately may alter clinical decision-making. In addition, awareness of the current, under-recognized regulation surrounding the execution of ML for spine imaging was raised.
CONCLUSIONS: Recommendations were provided for conducting high-quality, standardized AI applications for spine imaging.
Affiliation(s)
- Alexander L Hornung
- Department of Orthopaedic Surgery, Rush University Medical Center, Orthopaedic Building, Suite 204-G, 1611 W. Harrison Street, Chicago, IL, 60612, USA
- G Michael Mallow
- Department of Orthopaedic Surgery, Rush University Medical Center, Orthopaedic Building, Suite 204-G, 1611 W. Harrison Street, Chicago, IL, 60612, USA
- J Nicolas Barajas
- Department of Orthopaedic Surgery, Rush University Medical Center, Orthopaedic Building, Suite 204-G, 1611 W. Harrison Street, Chicago, IL, 60612, USA
- Alejandro A Espinoza Orías
- Department of Orthopaedic Surgery, Rush University Medical Center, Orthopaedic Building, Suite 204-G, 1611 W. Harrison Street, Chicago, IL, 60612, USA
- Hans-Joachim Wilke
- Institute of Orthopaedic Research and Biomechanics, Trauma Research Center Ulm, Ulm University, Ulm, Germany
- Matthew Colman
- Department of Orthopaedic Surgery, Rush University Medical Center, Orthopaedic Building, Suite 204-G, 1611 W. Harrison Street, Chicago, IL, 60612, USA
- Frank M Phillips
- Department of Orthopaedic Surgery, Rush University Medical Center, Orthopaedic Building, Suite 204-G, 1611 W. Harrison Street, Chicago, IL, 60612, USA
- Howard S An
- Department of Orthopaedic Surgery, Rush University Medical Center, Orthopaedic Building, Suite 204-G, 1611 W. Harrison Street, Chicago, IL, 60612, USA
- Dino Samartzis
- Department of Orthopaedic Surgery, Rush University Medical Center, Orthopaedic Building, Suite 204-G, 1611 W. Harrison Street, Chicago, IL, 60612, USA.
5.
6. Poirion OB, Jing Z, Chaudhary K, Huang S, Garmire LX. DeepProg: an ensemble of deep-learning and machine-learning models for prognosis prediction using multi-omics data. Genome Med 2021; 13:112. [PMID: 34261540 PMCID: PMC8281595 DOI: 10.1186/s13073-021-00930-x] [Citation(s) in RCA: 120] [Impact Index Per Article: 30.0] [Received: 01/11/2021] [Accepted: 06/25/2021] [Indexed: 12/17/2022] [Open access]
Abstract
Multi-omics data are good resources for prognosis and survival prediction; however, these are difficult to integrate computationally. We introduce DeepProg, a novel ensemble framework of deep-learning and machine-learning approaches that robustly predicts patient survival subtypes using multi-omics data. It identifies two optimal survival subtypes in most cancers and yields significantly better risk-stratification than other multi-omics integration methods. DeepProg is highly predictive, exemplified by two liver cancer (C-index 0.73-0.80) and five breast cancer datasets (C-index 0.68-0.73). Pan-cancer analysis associates common genomic signatures in poor survival subtypes with extracellular matrix modeling, immune deregulation, and mitosis processes. DeepProg is freely available at https://github.com/lanagarmire/DeepProg.
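The C-index values quoted above measure how well predicted risk orders patients by survival time. The sketch below computes Harrell's concordance index in its simplest pairwise form, to show what the metric counts. It is not taken from the DeepProg code base; the function name `concordance_index` and the toy values are assumptions for illustration.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index computed by brute-force pair counting.

    times  : observed follow-up times
    events : 1 if the event (e.g. death) was observed, 0 if censored
    risks  : predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when the earlier time is an observed event.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data: four patients, the third one censored (illustrative only).
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.7, 0.5, 0.1]
    print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the reported ranges of 0.68 to 0.80.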
Affiliation(s)
- Olivier B Poirion
- Current address: Computational Sciences, The Jackson Laboratory, 10 Discovery Drive, Farmington, Connecticut, 06032, USA
- University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Zheng Jing
- Current address: Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48105, USA
- Kumardeep Chaudhary
- University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Current address: Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY, 10029, USA
- Sijia Huang
- University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Current address: Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Lana X Garmire
- University of Hawaii Cancer Center, Honolulu, HI, 96813, USA.
- Current address: Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48105, USA.
7. Lin X, Li C, Ren W, Luo X, Qi Y. A new feature selection method based on symmetrical uncertainty and interaction gain. Comput Biol Chem 2019; 83:107149. [DOI: 10.1016/j.compbiolchem.2019.107149] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 05/29/2018] [Revised: 06/06/2019] [Accepted: 10/15/2019] [Indexed: 12/15/2022]
8. Bravo-Merodio L, Williams JA, Gkoutos GV, Acharjee A. -Omics biomarker identification pipeline for translational medicine. J Transl Med 2019; 17:155. [PMID: 31088492 PMCID: PMC6518609 DOI: 10.1186/s12967-019-1912-5] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 10/06/2018] [Accepted: 05/08/2019] [Indexed: 01/31/2023] [Open access]
Abstract
BACKGROUND: Translational medicine (TM) is an emerging domain that aims to facilitate medical or biological advances efficiently from the scientist to the clinician. Central to the TM vision is to narrow the gap between basic science and applied science in terms of time, cost and early diagnosis of the disease state. Biomarker identification is one of the main challenges within TM. The identification of disease biomarkers from -omics data will not only help the stratification of diverse patient cohorts but will also provide early diagnostic information which could improve patient management and potentially prevent adverse outcomes. However, biomarker identification needs to be robust and reproducible. Hence a robust unbiased computational framework that can help clinicians identify those biomarkers is necessary.
METHODS: We developed a pipeline (workflow) that includes two different supervised classification techniques based on regularization methods to identify biomarkers from -omics or other high dimension clinical datasets. The pipeline includes several important steps such as quality control and stability of selected biomarkers. The process takes input files (outcome and independent variables or -omics data) and pre-processes (normalization, missing values) them. After a random division of samples into training and test sets, Least Absolute Shrinkage and Selection Operator and Elastic Net feature selection methods are applied to identify the most important features representing potential biomarker candidates. The penalization parameters are optimised using 10-fold cross validation and the process undergoes 100 iterations and a combinatorial analysis to select the best performing multivariate model. An empirical unbiased assessment of their quality as biomarkers for clinical use is performed through a Receiver Operating Characteristic curve and its Area Under the Curve analysis on both permuted and real data for 1000 different randomized training and test sets. We validated this pipeline against previously published biomarkers.
RESULTS: We applied this pipeline to three different datasets with previously published biomarkers: lipidomics data by Acharjee et al. (Metabolomics 13:25, 2017) and transcriptomics data by Rajamani and Bhasin (Genome Med 8:38, 2016) and Mills et al. (Blood 114:1063-1072, 2009). Our results demonstrate that our method was able to identify both previously published biomarkers as well as new variables that add value to the published results.
CONCLUSIONS: We developed a robust pipeline to identify clinically relevant biomarkers that can be applied to different -omics datasets. Such identification reveals potentially novel drug targets and can be used as a part of a machine-learning based patient stratification framework in the translational medicine settings.
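The evaluation step in the methods, comparing AUC on real labels against AUC on permuted labels, can be sketched as follows. This covers only that comparison, not the LASSO or Elastic Net feature selection or the cross-validation loop; the function names `roc_auc` and `permutation_p_value`, the permutation count, and the toy scores are assumptions for illustration rather than details of the published pipeline.

```python
import random


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores, n_perm=1000, seed=0):
    """Share of label permutations whose AUC reaches the observed AUC."""
    rng = random.Random(seed)
    observed = roc_auc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if roc_auc(shuffled, scores) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy classifier scores for five controls and five cases (illustrative only).
    labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    scores = [0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90, 0.95]
    auc, p = permutation_p_value(labels, scores)
    print(f"AUC = {auc:.2f}, permutation p = {p:.4f}")
```

If a candidate biomarker panel scores no better on real labels than on shuffled ones, its apparent AUC is likely an artifact of the data split rather than a real signal.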
Affiliation(s)
- Laura Bravo-Merodio
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT UK
- John A. Williams
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT UK
- Mammalian Genetics Unit, Medical Research Council Harwell Institute, Harwell Campus, Didcot, OX11 0RD UK
- Georgios V. Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT UK
- MRC Health Data Research UK (HDR UK), London, UK
- NIHR Experimental Cancer Medicine Centre, Birmingham, B15 2TT UK
- NIHR Surgical Reconstruction and Microbiology Research Centre, Birmingham, B15 2TT UK
- NIHR Biomedical Research Centre, Birmingham, B15 2TT UK
- Animesh Acharjee
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT UK
- NIHR Surgical Reconstruction and Microbiology Research Centre, Birmingham, B15 2TT UK
9. How does normalization impact RNA-seq disease diagnosis? J Biomed Inform 2018; 85:80-92. [PMID: 30041017 DOI: 10.1016/j.jbi.2018.07.016] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 11/10/2017] [Revised: 07/07/2018] [Accepted: 07/14/2018] [Indexed: 12/18/2022]
Abstract
With the surge of next-generation high-throughput technologies, RNA-seq data is playing an increasingly important role in disease diagnosis, in which normalization is assumed to be an essential procedure to produce comparable samples. Recent studies have proposed different normalization methods to remove various technical biases in RNA sequencing. However, no previous studies have evaluated the impact of normalization on RNA-seq disease diagnosis. In this study, we investigate this problem by analyzing structured big data: RNA-seq data acquired from the TCGA portal, chosen for its popularity in RNA-seq disease diagnosis. We propose a novel normalization effect test algorithm, a diagnostic index (d-index), and data entropy to analyze and evaluate the impact of normalization on RNA-seq disease diagnosis using state-of-the-art machine learning models. Furthermore, we present an original visualization analysis to compare the performance of normalized data versus raw data. We have found that normalized data generally yields diagnostic performance equivalent to or even lower than that of its raw data. Moreover, some normalization approaches (e.g. RPKM) even have negative effects on disease diagnosis. On the other hand, raw data seems able to decipher pathological status better than, or at least comparably to, normalized data. Our visualization analysis also shows that some normalization methods even introduce 'outliers', which unavoidably decrease sample detectability in diagnosis. More importantly, our data entropy analysis shows that normalized data usually demonstrates equivalent or lower entropy values than raw data. Data with high entropy values tend to achieve better diagnosis than data with low entropy values. In addition, we found that high-dimensional imbalance (HDI) data is unaffected by any normalization procedure in diagnosis, and causes almost all machine learning models to fail by recognizing only the majority type, whether the data is raw or normalized.
Our results suggest that normalized data may not demonstrate statistically significant advantages in disease diagnosis over its raw form. This further implies that normalization, or at least some normalization processes, may not be an indispensable procedure in RNA-seq disease diagnosis. Instead, raw data may perform better by capturing more of the original transcriptome patterns in different pathological conditions.
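For readers unfamiliar with RPKM, the normalization singled out in the abstract, the standard definition is reads per kilobase of transcript per million mapped reads. The sketch below applies that textbook formula to a single sample. The function name `rpkm` and the toy counts are illustrative, and it assumes that the counts passed in account for all mapped reads of the sample; it is not code from the study.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads for one sample.

    counts     : raw read counts per gene
    lengths_bp : gene (transcript) lengths in base pairs, same order
    Assumes sum(counts) equals the total number of mapped reads.
    """
    total_reads = sum(counts)
    if total_reads == 0:
        raise ValueError("sample has no mapped reads")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return [c * 1e9 / (length * total_reads)
            for c, length in zip(counts, lengths_bp)]


if __name__ == "__main__":
    # Three hypothetical genes in one sample (illustrative only).
    counts = [100, 300, 600]
    lengths_bp = [1000, 2000, 3000]
    for value in rpkm(counts, lengths_bp):
        print(f"{value:.1f}")
```

Because each value is divided by the sample's total read count, RPKM changes the relative scale between samples, which is one way a normalization step can alter what a downstream classifier sees.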
10. Han H. A novel feature selection for RNA-seq analysis. Comput Biol Chem 2017; 71:245-257. [DOI: 10.1016/j.compbiolchem.2017.10.010] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 10/26/2017] [Accepted: 10/27/2017] [Indexed: 12/17/2022]