1
Su G, Wang H, Zhang Y, Wilkins MR, Canete PF, Yu D, Yang Y, Zhang W. Inferring gene regulatory networks by hypergraph generative model. Cell Reports Methods 2025; 5:101026. [PMID: 40220759 DOI: 10.1016/j.crmeth.2025.101026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2024] [Revised: 01/16/2025] [Accepted: 03/20/2025] [Indexed: 04/14/2025]
Abstract
We present hypergraph variational autoencoder (HyperG-VAE), a Bayesian deep generative model that leverages hypergraph representation to model single-cell RNA sequencing (scRNA-seq) data. The model features a cell encoder with a structural equation model to account for cellular heterogeneity and construct gene regulatory networks (GRNs) alongside a gene encoder using hypergraph self-attention to identify gene modules. The synergistic optimization of encoders via a decoder improves GRN inference, single-cell clustering, and data visualization, as validated by benchmarks. HyperG-VAE effectively uncovers gene regulation patterns and demonstrates robustness in downstream analyses, as shown in B cell development data from bone marrow. Gene set enrichment analysis of overlapping genes in predicted GRNs confirms the gene encoder's role in refining GRN inference. Offering an efficient solution for scRNA-seq analysis and GRN construction, HyperG-VAE also holds the potential for extending GRN modeling to temporal and multimodal single-cell omics.
Affiliation(s)
- Guangxin Su
  - School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW, Australia; ARC Centre of Excellence for the Mathematical Analysis of Cellular Systems (MACSYS), Melbourne, VIC, Australia
- Hanchen Wang
  - ARC Centre of Excellence for the Mathematical Analysis of Cellular Systems (MACSYS), Melbourne, VIC, Australia; Australian Artificial Intelligence Institute, The University of Technology Sydney, Sydney, NSW, Australia
- Ying Zhang
  - School of Computer Science and Technology, Zhejiang Gongshang University, Zhejiang, China
- Marc R Wilkins
  - ARC Centre of Excellence for the Mathematical Analysis of Cellular Systems (MACSYS), Melbourne, VIC, Australia; Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, The University of New South Wales, Sydney, NSW, Australia
- Pablo F Canete
  - Frazer Institute, Faculty of Health, Medicine and Behaviour Sciences, The University of Queensland, Brisbane, QLD, Australia
- Di Yu
  - Frazer Institute, Faculty of Health, Medicine and Behaviour Sciences, The University of Queensland, Brisbane, QLD, Australia; Ian Frazer Centre for Children's Immunotherapy Research, Child Health Research Centre, Faculty of Health, Medicine and Behaviour Sciences, The University of Queensland, Brisbane, QLD, Australia
- Yang Yang
  - Frazer Institute, Faculty of Health, Medicine and Behaviour Sciences, The University of Queensland, Brisbane, QLD, Australia
- Wenjie Zhang
  - School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW, Australia; ARC Centre of Excellence for the Mathematical Analysis of Cellular Systems (MACSYS), Melbourne, VIC, Australia
2
Asim MN, Ibrahim MA, Zaib A, Dengel A. DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models. Front Med (Lausanne) 2025; 12:1503229. [PMID: 40265190 PMCID: PMC12011883 DOI: 10.3389/fmed.2025.1503229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Accepted: 03/10/2025] [Indexed: 04/24/2025] Open
Abstract
Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs the development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors, including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, and can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's genetic blueprint and in understanding the factors that can modify it. This analysis helps in the early detection of genetic diseases and the design of targeted therapies. DNA sequence analysis through traditional wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for a comprehensive body of literature that bridges the gap between the two fields, the contributions of this paper are manifold. It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI paradigms, namely classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis tasks by consolidating information on 36 diverse biological databases that can be used to develop benchmark datasets for the 44 DNA sequence analysis tasks. To support performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to the 44 tasks. It presents applications of word embeddings and language models across the 44 tasks. It streamlines the development of new predictors by surveying the performance of predictive pipelines based on 39 word embeddings and 67 language models, as well as the top-performing traditional sequence-encoding-based predictors, across the 44 tasks.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Muhammad Ali Ibrahim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Arooj Zaib
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
3
Li R, Xu S, Li Y, Tang Z, Feng D, Cai J, Ma S. Incorporating prior information in gene expression network-based cancer heterogeneity analysis. Biostatistics 2024; 26:kxae028. [PMID: 39074174 DOI: 10.1093/biostatistics/kxae028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Revised: 07/04/2024] [Accepted: 07/08/2024] [Indexed: 07/31/2024] Open
Abstract
Cancer is molecularly heterogeneous, with seemingly similar patients having different molecular landscapes and, accordingly, different clinical behaviors. In recent studies, gene expression networks have been shown to be more effective and informative for cancer heterogeneity analysis than some simpler measures. Gene interconnections can be classified as "direct" and "indirect," where the latter can be caused by shared genomic regulators (such as transcription factors, microRNAs, and other regulatory molecules) and other mechanisms. It has been suggested that incorporating the regulators of gene expression in network analysis and focusing on the direct interconnections can lead to a deeper understanding of the more essential gene interconnections. Such analysis can be seriously challenged by the large number of parameters (jointly caused by network analysis, incorporation of regulators, and heterogeneity) and by often weak signals. To effectively tackle this problem, we propose incorporating prior information contained in the published literature. A key challenge is that such prior information can be partial or even wrong. We develop a two-step procedure that can flexibly accommodate different levels of prior information quality. Simulation demonstrates the effectiveness of the proposed approach and its superiority over relevant competitors. In the analysis of a breast cancer dataset, the proposed approach yields findings that differ from those of the alternatives, and the identified sample subgroups have important clinical differences.
Affiliation(s)
- Rong Li
  - Department of Biostatistics, Yale School of Public Health, 60 College Street, New Haven, 06511, CT, United States
- Shaodong Xu
  - Center for Applied Statistics and School of Statistics, Renmin University of China, 59 Zhongguancun Street, 100872, Beijing, China
- Yang Li
  - Center for Applied Statistics and School of Statistics, Renmin University of China, 59 Zhongguancun Street, 100872, Beijing, China
- Zuojian Tang
  - Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharmaceuticals Inc., 900 Ridgebury Road, Ridgefield, 06877, CT, United States
- Di Feng
  - Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharmaceuticals Inc., 900 Ridgebury Road, Ridgefield, 06877, CT, United States
- James Cai
  - Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharmaceuticals Inc., 900 Ridgebury Road, Ridgefield, 06877, CT, United States
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, 60 College Street, New Haven, 06511, CT, United States
4
Tarzi C, Zampieri G, Sullivan N, Angione C. Emerging methods for genome-scale metabolic modeling of microbial communities. Trends Endocrinol Metab 2024; 35:533-548. [PMID: 38575441 DOI: 10.1016/j.tem.2024.02.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 02/28/2024] [Accepted: 02/29/2024] [Indexed: 04/06/2024]
Abstract
Genome-scale metabolic models (GEMs) are consolidating as platforms for studying mixed microbial populations, by combining biological data and knowledge with mathematical rigor. However, deploying these models to answer research questions can be challenging due to the increasing number of available computational tools, the lack of universal standards, and their inherent limitations. Here, we present a comprehensive overview of foundational concepts for building and evaluating genome-scale models of microbial communities. We then compare tools in terms of requirements, capabilities, and applications. Next, we highlight the current pitfalls and open challenges to consider when adopting existing tools and developing new ones. Our compendium can be relevant for the expanding community of modelers, both at the entry and experienced levels.
Affiliation(s)
- Chaimaa Tarzi
  - School of Computing, Engineering and Digital Technologies, Teesside University, Southfield Rd, Middlesbrough, TS1 3BX, North Yorkshire, UK
- Guido Zampieri
  - Department of Biology, University of Padova, Padova, 35122, Veneto, Italy
- Neil Sullivan
  - Complement Genomics Ltd, Station Rd, Lanchester, Durham, DH7 0EX, County Durham, UK
- Claudio Angione
  - School of Computing, Engineering and Digital Technologies, Teesside University, Southfield Rd, Middlesbrough, TS1 3BX, North Yorkshire, UK; Centre for Digital Innovation, Teesside University, Southfield Rd, Middlesbrough, TS1 3BX, North Yorkshire, UK; National Horizons Centre, Teesside University, 38 John Dixon Ln, Darlington, DL1 1HG, North Yorkshire, UK
5
Hasibi R, Michoel T, Oyarzún DA. Integration of graph neural networks and genome-scale metabolic models for predicting gene essentiality. NPJ Syst Biol Appl 2024; 10:24. [PMID: 38448436 PMCID: PMC10917767 DOI: 10.1038/s41540-024-00348-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Accepted: 02/08/2024] [Indexed: 03/08/2024] Open
Abstract
Genome-scale metabolic models are powerful tools for understanding cellular physiology. Flux balance analysis (FBA), in particular, is an optimization-based approach widely employed for predicting metabolic phenotypes. In model microbes such as Escherichia coli, FBA has been successful at predicting essential genes, i.e. those genes that impair survival when deleted. A central assumption in this approach is that both wild type and deletion strains optimize the same fitness objective. Although the optimality assumption may hold for the wild type metabolic network, deletion strains are not subject to the same evolutionary pressures and knock-out mutants may steer their metabolism to meet other objectives for survival. Here, we present FlowGAT, a hybrid FBA-machine learning strategy for predicting essentiality directly from wild type metabolic phenotypes. The approach is based on graph-structured representation of metabolic fluxes predicted by FBA, where nodes correspond to enzymatic reactions and edges quantify the propagation of metabolite mass flow between a reaction and its neighbours. We integrate this information into a graph neural network that can be trained on knock-out fitness assay data. Comparisons across different model architectures reveal that FlowGAT predictions for E. coli are close to those of FBA for several growth conditions. This suggests that essentiality of enzymatic genes can be predicted by exploiting the inherent network structure of metabolism. Our approach demonstrates the benefits of combining the mechanistic insights afforded by genome-scale models with the ability of deep learning to infer patterns from complex datasets.
Affiliation(s)
- Ramin Hasibi
  - Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
- Tom Michoel
  - Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
- Diego A Oyarzún
  - School of Biological Sciences, University of Edinburgh, Edinburgh, UK
  - School of Informatics, University of Edinburgh, Edinburgh, UK
6
Gopalakrishnan S, Johnson W, Valderrama-Gomez MA, Icten E, Tat J, Ingram M, Fung Shek C, Chan PK, Schlegel F, Rolandi P, Kontoravdi C, Lewis NE. COSMIC-dFBA: A novel multi-scale hybrid framework for bioprocess modeling. Metab Eng 2024; 82:183-192. [PMID: 38387677 DOI: 10.1016/j.ymben.2024.02.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Revised: 02/01/2024] [Accepted: 02/19/2024] [Indexed: 02/24/2024]
Abstract
Metabolism governs cell performance in biomanufacturing, as it fuels growth and productivity. However, even in well-controlled culture systems, metabolism is dynamic, with shifting objectives and resources, which limits the predictive capability of mechanistic models for process design and optimization. Here, we present Cellular Objectives and State Modulation In bioreaCtors (COSMIC)-dFBA, a hybrid multi-scale modeling paradigm that accurately predicts cell density, antibody titer, and bioreactor metabolite concentration profiles. Using machine learning, COSMIC-dFBA decomposes the instantaneous metabolite uptake and secretion rates in a bioreactor into weighted contributions from each cell state (growth or antibody-producing state) and integrates these with a genome-scale metabolic model. A major strength of COSMIC-dFBA is that it can be parameterized with only metabolite concentrations from spent media, although constraining the metabolic model with other omics data can further improve its capabilities. Using COSMIC-dFBA, we can predict the final cell density and antibody titer to within 10% of the measured data; compared to a standard dFBA model, the framework showed a 90% and 72% improvement in cell density and antibody titer prediction, respectively. Thus, we demonstrate that our hybrid modeling framework effectively captures cellular metabolism and expands the applicability of dFBA to model the dynamic conditions in a bioreactor.
Affiliation(s)
- Cleo Kontoravdi
  - Department of Chemical Engineering, Imperial College London, UK
- Nathan E Lewis
  - Department of Pediatrics, University of California San Diego, USA; Department of Bioengineering, University of California San Diego, USA
7
Marcos-Zambrano LJ, López-Molina VM, Bakir-Gungor B, Frohme M, Karaduzovic-Hadziabdic K, Klammsteiner T, Ibrahimi E, Lahti L, Loncar-Turukalo T, Dhamo X, Simeon A, Nechyporenko A, Pio G, Przymus P, Sampri A, Trajkovik V, Lacruz-Pleguezuelos B, Aasmets O, Araujo R, Anagnostopoulos I, Aydemir Ö, Berland M, Calle ML, Ceci M, Duman H, Gündoğdu A, Havulinna AS, Kaka Bra KHN, Kalluci E, Karav S, Lode D, Lopes MB, May P, Nap B, Nedyalkova M, Paciência I, Pasic L, Pujolassos M, Shigdel R, Susín A, Thiele I, Truică CO, Wilmes P, Yilmaz E, Yousef M, Claesson MJ, Truu J, Carrillo de Santa Pau E. A toolbox of machine learning software to support microbiome analysis. Front Microbiol 2023; 14:1250806. [PMID: 38075858 PMCID: PMC10704913 DOI: 10.3389/fmicb.2023.1250806] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/30/2023] [Accepted: 09/11/2023] [Indexed: 05/14/2025] Open
Abstract
The human microbiome has become an area of intense research due to its potential impact on human health. However, the analysis and interpretation of microbiome data have proven to be challenging due to their complexity and high dimensionality. Machine learning (ML) algorithms can process vast amounts of data to uncover informative patterns and relationships within the data, even with limited prior knowledge. Therefore, there has been rapid growth in the development of software specifically designed for the analysis and interpretation of microbiome data using ML techniques. These software tools incorporate a wide range of ML algorithms for clustering, classification, regression, or feature selection to identify microbial patterns and relationships within the data and to generate predictive models. This rapid development, with its constant need for new developments and the integration of new features, requires efforts to compile, catalog, and classify these tools in order to create infrastructures and services with easy, transparent, and trustworthy standards. Here we review the state of the art for ML tools applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on ML-based software and framework resources currently available for the analysis of microbiome data in humans. The aim is to help microbiologists and biomedical scientists go deeper into specialized resources that integrate ML techniques and to facilitate future benchmarking to create standards for the analysis of microbiome data. The software resources are organized based on the type of analysis they were developed for and the ML techniques they implement. A description of each software tool with examples of usage is provided, including comments on the pitfalls and gaps in the use of ML-based software for microbiome data that developers and users need to consider. This review represents an extensive compilation to date, offering valuable insights and guidance for researchers interested in leveraging ML approaches for microbiome analysis.
Affiliation(s)
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Víctor Manuel López-Molina
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Burcu Bakir-Gungor
  - Department of Computer Engineering, Abdullah Gül University, Kayseri, Türkiye
- Marcus Frohme
  - Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
- Thomas Klammsteiner
  - Department of Microbiology and Department of Ecology, University of Innsbruck, Innsbruck, Austria
- Eliana Ibrahimi
  - Department of Biology, University of Tirana, Tirana, Albania
- Leo Lahti
  - Department of Computing, University of Turku, Turku, Finland
- Xhilda Dhamo
  - Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Andrea Simeon
  - BioSense Institute, University of Novi Sad, Novi Sad, Serbia
- Alina Nechyporenko
  - Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
  - Department of Systems Engineering, Kharkiv National University of Radioelectronics, Kharkiv, Ukraine
- Gianvito Pio
  - Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
  - Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy
- Piotr Przymus
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Alexia Sampri
  - Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, United Kingdom
- Vladimir Trajkovik
  - Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Blanca Lacruz-Pleguezuelos
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Oliver Aasmets
  - Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
  - Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Ricardo Araujo
  - Nephrology and Infectious Diseases R & D Group, i3S—Instituto de Investigação e Inovação em Saúde; INEB—Instituto de Engenharia Biomédica, Universidade do Porto, Porto, Portugal
- Ioannis Anagnostopoulos
  - Department of Informatics, University of Piraeus, Piraeus, Greece
  - Computer Science and Biomedical Informatics Department, University of Thessaly, Lamia, Greece
- Önder Aydemir
  - Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Türkiye
- Magali Berland
  - INRAE, MetaGenoPolis, Université Paris-Saclay, Jouy-en-Josas, France
- M. Luz Calle
  - Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain
  - IRIS-CC, Fundació Institut de Recerca i Innovació en Ciències de la Vida i la Salut a la Catalunya Central, Vic, Barcelona, Spain
- Michelangelo Ceci
  - Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
  - Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy
- Hatice Duman
  - Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye
- Aycan Gündoğdu
  - Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Türkiye
  - Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Türkiye
- Aki S. Havulinna
  - Finnish Institute for Health and Welfare - THL, Helsinki, Finland
  - Institute for Molecular Medicine Finland, FIMM-HiLIFE, Helsinki, Finland
- Eglantina Kalluci
  - Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Sercan Karav
  - Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye
- Daniel Lode
  - Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
- Marta B. Lopes
  - Department of Mathematics, Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
  - UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
- Patrick May
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Bram Nap
  - School of Medicine, University of Galway, Galway, Ireland
- Miroslava Nedyalkova
  - Department of Inorganic Chemistry, Faculty of Chemistry and Pharmacy, University of Sofia, Sofia, Bulgaria
- Inês Paciência
  - Center for Environmental and Respiratory Health Research (CERH), Research Unit of Population Health, University of Oulu, Oulu, Finland
  - Biocenter Oulu, University of Oulu, Oulu, Finland
- Lejla Pasic
  - Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Meritxell Pujolassos
  - Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Antonio Susín
  - Mathematical Department, UPC-Barcelona Tech, Barcelona, Spain
- Ines Thiele
  - School of Medicine, University of Galway, Galway, Ireland
  - APC Microbiome Ireland, University College Cork, Cork, Ireland
- Ciprian-Octavian Truică
  - Computer Science and Engineering Department, Faculty of Automatic Control and Computers, National University of Science and Technology Politehnica, Bucharest, Romania
- Paul Wilmes
  - Systems Ecology Group, Luxembourg Centre for Systems Biomedicine, Esch-sur-Alzette, Luxembourg
  - Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg, Belvaux, Luxembourg
- Ercument Yilmaz
  - Department of Computer Technologies, Karadeniz Technical University, Trabzon, Türkiye
- Malik Yousef
  - Department of Information Systems, Zefat Academic College, Zefat, Israel
  - Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
- Marcus Joakim Claesson
  - APC Microbiome Ireland, University College Cork, Cork, Ireland
  - School of Microbiology, University College Cork, Cork, Ireland
- Jaak Truu
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
8
Valizadeh G, Babapour Mofrad F. Parametrized pre-trained network (PPNet): A novel shape classification method using SPHARMs for MI detection. Expert Systems with Applications 2023; 228:120368. [DOI: 10.1016/j.eswa.2023.120368] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/11/2024]
9
Du P, Niu X, Li X, Ying C, Zhou Y, He C, Lv S, Liu X, Du W, Wu W. Automatically transferring supervised targets method for segmenting lung lesion regions with CT imaging. BMC Bioinformatics 2023; 24:332. [PMID: 37667214 PMCID: PMC10478337 DOI: 10.1186/s12859-023-05435-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2023] [Accepted: 08/02/2023] [Indexed: 09/06/2023] Open
Abstract
BACKGROUND: We present an approach that automatically identifies and selects the optimal supervision target in order to improve learning efficiency when segmenting infected lung regions from chest computed tomography images. We designed a semi-supervised dual-branch training framework in which the training set consisted of a limited amount of expert-annotated data and a large amount of coarsely annotated data that was automatically segmented based on Hu values; these data were used to train both the strong and weak branches. In addition, we employed the Lovasz scoring method to automatically switch the supervision target in the weak branch and select the optimal target for training. The method can use noisy labels for rapid localization during the early stages of training and gradually shift to more accurate targets as training progresses. It can exploit a large number of samples that require no manual annotation, and as training iterates, the noisy supervision targets move progressively closer to the finely annotated data, which significantly improves the accuracy of the final model. RESULTS: The proposed semi-supervised dual-branch deep learning network, together with cost-effective samples, achieved mean Dice similarity coefficients (DSC) of 83.56 ± 12.10 and 82.67 ± 8.04 on our internal and external test benchmarks, respectively. In experimental comparisons, the DSC of the proposed algorithm was higher by 13.54% and 2.02% on the internal benchmark and by 13.37% and 2.13% on the external benchmark than that of U-Net without extra sample assistance and the state-of-the-art mean-teacher algorithm, respectively. CONCLUSION: The cost-effective pseudolabeled samples assisted the training of deep learning (DL) models and achieved much better results than traditional DL models trained with manually labeled samples only. Furthermore, our method achieved the best performance among recent dual-branch structures.
Collapse
Affiliation(s)
- Peng Du
- Hangzhou AiSmartIoT Co., Ltd., Hangzhou, Zhejiang, China
| | - Xiaofeng Niu
- Artificial Intelligence Lab, Hangzhou AiSmartVision Co., Ltd., Hangzhou, Zhejiang, China
| | - Xukun Li
- Artificial Intelligence Lab, Hangzhou AiSmartVision Co., Ltd., Hangzhou, Zhejiang, China
| | - Chiqing Ying
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, The First Affiliated Hospital, School of Medicine, Zhejiang University, 79 QingChun Road, Hangzhou, 310003, Zhejiang, China
| | - Yukun Zhou
- Artificial Intelligence Lab, Hangzhou AiSmartVision Co., Ltd., Hangzhou, Zhejiang, China
| | - Chang He
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, The First Affiliated Hospital, School of Medicine, Zhejiang University, 79 QingChun Road, Hangzhou, 310003, Zhejiang, China
| | - Shuangzhi Lv
- Department of Radiology The First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
| | - Xiaoli Liu
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, The First Affiliated Hospital, School of Medicine, Zhejiang University, 79 QingChun Road, Hangzhou, 310003, Zhejiang, China
| | - Weibo Du
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, The First Affiliated Hospital, School of Medicine, Zhejiang University, 79 QingChun Road, Hangzhou, 310003, Zhejiang, China.
| | - Wei Wu
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, The First Affiliated Hospital, School of Medicine, Zhejiang University, 79 QingChun Road, Hangzhou, 310003, Zhejiang, China.
|
10
|
Dou B, Zhu Z, Merkurjev E, Ke L, Chen L, Jiang J, Zhu Y, Liu J, Zhang B, Wei GW. Machine Learning Methods for Small Data Challenges in Molecular Science. Chem Rev 2023; 123:8736-8780. [PMID: 37384816 PMCID: PMC10999174 DOI: 10.1021/acs.chemrev.3c00189] [Citation(s) in RCA: 79] [Impact Index Per Article: 39.5] [Indexed: 07/01/2023]
Abstract
Small data are often used in scientific and engineering research due to the presence of various constraints, such as time, cost, ethics, privacy, security, and technical limitations in data acquisition. However, because big data have been the focus for the past decade, small data and their challenges have received little attention, even though they are technically more severe in machine learning (ML) and deep learning (DL) studies. Overall, the small data challenge is often compounded by issues such as data diversity, imputation, noise, imbalance, and high dimensionality. Fortunately, the current big data era is characterized by technological breakthroughs in ML, DL, and artificial intelligence (AI), which enable data-driven scientific discovery, and many advanced ML and DL technologies developed for big data have inadvertently provided solutions for small data problems. As a result, significant progress has been made in ML and DL for small data challenges in the past decade. In this review, we summarize and analyze several emerging potential solutions to small data challenges in molecular science, including chemical and biological sciences. We review both basic machine learning algorithms, such as linear regression, logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), kernel learning (KL), random forest (RF), and gradient boosting trees (GBT), and more advanced techniques, including artificial neural network (ANN), convolutional neural network (CNN), U-Net, graph neural network (GNN), generative adversarial network (GAN), long short-term memory (LSTM), autoencoder, transformer, transfer learning, active learning, graph-based semi-supervised learning, combining deep learning with traditional machine learning, and physical model-based data augmentation. We also briefly discuss the latest advances in these methods. Finally, we conclude the survey with a discussion of promising trends in small data challenges in molecular science.
Affiliation(s)
- Bozheng Dou
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Zailiang Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Ekaterina Merkurjev
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Lu Ke
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Long Chen
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Jian Jiang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Yueying Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Jie Liu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Bengong Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
|
11
|
De Ruvo S, Pio G, Vessio G, Volpe V. Forecasting and what-if analysis of new positive COVID-19 cases during the first three waves in Italy. Med Biol Eng Comput 2023:10.1007/s11517-023-02831-0. [PMID: 37316767 DOI: 10.1007/s11517-023-02831-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2022] [Accepted: 03/29/2023] [Indexed: 06/16/2023]
Abstract
The joint exploitation of data related to epidemiological, mobility, and restriction aspects of COVID-19 with machine learning algorithms can support the development of predictive models that can be used to forecast new positive cases and study the impact of more or less severe restrictions. In this work, we integrate heterogeneous data from several sources and solve a multivariate time series forecasting task, specifically targeting the Italian case at both national and regional levels, during the first three waves of the pandemic. The goal is to build a robust predictive model to predict the number of new cases over a given time horizon so that any restrictive actions can be better planned. In addition, we perform a what-if analysis based on the best-identified predictive models to evaluate the impact of specific restrictions on the trend of positive cases. Our focus on the first three waves is motivated by the fact that they represent a typical emergency scenario (when no stable cure or vaccine is available) that may occur when a new pandemic spreads. Our experimental results show that exploiting the considered heterogeneous data leads to accurate predictive models, reaching a WAPE of 5.75% at the national level. Furthermore, in the subsequent what-if analysis, we observed that strong all-in-one initiatives, such as total lockdowns, may not be adequate, while more specific and targeted solutions should be adopted. The developed models can help policy and decision-makers better plan intervention strategies and retrospectively analyze the effects of the decisions made at different scales.
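The WAPE figure quoted in this abstract is conventionally the weighted absolute percentage error: the sum of absolute forecast errors divided by the sum of absolute actual values. The sketch below assumes that standard definition; the function name `wape` and the daily case counts are hypothetical and are not taken from the paper.

```python
def wape(actual, forecast):
    """Weighted absolute percentage error: sum|a - f| / sum|a|."""
    if len(actual) != len(forecast):
        raise ValueError("series must have the same length")
    denominator = sum(abs(a) for a in actual)
    if denominator == 0:
        raise ValueError("WAPE is undefined when all actual values are zero")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / denominator


# Hypothetical daily new-case counts, for illustration only.
observed = [100, 200, 300]
predicted = [110, 190, 300]
print(f"WAPE = {100 * wape(observed, predicted):.2f}%")
```

Unlike the mean absolute percentage error, this ratio stays well defined on days with zero reported cases, as long as the series is not zero throughout.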
Affiliation(s)
- Serena De Ruvo
- Dept. of Computer Science, University of Bari Aldo Moro, Bari, Italy
| | - Gianvito Pio
- Dept. of Computer Science, University of Bari Aldo Moro, Bari, Italy.
- Big Data Lab, National Interuniversity Consortium for Informatics (CINI), Rome, Italy.
| | - Gennaro Vessio
- Dept. of Computer Science, University of Bari Aldo Moro, Bari, Italy
| | - Vincenzo Volpe
- Dept. of Computer Science, University of Bari Aldo Moro, Bari, Italy
|
12
|
Multi-dimensional experimental and computational exploration of metabolism pinpoints complex probiotic interactions. Metab Eng 2023; 76:120-132. [PMID: 36720400 DOI: 10.1016/j.ymben.2023.01.008] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/18/2022] [Revised: 12/13/2022] [Accepted: 01/21/2023] [Indexed: 01/29/2023]
Abstract
Multi-strain probiotics are widely regarded as effective products for improving gut microbiota stability and host health, providing advantages over single-strain probiotics. However, in general, it is unclear to what extent different strains would cooperate or compete for resources, and how the establishment of a common biofilm microenvironment could influence their interactions. In this work, we develop an integrative experimental and computational approach to comprehensively assess the metabolic functionality and interactions of probiotics across growth conditions. Our approach combines co-culture assays with genome-scale modelling of metabolism and multivariate data analysis, thus exploiting complementary data- and knowledge-driven systems biology techniques. To show the advantages of the proposed approach, we apply it to the study of the interactions between two widely used probiotic strains of Lactobacillus reuteri and Saccharomyces boulardii, characterising their production potential for compounds that can be beneficial to human health. Our results show that these strains can establish a mixed cooperative-antagonistic interaction best explained by competition for shared resources, with an increased individual exchange but an often decreased net production of amino acids and short-chain fatty acids. Overall, our work provides a strategy that can be used to explore microbial metabolic fingerprints of biotechnological interest, capable of capturing multifaceted equilibria even in simple microbial consortia.
|
13
|
Mwanga EP, Siria DJ, Mitton J, Mshani IH, González-Jiménez M, Selvaraj P, Wynne K, Baldini F, Okumu FO, Babayan SA. Using transfer learning and dimensionality reduction techniques to improve generalisability of machine-learning predictions of mosquito ages from mid-infrared spectra. BMC Bioinformatics 2023; 24:11. [PMID: 36624386 PMCID: PMC9830685 DOI: 10.1186/s12859-022-05128-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 07/26/2022] [Accepted: 12/26/2022] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND Old mosquitoes are more likely to transmit malaria than young ones. Therefore, accurate prediction of mosquito population age can drastically improve the evaluation of mosquito-targeted interventions. However, standard methods for age-grading mosquitoes are laborious and costly. We have shown that mid-infrared spectroscopy (MIRS) can be used to detect age-specific patterns in mosquito cuticles and thus can be used to train age-grading machine learning models. However, these models tend to transfer poorly across populations. Here, we investigate whether applying dimensionality reduction and transfer learning to MIRS data can improve the transferability of MIRS-based predictions for mosquito ages. METHODS We reared adults of the malaria vector Anopheles arabiensis in two insectaries. The heads and thoraces of female mosquitoes, which were grouped into two different age classes, were scanned using an attenuated total reflection-Fourier transform infrared spectrometer. The dimensionality of the spectral data was reduced using unsupervised principal component analysis or t-distributed stochastic neighbour embedding, and the reduced data were then used to train deep learning and standard machine learning classifiers. Transfer learning was also evaluated to improve transferability of the models when predicting mosquito age classes from new populations. RESULTS Model accuracies for predicting the age of mosquitoes from the same population as the training samples reached 99% for deep learning and 92% for standard machine learning. However, these models did not generalise to a different population, achieving only 46% and 48% accuracy for deep learning and standard machine learning, respectively. Dimensionality reduction did not improve model generalisability but reduced computational time.
Transfer learning by updating pre-trained models with 2% of mosquitoes from the alternate population improved performance to ~ 98% accuracy for predicting mosquito age classes in the alternative population. CONCLUSION Combining dimensionality reduction and transfer learning can reduce computational costs and improve the transferability of both deep learning and standard machine learning models for predicting the age of mosquitoes. Future studies should investigate the optimal quantities and diversity of training data necessary for transfer learning and the implications for broader generalisability to unseen datasets.
Affiliation(s)
- Emmanuel P. Mwanga
- Environmental Health and Ecological Sciences Department, Ifakara Health Institute, Morogoro, Tanzania; School of Biodiversity, One Health, and Veterinary Medicine, University of Glasgow, Glasgow, G12 8QQ UK
| | - Doreen J. Siria
- Environmental Health and Ecological Sciences Department, Ifakara Health Institute, Morogoro, Tanzania
| | - Joshua Mitton
- School of Biodiversity, One Health, and Veterinary Medicine, University of Glasgow, Glasgow, G12 8QQ UK; School of Computing Science, University of Glasgow, Glasgow, G12 8QQ UK
| | - Issa H. Mshani
- Environmental Health and Ecological Sciences Department, Ifakara Health Institute, Morogoro, Tanzania; School of Biodiversity, One Health, and Veterinary Medicine, University of Glasgow, Glasgow, G12 8QQ UK
| | - Mario González-Jiménez
- School of Chemistry, University of Glasgow, Glasgow, G12 8QQ UK
| | | | - Klaas Wynne
- School of Chemistry, University of Glasgow, Glasgow, G12 8QQ UK
| | - Francesco Baldini
- School of Biodiversity, One Health, and Veterinary Medicine, University of Glasgow, Glasgow, G12 8QQ UK
| | - Fredros O. Okumu
- Environmental Health and Ecological Sciences Department, Ifakara Health Institute, Morogoro, Tanzania; School of Biodiversity, One Health, and Veterinary Medicine, University of Glasgow, Glasgow, G12 8QQ UK; School of Public Health, University of Witwatersrand, Johannesburg, South Africa
| | - Simon A. Babayan
- School of Biodiversity, One Health, and Veterinary Medicine, University of Glasgow, Glasgow, G12 8QQ UK
|
14
|
Shtar G, Greenstein-Messica A, Mazuz E, Rokach L, Shapira B. Predicting drug characteristics using biomedical text embedding. BMC Bioinformatics 2022; 23:526. [PMID: 36476573 PMCID: PMC9730627 DOI: 10.1186/s12859-022-05083-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2022] [Accepted: 11/25/2022] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Drug-drug interactions (DDIs) are preventable causes of medical injuries and often result in doctor and emergency room visits. Previous research demonstrates the effectiveness of using matrix completion approaches based on known drug interactions to predict unknown DDIs. However, in the case of a new drug, where there is limited or no knowledge regarding the drug's existing interactions, such an approach is unsuitable, and other drugs' preferences can be used to accurately predict new DDIs. METHODS We propose adjacency biomedical text embedding (ABTE) to address this limitation by using a hybrid approach that combines known drug interactions with the drug's biomedical text embeddings to predict the DDIs of both new and well-known drugs. RESULTS Our evaluation demonstrates the superiority of this approach compared to recently published DDI prediction models and matrix factorization-based approaches. Furthermore, we compared the use of different text embedding methods in ABTE, and found that the concept embedding approach, which involves biomedical information in the embedding process, provides the highest performance for this task. Additionally, we demonstrate the effectiveness of leveraging biomedical text embedding for an additional drug-related biomedical prediction task by presenting the contribution of text embedding to a multi-modal pregnancy drug safety classification. CONCLUSION Text and concept embeddings created by analyzing domain-specific large-scale biomedical corpora can be used for predicting drug-related properties such as DDIs and drug safety. Prediction models based on the embeddings produced results comparable to hand-crafted features; however, text embeddings do not require manual categorization or data collection and rely solely on the published literature.
Affiliation(s)
- Guy Shtar
- Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
| | - Asnat Greenstein-Messica
- Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
| | - Eyal Mazuz
- Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
| | - Lior Rokach
- Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
| | - Bracha Shapira
- Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
|
15
|
Magazzù G, Zampieri G, Angione C. Clinical stratification improves the diagnostic accuracy of small omics datasets within machine learning and genome-scale metabolic modelling methods. Comput Biol Med 2022; 151:106244. [PMID: 36343407 DOI: 10.1016/j.compbiomed.2022.106244] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/15/2022] [Revised: 10/07/2022] [Accepted: 10/22/2022] [Indexed: 12/27/2022]
Abstract
BACKGROUND Recently, multi-omic machine learning architectures have been proposed for the early detection of cancer. However, for rare cancers and their associated small datasets, it is still unclear how to use the limited multi-omics data available to achieve a mechanistic prediction of cancer onset and progression. Hepatoblastoma is the most frequent liver cancer in infancy and childhood, and its incidence has lately been increasing in several developed countries. Even though some studies have been conducted to understand the causes of its onset and discover potential biomarkers, the role of metabolic rewiring has not been investigated in depth so far. METHODS Here, we propose and implement an interpretable multi-omics pipeline that combines mechanistic knowledge from genome-scale metabolic models with machine learning algorithms, and we use it to characterise the underlying mechanisms controlling hepatoblastoma. RESULTS AND CONCLUSIONS While the obtained machine learning models generally present a high diagnostic classification accuracy, our results show that the type of omics combinations used as input to the machine learning models strongly affects the detection of important genes, reactions and metabolic pathways linked to hepatoblastoma. Our method also suggests that, in the context of computer-aided diagnosis of cancer, optimal diagnostic accuracy can be achieved by adopting a combination of omics that depends on the patient's clinical characteristics.
Affiliation(s)
- Giuseppe Magazzù
- School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, England, United Kingdom
| | - Guido Zampieri
- School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, England, United Kingdom; Department of Biology, University of Padova, Padova, Italy
| | - Claudio Angione
- School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, England, United Kingdom; Centre for Digital Innovation, Teesside University, Middlesbrough, England, United Kingdom; National Horizons Centre, Teesside University, Darlington, England, United Kingdom.
|
16
|
Muneeb M, Feng S, Henschel A. Transfer learning for genotype-phenotype prediction using deep learning models. BMC Bioinformatics 2022; 23:511. [PMID: 36447153 PMCID: PMC9710151 DOI: 10.1186/s12859-022-05036-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/30/2022] [Accepted: 11/05/2022] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND For some understudied populations, the available genotype data are too limited for genotype-phenotype prediction. However, we can use the data of other, larger populations to learn about the disease-causing SNPs and use that knowledge for the genotype-phenotype prediction of small populations. This manuscript illustrates that transfer learning is applicable to genotype data and genotype-phenotype prediction. RESULTS Using HAPGEN2 and PhenotypeSimulator, we generated eight phenotypes for 500 cases/500 controls (CEU, large population) and 100 cases/100 controls (YRI, small population). We considered 5 (4 phenotypes) and 10 (4 phenotypes) different risk SNPs for each phenotype to evaluate the proposed method. The improvement in accuracy with transfer learning for the eight different phenotypes was between 2 and 14.2 percent. The two-tailed p-value between the classification accuracies for all phenotypes without transfer learning and with transfer learning was 0.0306 for the five-risk-SNP phenotypes and 0.0478 for the ten-risk-SNP phenotypes. CONCLUSION The proposed pipeline is used to transfer knowledge for the case/control classification of the small population. In addition, we argue that this method can also be used in the realm of endangered species and personalized medicine. If the large population data is extensive compared to the small population data, transfer learning results can be expected to improve significantly. We show that transfer learning is capable of creating powerful models for genotype-phenotype prediction in large, well-studied populations and of fine-tuning these models to populations where data are sparse.
Affiliation(s)
- Muhammad Muneeb
- Department of Electrical Engineering and Computer Science, Khalifa University of Science and Technology, Al Saada St - Zone 1, Abu Dhabi, United Arab Emirates
| | - Samuel Feng
- Department of Science and Engineering, Sorbonne University Abu Dhabi, PO Box 38044, Abu Dhabi, United Arab Emirates
| | - Andreas Henschel
- Department of Electrical Engineering and Computer Science, Khalifa University of Science and Technology, Al Saada St - Zone 1, Abu Dhabi, United Arab Emirates
|
17
|
Zhang L, Xu J, Chu X, Zhang H, Yao X, Zhang J, Guo Y. Identification of the cuproptosis-related molecular subtypes and an immunotherapy prognostic model in hepatocellular carcinoma. BMC Bioinformatics 2022; 23:485. [PMID: 36384423 PMCID: PMC9667659 DOI: 10.1186/s12859-022-04997-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/25/2022] [Accepted: 10/20/2022] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Cuproptosis, a newly discovered mode of cell death, has been little studied in hepatocellular carcinoma (HCC). Exploring the molecular characteristics of different subtypes of HCC based on cuproptosis-related genes (CRGs) is meaningful for HCC. In addition, immunotherapy plays a pivotal role in treating HCC. Exploring the sensitivity to immunotherapy and building predictive models are critical for HCC. METHODS The 357 HCC samples from the TCGA database were classified into three subtypes, Cluster 1, Cluster 2, and Cluster 3, based on the expression levels of ten CRGs using consensus clustering. Six machine learning algorithms were used to build models that identified the three subtypes. The molecular features of the three subtypes were analyzed and compared from several perspectives. Moreover, based on the differentially expressed genes (DEGs) between Cluster 1 and Cluster 3, a prognostic scoring model was constructed using LASSO regression and Cox regression, and the scoring model was used to predict the efficacy of immunotherapy in the IMvigor210 cohort. RESULTS Cluster 3 had the worst overall survival compared to Cluster 1 and Cluster 2 (P = 0.0048). The AUC of the CatBoost model used to identify Cluster 3 was 0.959. Cluster 3 was significantly different from the other two subtypes in gene mutation, tumor mutation burden, tumor microenvironment, the expression of immune checkpoint inhibitor genes and N6-methyladenosine regulatory genes, and the sensitivity to sorafenib. Based on these analyses, we believe Cluster 3 is more sensitive to immunotherapy. Therefore, based on the DEGs between Cluster 1 and Cluster 3, we obtained a 7-gene prognostic scoring model, which achieved meaningful results in predicting immunotherapy efficacy in the IMvigor210 cohort (P = 0.013). CONCLUSIONS Our study provides new ideas for molecular characterization and immunotherapy of HCC from machine learning and bioinformatics.
Moreover, we successfully constructed a prognostic model of immunotherapy.
Affiliation(s)
- Li Zhang
- Department of Oncology, The Fifth Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| | - Jingwei Xu
- Department of Oncology, The Fifth Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| | - Xiufeng Chu
- Department of Oncology, The Fifth Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| | - Hongqiao Zhang
- Department of Oncology, The Fifth Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| | - Xueyuan Yao
- Department of Oncology, The Fifth Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| | - Jian Zhang
- Department of Oncology, The Fifth Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| | - Yanwei Guo
- Department of Oncology, The Fifth Affiliated Hospital of Zhengzhou University, Zhengzhou, China.
|
18
|
Transfer how much: a fine-grained measure of the knowledge transferability of user behavior sequences in social network. Data Min Knowl Discov 2022. [DOI: 10.1007/s10618-022-00857-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
|
19
|
Chen H, Liu J, Hua C, Feng J, Pang B, Cao D, Li C. Accurate classification of white blood cells by coupling pre-trained ResNet and DenseNet with SCAM mechanism. BMC Bioinformatics 2022; 23:282. [PMID: 35840897 PMCID: PMC9287918 DOI: 10.1186/s12859-022-04824-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/30/2022] [Accepted: 07/07/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Counting the different kinds of white blood cells (WBCs) provides a good quantitative description of a person's health status and thus forms a critical basis for the early treatment of several diseases. Correct classification of WBCs is therefore crucial. Unfortunately, manual microscopic evaluation is complicated, time-consuming, and subjective, so its statistical reliability is limited. Hence, the automatic and accurate identification of WBCs is of great benefit. However, the similarity between WBC samples and the imbalance and insufficiency of samples in the field of medical computer vision bring challenges to intelligent and accurate classification of WBCs. To tackle these challenges, this study proposes a deep learning framework that couples the pre-trained ResNet and DenseNet with SCAM (spatial and channel attention module) for accurately classifying WBCs. RESULTS In the proposed network, ResNet and DenseNet enable information reuse and new information exploration, respectively, which are both important and compatible for learning good representations. Meanwhile, the SCAM module sequentially infers attention maps from two separate dimensions of space and channel to emphasize important information or suppress unnecessary information, further enhancing the representation power of our model for WBCs to overcome the limitation of sample similarity. Moreover, data augmentation and transfer learning techniques are used to handle data imbalance and insufficiency. In addition, the mixup approach is adopted for modeling the vicinity relation across training samples of different categories to increase the generalizability of the model. In a comparison with five representative networks on our developed LDWBC dataset and the publicly available LISC, BCCD, and Raabin WBC datasets, our model achieves the best overall performance.
We also implement the occlusion testing by the gradient-weighted class activation mapping (Grad-CAM) algorithm to improve the interpretability of our model. CONCLUSION The proposed method has great potential for application in intelligent and accurate classification of WBCs.
Affiliation(s)
- Hua Chen
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Juan Liu
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, 430072, China.
| | - Chunbing Hua
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Jing Feng
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Baochuan Pang
- Landing Artificial Intelligence Center for Pathological Diagnosis, Wuhan, 430072, China
| | - Dehua Cao
- Landing Artificial Intelligence Center for Pathological Diagnosis, Wuhan, 430072, China
| | - Cheng Li
- Landing Artificial Intelligence Center for Pathological Diagnosis, Wuhan, 430072, China
|
20
|
Enhancement of Image Classification Using Transfer Learning and GAN-Based Synthetic Data Augmentation. MATHEMATICS 2022. [DOI: 10.3390/math10091541] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/10/2022]
Abstract
Plastic bottle recycling plays a crucial role in mitigating environmental degradation and in environmental protection. Position and background should be the same to classify plastic bottles on a conveyor belt. The manual detection of plastic bottles is time consuming and leads to human error. Hence, the automatic classification of plastic bottles using deep learning (DL) techniques can yield more accurate results and reduce cost. Achieving good results with a DL model requires a large volume of training data. We propose a GAN-based model to generate synthetic images similar to the original. To improve the image synthesis quality with less training time and decrease the chances of mode collapse, we propose a modified lightweight-GAN model, which consists of a generator and a discriminator with an auto-encoding feature to capture essential parts of the input image and to encourage the generator to produce a wide range of real data. A newly designed weighted average ensemble model based on two pre-trained models, InceptionV3 and Xception, then classifies transparent plastic bottles with an improved classification accuracy of 99.06%.
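A weighted average ensemble of the kind this abstract mentions can be sketched as below. The abstract does not state how the weights are chosen or how the two networks' outputs are combined, so the normalised weighted mean of class probabilities, the function name `weighted_average_ensemble`, and all numbers here are assumptions for illustration only.

```python
def weighted_average_ensemble(model_probs, weights):
    """Combine per-model class probabilities for one sample by weighted mean.

    model_probs: one probability vector per model, all over the same classes.
    weights: one non-negative weight per model (hypothetical values).
    Returns the averaged probabilities and the index of the winning class.
    """
    if len(model_probs) != len(weights) or not model_probs:
        raise ValueError("need exactly one weight per model")
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(model_probs[0])
    if any(len(p) != n_classes for p in model_probs):
        raise ValueError("all models must score the same classes")
    averaged = [
        sum(w * p[c] for w, p in zip(weights, model_probs)) / weight_sum
        for c in range(n_classes)
    ]
    predicted_class = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted_class


# Hypothetical softmax outputs from two classifiers for one image.
probs_model_a = [0.6, 0.4]
probs_model_b = [0.3, 0.7]
averaged, label = weighted_average_ensemble([probs_model_a, probs_model_b], [0.5, 0.5])
print(averaged, label)
```

Weights are often tuned on a validation set so that the stronger model contributes more; with equal weights this reduces to plain probability averaging.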
|
21
|
Abstract
In fault-diagnosis classification, a pressing issue is the lack of target-fault samples. Obtaining fault data requires a great amount of time, energy and financial resources, and this scarcity affects the accuracy of diagnosis. To address this problem, a novel fault-diagnosis-classification optimization method, TLSCA-SVM, which combines the sine cosine algorithm and support vector machine (SCA-SVM) with transfer learning, is proposed here. Considering the availability of fault data, this work analyzes data generated by analog circuits under different faults. Firstly, data signals are collected from different faults of the analog circuit, and characteristic data are extracted from these signals by wavelet packets. Secondly, principal component analysis (PCA) is employed to reduce the feature-value dimension. Lastly, as an auxiliary condition, an error-penalty term is added to the objective function of the SCA-SVM classifier to construct the fault-diagnosis model TLSCA-SVM. The Sallen–Key bandpass filter circuit and the CSTV filter circuit provide the data for horizontal- and vertical-contrast classification results. A comparison of SCA with five other optimization algorithms shows that SCA-optimized parameters have certain advantages in classification accuracy and speed. Additionally, to demonstrate the superiority of the SCA-SVM classification algorithm, it is compared with five other classification algorithms. Simulation results show that SCA-SVM classification has higher precision and a faster response time than the others. After the error-penalty term is added to SCA-SVM, TLSCA-SVM requires fewer fault samples to perform fault diagnosis. Ultimately, the proposed method not only performs fault diagnosis effectively and quickly, but also achieves the effect of transfer learning when little failure data is available.
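The PCA dimension-reduction step named in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the helper `pca_reduce`, the synthetic feature matrix, and the choice of two components are hypothetical, and the random data merely stands in for wavelet-packet features, which the abstract does not provide.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project (n_samples, n_features) data onto its top principal components.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    X = np.asarray(features, dtype=float)
    if not 1 <= n_components <= min(X.shape):
        raise ValueError("n_components must be between 1 and min(X.shape)")
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    reduced = centered @ vt[:n_components].T
    return reduced, explained[:n_components]


rng = np.random.default_rng(0)
# Synthetic stand-in: 8 feature columns driven by 2 latent factors plus noise.
latent = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 8))
features = latent @ mixing + 0.01 * rng.normal(size=(50, 8))

reduced, ratio = pca_reduce(features, n_components=2)
print(reduced.shape, ratio.sum())
```

The reduced matrix would then be passed to a classifier; the SCA-tuned SVM and the error-penalty term described in the abstract are not reproduced here.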
|
22
|
An Effective Multi-Scale Feature Network for Detecting Connector Solder Joint Defects. MACHINES 2022. [DOI: 10.3390/machines10020094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
With the rapid development of industry, requirements for the functionality, stability, and safety of electronic products keep rising. As cables and connectors are an important medium for power supply and information transmission in electronic products, high-quality soldering of them ensures that devices operate normally. In this paper, we propose a multi-level feature detection network based on multi-level feature map fusion and feature enhancement for detecting connector solder joints: it classifies and locates qualified solder joints and detects seven common types of defective solder joint. We propose a new feature map up-sampling algorithm and introduce a feature enhancement module, which better preserves the semantic information of higher-level feature maps while enhancing the fused feature maps and weakening the effect of noise. In comparison experiments, the mAP of the proposed network reaches 0.929 and the top-1 accuracy reaches 92%. The detection capability for each type of solder joint is greatly improved compared with other networks, which can assist engineers in inspecting solder joint quality and thus reduce their workload.
|