1
Liao L, Xie M, Zheng X, Zhou Z, Deng Z, Gao J. Molecular insights fast-tracked: AI in biosynthetic pathway research. Nat Prod Rep 2025; 42:911-936. [PMID: 40130306 DOI: 10.1039/d4np00003j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/26/2025]
Abstract
Covering: 2000 to 2025. This review explores the potential of artificial intelligence (AI) in addressing challenges and accelerating molecular insights in biosynthetic pathway research, which is crucial for developing bioactive natural products with applications in pharmacology, agriculture, and biotechnology. It provides an overview of various AI techniques relevant to this research field, including machine learning (ML), deep learning (DL), natural language processing, network analysis, and data mining. AI-powered applications across three main areas, namely, pathway discovery and mining, pathway design, and pathway optimization, are discussed, and the benefits and challenges of integrating omics data and AI for enhanced pathway research are also elucidated. This review also addresses the current limitations, future directions, and the importance of synergy between AI and experimental approaches in unlocking rapid advancements in biosynthetic pathway research. The review concludes with an evaluation of AI's current capabilities and future outlook, emphasizing the transformative impact of AI on biosynthetic pathway research and the potential for new opportunities in the discovery and optimization of bioactive natural products.
Affiliation(s)
- Lijuan Liao
  - Key BioAI Synthetica Lab for Natural Product Drug Discovery, College of Bee, Biomedical and Pharmaceutical Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China.
  - State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, P. R. China
- Mengjun Xie
  - Key BioAI Synthetica Lab for Natural Product Drug Discovery, College of Bee, Biomedical and Pharmaceutical Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China.
- Xiaoshan Zheng
  - Key BioAI Synthetica Lab for Natural Product Drug Discovery, College of Bee, Biomedical and Pharmaceutical Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China.
- Zhao Zhou
  - Key BioAI Synthetica Lab for Natural Product Drug Discovery, College of Bee, Biomedical and Pharmaceutical Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China.
- Zixin Deng
  - State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic and Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
- Jiangtao Gao
  - Key BioAI Synthetica Lab for Natural Product Drug Discovery, College of Bee, Biomedical and Pharmaceutical Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China.
2
Kong Y, Chen H, Huang X, Chang L, Yang B, Chen W. Precise metabolic modeling in post-omics era: accomplishments and perspectives. Crit Rev Biotechnol 2025; 45:683-701. [PMID: 39198033 DOI: 10.1080/07388551.2024.2390089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 07/18/2024] [Accepted: 07/23/2024] [Indexed: 09/01/2024]
Abstract
Microbes have been extensively utilized for their sustainable and scalable properties in synthesizing desired bio-products. However, insufficient knowledge about intracellular metabolism has impeded further microbial applications. Genome-scale metabolic models (GEMs) play a pivotal role in facilitating a global understanding of cellular metabolic mechanisms. These models enable rational modification by exploring metabolic pathways and predicting potential targets in microorganisms, enabling precise cell regulation without experimental costs. Nonetheless, a simplified GEM considers only genome information and network stoichiometry while neglecting other important bio-information, such as enzyme functions, thermodynamic properties, and kinetic parameters. Consequently, uncertainties persist, particularly when predicting microbial behaviors in complex and fluctuating systems. The advent of the omics era, with its massive quantification of genes, proteins, and metabolites under various conditions, has led to the flourishing of multi-constrained models and updated algorithms with improved predictive power and broadened dimension. Meanwhile, machine learning (ML) has demonstrated exceptional analytical and predictive capacities when applied to training sets of biological big data. Incorporating the discriminant strength of ML with GEM facilitates mechanistic modeling efficiency and improves predictive accuracy. This paper provides an overview of research innovations in the GEM, including multi-constrained modeling, analytical approaches, and the latest applications of ML, which may contribute comprehensive knowledge toward genetic refinement, strain development, and yield enhancement for a broad range of biomolecules.
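As an illustration of what "network stoichiometry" means in the abstract above, the sketch below checks the standard steady-state mass balance S·v = 0 used in constraint-based modeling. It is a toy example and not taken from the reviewed paper: the three-reaction network, the metabolite names, the flux values, and the function names `net_production` and `is_steady_state` are all assumptions made for illustration.

```python
# Toy illustration only: the network, metabolite names, and fluxes are hypothetical.


def net_production(stoich, fluxes):
    """Return S·v, the net production rate of each internal metabolite."""
    return [sum(s * v for s, v in zip(row, fluxes)) for row in stoich]


def is_steady_state(stoich, fluxes, tol=1e-9):
    """True when every internal metabolite is balanced (S·v = 0 within tol)."""
    return all(abs(rate) <= tol for rate in net_production(stoich, fluxes))


# Rows are internal metabolites A and B; columns are reactions:
#   r1: (uptake) -> A,   r2: A -> B,   r3: B -> (secretion)
S = [
    [1, -1, 0],   # A: produced by r1, consumed by r2
    [0, 1, -1],   # B: produced by r2, consumed by r3
]

balanced = [2.0, 2.0, 2.0]     # same flux through the whole chain
unbalanced = [2.0, 1.0, 1.0]   # A accumulates because r2 is slower than r1

print(is_steady_state(S, balanced))
print(net_production(S, unbalanced))
```

A stoichiometry-only model of this kind constrains fluxes but says nothing about enzyme capacity, thermodynamics, or kinetics, which is the gap the multi-constrained models discussed in the abstract aim to fill.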
Affiliation(s)
- Yawen Kong
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, P. R. China
  - School of Food Science and Technology, Jiangnan University, Wuxi, P. R. China
- Haiqin Chen
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, P. R. China
  - School of Food Science and Technology, Jiangnan University, Wuxi, P. R. China
- Xinlei Huang
  - The Key Laboratory of Industrial Biotechnology, School of Biotechnology, Jiangnan University, Wuxi, P. R. China
- Lulu Chang
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, P. R. China
  - School of Food Science and Technology, Jiangnan University, Wuxi, P. R. China
- Bo Yang
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, P. R. China
  - School of Food Science and Technology, Jiangnan University, Wuxi, P. R. China
- Wei Chen
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, P. R. China
  - School of Food Science and Technology, Jiangnan University, Wuxi, P. R. China
  - National Engineering Research Center for Functional Food, Jiangnan University, Wuxi, P. R. China
3
Hu J, Zhang Y, Xie J, Yuan Z, Yin Z, Shi S, Li H, Li S. Learning motif features and topological structure of molecules for metabolic pathway prediction. J Cheminform 2025; 17:56. [PMID: 40259421 PMCID: PMC12013036 DOI: 10.1186/s13321-025-00994-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 03/21/2025] [Indexed: 04/23/2025] Open
Abstract
Metabolites serve as crucial biomarkers for assessing disease progression and understanding underlying pathogenic mechanisms. However, when the metabolic pathway category of metabolites is unknown, researchers face challenges in conducting metabolomic analyses. Due to the complexity of wet laboratory experimentation for pathway identification, there is a growing demand for predictive methods. Various computational approaches, including machine learning and graph neural networks, have been proposed; however, interpretability remains a challenge. We have developed a neural network framework called MotifMol3D, which is designed for predicting molecular metabolic pathway categories. This framework introduces motif information to mine local features of small-sample molecules, combining it with a graph neural network and 3D information to complete the prediction task. Using a dataset of 5,698 molecules that participate in 11 metabolic pathway categories in the KEGG database, MotifMol3D outperformed state-of-the-art methods in precision, recall, and F1 score. In addition, an ablation study and motif analysis have demonstrated the effectiveness and usefulness of the model. Motif analysis, in particular, has shown that motif information can characterize the main features of specific pathway molecules to a certain extent and enhance the interpretability of the model. An external validation further corroborates this observation. MotifMol3D is an open-source tool that is available at https://github.com/Irena-Zhang/MotifMol3D.git.
Scientific contribution: MotifMol3D integrates motif information, graph neural networks, and 3D structural data to enhance feature extraction for small-sample molecules, improving the precision and interpretability of metabolic pathway predictions. The model outperforms state-of-the-art approaches in precision, recall, and F1 score. This work reveals how motif information characterizes pathway-specific molecules, offering novel insights into molecular properties within metabolic pathways.
Affiliation(s)
- Jianguo Hu
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Yiqing Zhang
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Jinxin Xie
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Zhen Yuan
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Zhangxiang Yin
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Shanshan Shi
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Honglin Li
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China.
  - Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, Shanghai, 200062, China.
  - Lingang Laboratory, Shanghai, 200031, China.
- Shiliang Li
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China.
  - Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, Shanghai, 200062, China.
4
Huckvale ED, Moseley HNB. Chemical representation standardization needed to generalize metabolic pathway involvement prediction across the Kyoto Encyclopedia of Genes and Genomes, Reactome, and MetaCyc knowledgebases. bioRxiv 2025:2025.04.02.646918. [PMID: 40291671 PMCID: PMC12026579 DOI: 10.1101/2025.04.02.646918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/30/2025]
Abstract
Motivation: Due to the utility of knowing the pathway involvement of compounds detected in biological experiments, knowledgebases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and MetaCyc have aggregated pathway annotations of compounds. However, these annotations are largely incomplete and are costly to obtain experimentally and curate from published scientific literature.
Results: We constructed a new dataset using compounds and their pathway annotations from KEGG, Reactome, and MetaCyc. Using this dataset, we trained and tested an extreme classification model that classifies 8,195 unique pathways based on compound chemical representations with a mean Matthews correlation coefficient (MCC) of 0.9036 ± 0.0033. During model evaluation, we discovered an inconsistency in chemical representations across knowledgebases, which was alleviated by standardizing the chemical representations using InChI (IUPAC International Chemical Identifier) canonicalization. Next, we compared the MCC between compounds and their cross-knowledgebase references. The non-standardized chemical representations had a huge 0.2687 drop in MCC while the standardized chemical representations only had a 0.0384 drop in MCC. Thus, standardizing chemical representation is an essential step when predicting on novel chemical representations.
Availability and implementation: All code and data for reproducing the results of this manuscript are available in the following figshare items: Manuscript main results: https://doi.org/10.6084/m9.figshare.28701845; CV analysis of model and dataset of prior studies: https://doi.org/10.6084/m9.figshare.28701590.
Contact: hunter.moseley@uky.edu.
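Several entries in this list report performance as a Matthews correlation coefficient. For readers unfamiliar with the metric, the following minimal sketch computes MCC for a binary classifier from confusion-matrix counts using the standard formula. The function name `mcc_from_counts` and the example counts are hypothetical, and this is not the evaluation code of any cited study.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    By convention, 0.0 is returned when the denominator is zero.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, chosen only to show the range of the metric.
print(mcc_from_counts(tp=50, tn=50, fp=0, fn=0))     # perfect agreement
print(mcc_from_counts(tp=25, tn=25, fp=25, fn=25))   # no better than chance
print(mcc_from_counts(tp=0, tn=0, fp=50, fn=50))     # total disagreement
```

Because MCC uses all four cells of the confusion matrix, it stays informative on heavily imbalanced data, which is the typical situation when most compound-pathway pairs are negative.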
5
Basnet BB, Zhou ZY, Wei B, Wang H. Advances in AI-based strategies and tools to facilitate natural product and drug development. Crit Rev Biotechnol 2025:1-32. [PMID: 40159111 DOI: 10.1080/07388551.2025.2478094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2024] [Revised: 02/11/2025] [Accepted: 02/16/2025] [Indexed: 04/02/2025]
Abstract
Natural products and their derivatives have been important for treating diseases in humans, animals, and plants. However, discovering new structures from natural sources is still challenging. In recent years, artificial intelligence (AI) has greatly aided the discovery and development of natural products and drugs. AI helps to connect genetic data to chemical structures or vice versa, repurpose known natural products, predict metabolic pathways, and design and optimize metabolite biosynthesis. More recently, the emergence and improvement of neural networks such as deep learning, and of automated web-based bioinformatics platforms, have sped up the discovery process. Meanwhile, AI also improves the identification and structure elucidation of unknown compounds from raw data such as mass spectrometry and nuclear magnetic resonance. This article reviews these AI-driven methods and tools, highlighting their practical applications and offering guidance for efficient natural product discovery and drug development.
Affiliation(s)
- Buddha Bahadur Basnet
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, China
  - Central Department of Biotechnology, Tribhuvan University, Kathmandu, Nepal
- Zhen-Yi Zhou
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, China
- Bin Wei
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, China
- Hong Wang
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, China
  - Key Laboratory of Marine Fishery Resources Exploitment, Utilization of Zhejiang Province, Zhejiang University of Technology, Hangzhou, China
6
Kiouri DP, Batsis GC, Chasapis CT. Structure-Based Approaches for Protein-Protein Interaction Prediction Using Machine Learning and Deep Learning. Biomolecules 2025; 15:141. [PMID: 39858535 PMCID: PMC11763140 DOI: 10.3390/biom15010141] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 12/12/2024] [Revised: 01/11/2025] [Accepted: 01/14/2025] [Indexed: 01/27/2025] Open
Abstract
Protein-Protein Interaction (PPI) prediction plays a pivotal role in understanding cellular processes and uncovering molecular mechanisms underlying health and disease. Structure-based PPI prediction has emerged as a robust alternative to sequence-based methods, offering greater biological accuracy by integrating three-dimensional spatial and biochemical features. This work summarizes the recent advances in computational approaches leveraging protein structure information for PPI prediction, focusing on machine learning (ML) and deep learning (DL) techniques. These methods not only improve predictive accuracy but also provide insights into functional sites, such as binding and catalytic residues. However, challenges such as limited high-resolution structural data and the need for effective negative sampling persist. Through the integration of experimental and computational tools, structure-based prediction paves the way for comprehensive proteomic network analysis, holding promise for advancements in drug discovery, biomarker identification, and personalized medicine. Future directions include enhancing scalability and dataset reliability to expand these approaches across diverse proteomes.
Affiliation(s)
- Despoina P. Kiouri
  - Institute of Chemical Biology, National Hellenic Research Foundation, 11635 Athens, Greece
  - Laboratory of Organic Chemistry, Department of Chemistry, National and Kapodistrian University of Athens, 15772 Athens, Greece
- Georgios C. Batsis
  - Institute of Chemical Biology, National Hellenic Research Foundation, 11635 Athens, Greece
- Christos T. Chasapis
  - Institute of Chemical Biology, National Hellenic Research Foundation, 11635 Athens, Greece
7
Luo Y, Zhao C, Chen F. Multiomics Research: Principles and Challenges in Integrated Analysis. BioDesign Research 2024; 6:0059. [PMID: 39990095 PMCID: PMC11844812 DOI: 10.34133/bdr.0059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2024] [Revised: 10/24/2024] [Accepted: 10/28/2024] [Indexed: 02/25/2025] Open
Abstract
Multiomics research is a transformative approach in the biological sciences that integrates data from genomics, transcriptomics, proteomics, metabolomics, and other omics technologies to provide a comprehensive understanding of biological systems. This review elucidates the fundamental principles of multiomics, emphasizing the necessity of data integration to uncover the complex interactions and regulatory mechanisms underlying various biological processes. We explore the latest advances in computational methodologies, including deep learning, graph neural networks (GNNs), and generative adversarial networks (GANs), which facilitate the effective synthesis and interpretation of multiomics data. Additionally, this review addresses the critical challenges in this field, such as data heterogeneity, scalability, and the need for robust, interpretable models. We highlight the potential of large language models to enhance multiomics analysis through automated feature extraction, natural language generation, and knowledge integration. Despite the important promise of multiomics, the review acknowledges the substantial computational resources required and the complexity of model tuning, underscoring the need for ongoing innovation and collaboration in the field. This comprehensive analysis aims to guide researchers in navigating the principles and challenges of multiomics research to foster advances in integrative biological analysis.
Affiliation(s)
- Yunqing Luo
  - National Key Laboratory for Tropical Crop Breeding, College of Breeding and Multiplication, Sanya Institute of Breeding and Multiplication, Hainan University, Sanya 572025, China
  - College of Tropical Agriculture and Forestry, Hainan University, Danzhou 571700, China
- Chengjun Zhao
  - National Key Laboratory for Tropical Crop Breeding, College of Breeding and Multiplication, Sanya Institute of Breeding and Multiplication, Hainan University, Sanya 572025, China
  - College of Tropical Agriculture and Forestry, Hainan University, Danzhou 571700, China
- Fei Chen
  - National Key Laboratory for Tropical Crop Breeding, College of Breeding and Multiplication, Sanya Institute of Breeding and Multiplication, Hainan University, Sanya 572025, China
  - College of Tropical Agriculture and Forestry, Hainan University, Danzhou 571700, China
8
Huckvale ED, Moseley HNB. Predicting the Pathway Involvement of All Pathway and Associated Compound Entries Defined in the Kyoto Encyclopedia of Genes and Genomes. Metabolites 2024; 14:582. [PMID: 39590818 PMCID: PMC11596622 DOI: 10.3390/metabo14110582] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2024] [Revised: 10/21/2024] [Accepted: 10/25/2024] [Indexed: 11/28/2024] Open
Abstract
Background/Objectives: Predicting the biochemical pathway involvement of a compound could facilitate the interpretation of biological and biomedical research. Prior prediction approaches have largely focused on metabolism, training machine learning models to solely predict based on metabolic pathways. However, there are many other types of pathways in cells and organisms that are of interest to biologists. Methods: While several publications have made use of the metabolites and metabolic pathways available in the Kyoto Encyclopedia of Genes and Genomes (KEGG), we downloaded all the compound entries with pathway annotations available in the KEGG. From these data, we constructed a dataset where each entry contained features representing compounds combined with features representing pathways, followed by a binary label indicating whether the given compound is associated with the given pathway. We trained multi-layer perceptron binary classifiers on variations of this dataset. Results: The models trained on 6485 KEGG compounds and 502 pathways scored an overall mean Matthews correlation coefficient (MCC) performance of 0.847, a median MCC of 0.848, and a standard deviation of 0.0098. Conclusions: This performance on all 502 KEGG pathways represents a roughly 6% improvement over the performance of models trained on only the 184 KEGG metabolic pathways, which had a mean MCC of 0.800 and a standard deviation of 0.021. These results demonstrate the capability to effectively predict biochemical pathways in general, in addition to those specifically related to metabolism. Moreover, the improvement in the performance demonstrates additional transfer learning with the inclusion of non-metabolic pathways.
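The dataset layout described in the Methods above (compound features combined with pathway features, followed by a binary association label) can be sketched as a cross product of compounds and pathways. The example below is a hedged illustration, not the authors' pipeline: the identifiers, the two-element feature vectors, and the function name `build_pair_dataset` are hypothetical placeholders.

```python
from itertools import product


def build_pair_dataset(compound_features, pathway_features, known_pairs):
    """Build one row per (compound, pathway) pair.

    Each row holds the compound features concatenated with the pathway
    features, plus a binary label: 1 if the pair is annotated, else 0.
    """
    rows = []
    for (cid, cfeat), (pid, pfeat) in product(
        compound_features.items(), pathway_features.items()
    ):
        label = 1 if (cid, pid) in known_pairs else 0
        rows.append((cid, pid, list(cfeat) + list(pfeat), label))
    return rows


# Hypothetical toy inputs; real features would be derived from chemical structure.
compounds = {"cpd_A": [1.0, 0.0], "cpd_B": [0.0, 1.0]}
pathways = {"path_X": [0.5, 0.5], "path_Y": [1.0, 0.0], "path_Z": [0.0, 0.0]}
annotations = {("cpd_A", "path_X"), ("cpd_B", "path_Z")}

dataset = build_pair_dataset(compounds, pathways, annotations)
for row in dataset:
    print(row)
```

Framing the task this way lets a single binary classifier cover every pathway, at the cost of a dataset whose size is the number of compounds multiplied by the number of pathways and in which negative pairs dominate.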
Affiliation(s)
- Erik D. Huckvale
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40536, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40536, USA
- Hunter N. B. Moseley
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40536, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40536, USA
  - Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, KY 40536, USA
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, KY 40536, USA
9
Xie X, Gui L, Qiao B, Wang G, Huang S, Zhao Y, Sun S. Deep learning in template-free de novo biosynthetic pathway design of natural products. Brief Bioinform 2024; 25:bbae495. [PMID: 39373052 PMCID: PMC11456888 DOI: 10.1093/bib/bbae495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2024] [Revised: 09/12/2024] [Accepted: 09/20/2024] [Indexed: 10/08/2024] Open
Abstract
Natural products (NPs) are indispensable in drug development, particularly in combating infections, cancer, and neurodegenerative diseases. However, their limited availability poses significant challenges. Template-free de novo biosynthetic pathway design provides a strategic solution for NP production, with deep learning standing out as a powerful tool in this domain. This review delves into state-of-the-art deep learning algorithms in NP biosynthesis pathway design. It provides an in-depth discussion of databases like Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and UniProt, which are essential for model training, along with chemical databases such as Reaxys, SciFinder, and PubChem for transfer learning to expand models' understanding of the broader chemical space. It evaluates the potential and challenges of sequence-to-sequence and graph-to-graph translation models for accurate single-step prediction. Additionally, it discusses search algorithms for multistep prediction and deep learning algorithms for predicting enzyme function. The review also highlights the pivotal role of deep learning in improving catalytic efficiency through enzyme engineering, which is essential for enhancing NP production. Moreover, it examines the application of large language models in pathway design, enzyme discovery, and enzyme engineering. Finally, it addresses the challenges and prospects associated with template-free approaches, offering insights into potential advancements in NP biosynthesis pathway design.
Affiliation(s)
- Xueying Xie
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
  - College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
- Lin Gui
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
- Baixue Qiao
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
  - College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
- Guohua Wang
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
- Shan Huang
  - Department of Neurology, The Second Affiliated Hospital, Harbin Medical University, No. 246 Xuefu Road, Nangang District, Harbin 150081, China
- Yuming Zhao
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
- Shanwen Sun
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
  - College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
10
Huckvale ED, Moseley HNB. Predicting the Association of Metabolites with Both Pathway Categories and Individual Pathways. Metabolites 2024; 14:510. [PMID: 39330517 PMCID: PMC11433779 DOI: 10.3390/metabo14090510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2024] [Revised: 09/04/2024] [Accepted: 09/20/2024] [Indexed: 09/28/2024] Open
Abstract
Metabolism is a network of chemical reactions that sustain cellular life. Parts of this metabolic network are defined as metabolic pathways containing specific biochemical reactions. Products and reactants of these reactions are called metabolites, which are associated with certain human-defined metabolic pathways. Metabolic knowledgebases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), contain metabolites, reactions, and pathway annotations; however, such resources are incomplete due to current limits of metabolic knowledge. To fill in missing metabolite pathway annotations, past machine learning models showed some success at predicting the KEGG Level 2 pathway category involvement of metabolites based on their chemical structure. Here, we present the first machine learning model to predict metabolite association to more granular KEGG Level 3 metabolic pathways. We used a feature and dataset engineering approach to generate over one million metabolite-pathway entries in the dataset used to train a single binary classifier. This approach produced a mean Matthews correlation coefficient (MCC) of 0.806 ± 0.017 SD across 100 cross-validation iterations. The 172 Level 3 pathways were predicted with an overall MCC of 0.726. Moreover, metabolite association with the 12 Level 2 pathway categories was predicted with an overall MCC of 0.891, representing significant transfer learning from the Level 3 pathway entries. These are the best metabolite pathway prediction results published so far in the field.
Affiliation(s)
- Erik D Huckvale
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40536, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40536, USA
- Hunter N B Moseley
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40536, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40536, USA
  - Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, KY 40536, USA
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, KY 40536, USA
11
Kundu P, Beura S, Mondal S, Das AK, Ghosh A. Machine learning for the advancement of genome-scale metabolic modeling. Biotechnol Adv 2024; 74:108400. [PMID: 38944218 DOI: 10.1016/j.biotechadv.2024.108400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 05/13/2024] [Accepted: 06/23/2024] [Indexed: 07/01/2024]
Abstract
Constraint-based modeling (CBM) has evolved as the core systems biology tool to map the interrelations between genotype, phenotype, and external environment. The recent advancement of high-throughput experimental approaches and multi-omics strategies has generated a plethora of new and precise information from wide-ranging biological domains. On the other hand, the continuously growing field of machine learning (ML) and its specialized branch of deep learning (DL) provide essential computational architectures for decoding complex and heterogeneous biological data. In recent years, both multi-omics and ML have assisted in the escalation of CBM. Condition-specific omics data, such as transcriptomics and proteomics, helped contextualize the model prediction while analyzing a particular phenotypic signature. At the same time, the advanced ML tools have eased the model reconstruction and analysis to increase the accuracy and prediction power. However, the development of these multi-disciplinary methodological frameworks mainly occurs independently, which limits the concatenation of biological knowledge from different domains. Hence, we have reviewed the potential of integrating multi-disciplinary tools and strategies from various fields, such as synthetic biology, CBM, omics, and ML, to explore the biochemical phenomenon beyond the conventional biological dogma. How the integrative knowledge of these intersected domains has improved bioengineering and biomedical applications has also been highlighted. We categorically explained the conventional genome-scale metabolic model (GEM) reconstruction tools and their improvement strategies through ML paradigms. Further, the crucial role of ML and DL in omics data restructuring for GEM development has also been briefly discussed. Finally, the case-study-based assessment of the state-of-the-art method for improving biomedical and metabolic engineering strategies has been elaborated. 
Therefore, this review demonstrates how integrating experimental and in silico strategies can help map the ever-expanding knowledge of biological systems driven by condition-specific cellular information. This multiview approach will elevate the application of ML-based CBM in the biomedical and bioengineering fields for the betterment of society and the environment.
Affiliation(s)
- Pritam Kundu
  - School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Satyajit Beura
  - Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Suman Mondal
  - P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Amit Kumar Das
  - Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Amit Ghosh
  - School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India; P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India.

12
Huckvale ED, Moseley HN. Predicting the Pathway Involvement of Metabolites in Both Pathway Categories and Individual Pathways. bioRxiv 2024:2024.08.07.607025. [PMID: 39149299 PMCID: PMC11326255 DOI: 10.1101/2024.08.07.607025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024]
Abstract
Metabolism is the network of chemical reactions that sustain cellular life. Parts of this metabolic network are defined as metabolic pathways containing specific biochemical reactions. Products and reactants of these reactions are called metabolites, which are associated with certain human-defined metabolic pathways. Metabolic knowledgebases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), contain metabolites, reactions, and pathway annotations; however, such resources are incomplete due to current limits of metabolic knowledge. To fill in missing metabolite pathway annotations, past machine learning models showed some success at predicting KEGG Level 2 pathway category involvement of metabolites based on their chemical structure. Here, we present the first machine learning model to predict metabolite association to more granular KEGG Level 3 metabolic pathways. We used a feature and dataset engineering approach to generate over one million metabolite-pathway entries in the dataset used to train a single binary classifier. This approach produced a mean Matthews correlation coefficient (MCC) of 0.806 ± 0.017 SD across 100 cross-validation iterations. The 172 Level 3 pathways were predicted with an overall MCC of 0.726. Moreover, metabolite association with the 12 Level 2 pathway categories was predicted with an overall MCC of 0.891, representing significant transfer learning from the Level 3 pathway entries. These are the best metabolite-pathway prediction results published so far in the field.
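The Matthews correlation coefficient (MCC) reported in this abstract can be computed from the four confusion-matrix counts of a binary classifier. The following is a minimal sketch of that calculation, not the authors' code; the function names and the toy labels are illustrative assumptions, and returning 0.0 for an undefined denominator is one common convention.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

# Toy example (hypothetical): 1 = metabolite belongs to the pathway, 0 = it does not.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
mcc = matthews_corrcoef(*confusion_counts(y_true, y_pred))
print(round(mcc, 4))
```

MCC uses all four cells of the confusion matrix, which is why it is often preferred over accuracy when, as with metabolite-pathway pairs, negative entries greatly outnumber positive ones.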
Affiliation(s)
- Erik D. Huckvale
  - Markey Cancer Center, University of Kentucky, Lexington, KY, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY, USA
- Hunter N.B. Moseley
  - Markey Cancer Center, University of Kentucky, Lexington, KY, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY, USA
  - Department of Computer Science (Data Science Program), University of Kentucky, Lexington, KY, USA
  - Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY, USA
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, KY, USA
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA

13
Tamang JP, Kharnaior P, Halami PM. Lactic acid bacteria in some Indian fermented foods and their predictive functional profiles. Braz J Microbiol 2024; 55:1745-1751. [PMID: 38337126 PMCID: PMC11153396 DOI: 10.1007/s42770-024-01251-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2021] [Accepted: 01/04/2024] [Indexed: 02/12/2024] Open
Abstract
Lactic acid bacteria (LAB) were isolated from naturally fermented foods of India, viz., sidra, a dried fish product; kinema, a naturally fermented sticky soybean food; and dahi, a naturally fermented milk product. Five strains of LAB were identified based on 16S rRNA gene sequences: Lactococcus lactis FS2 (from sidra), Lc. lactis C2D (dahi), Lc. lactis SP2C4 (kinema), Lactiplantibacillus plantarum DHCU70 (=Lactobacillus plantarum) (from dahi), and Lactiplantibacillus plantarum KP1 (kinema). PICRUSt2, a bioinformatic tool, was applied to the raw sequences obtained from the LAB strains, mapped against the KEGG database, to infer predictive functionality. Functional features of the LAB strains showed genes associated with metabolism (36.47%), environmental information processing (31.42%), genetic information processing (9.83%), and unclassified functions (22.28%). The KEGG database also showed abundant predicted genes related to membrane transport (29.25%) and carbohydrate metabolism (11.91%). This study may help in understanding the health-promoting benefits of the culturable LAB strains in fermented foods.
Affiliation(s)
- Jyoti Prakash Tamang
  - Department of Microbiology, School of Life Sciences, Sikkim University, Science Building, Dara Goan, Tadong, Gangtok, Sikkim, 737102, India.
- Pynhunlang Kharnaior
  - Department of Microbiology, School of Life Sciences, Sikkim University, Science Building, Dara Goan, Tadong, Gangtok, Sikkim, 737102, India
- Prakash M Halami
  - CSIR-Central Food Technological Research Institute, Microbiology and Fermentation Technology, Mysuru, Karnataka, 570020, India

14
Huckvale ED, Moseley HNB. A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement. PLoS One 2024; 19:e0299583. [PMID: 38696410 PMCID: PMC11065254 DOI: 10.1371/journal.pone.0299583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 02/13/2024] [Indexed: 05/04/2024] Open
Abstract
The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
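The leakage mechanism described in this abstract can be illustrated with a small sketch: when identical entries land in both the training and test folds, part of the test set has effectively been seen during training. The code below is an assumption-laden toy, not the authors' analysis; the helper names, the SMILES strings, and the pathway labels are all hypothetical.

```python
import random

def kfold_indices(n_items, k, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def leaked_test_fraction(records, k=5, seed=0):
    """Fraction of test entries with an identical copy in the matching training set."""
    folds = kfold_indices(len(records), k, seed)
    leaked = 0
    for test_fold_number, test_fold in enumerate(folds):
        # Every record outside the current test fold is training data.
        train_records = {
            records[i]
            for fold_number, fold in enumerate(folds)
            if fold_number != test_fold_number
            for i in fold
        }
        leaked += sum(1 for i in test_fold if records[i] in train_records)
    return leaked / len(records)

def deduplicate(records):
    """Drop exact duplicates, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(records))

# Toy data (hypothetical): (SMILES string, pathway label) pairs with repeats.
records = [("CCO", "pathway_A"), ("CC(=O)O", "pathway_B"), ("CCO", "pathway_A"),
           ("C1=CC=CC=C1", "pathway_C"), ("CC(=O)O", "pathway_B"),
           ("NCC(=O)O", "pathway_D"), ("CCO", "pathway_A"), ("OCC(O)CO", "pathway_A")]

print("leak before de-duplication:", leaked_test_fraction(records, k=4))
print("leak after de-duplication: ", leaked_test_fraction(deduplicate(records), k=4))
```

De-duplicating before the folds are drawn guarantees a leak fraction of zero for exact copies; near-duplicates (for example, different SMILES strings for the same compound) would need canonicalization first, which this sketch does not attempt.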
Affiliation(s)
- Erik D. Huckvale
  - Markey Cancer Center, University of Kentucky, Lexington, Kentucky, United States of America
- Hunter N. B. Moseley
  - Markey Cancer Center, University of Kentucky, Lexington, Kentucky, United States of America
  - Superfund Research Center, University of Kentucky, Lexington, Kentucky, United States of America
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, Kentucky, United States of America
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, Kentucky, United States of America

15
Bai W, Li C, Li W, Wang H, Han X, Wang P, Wang L. Machine learning assists prediction of genes responsible for plant specialized metabolite biosynthesis by integrating multi-omics data. BMC Genomics 2024; 25:418. [PMID: 38679745 PMCID: PMC11057162 DOI: 10.1186/s12864-024-10258-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 09/04/2023] [Accepted: 03/26/2024] [Indexed: 05/01/2024] Open
Abstract
BACKGROUND: Plant specialized (or secondary) metabolites (PSM), also known as phytochemicals, natural products, or plant constituents, play essential roles in interactions between plants and the environment. Although many research efforts have focused on discovering novel metabolites and their biosynthetic genes, the resolution of metabolic pathways and identified biosynthetic genes has been limited by rudimentary analysis approaches and the enormous number of candidate genes. RESULTS: Here we integrated the state-of-the-art automated machine learning (ML) framework AutoGluon-Tabular and multi-omics data from Arabidopsis to predict genes encoding enzymes involved in the biosynthesis of PSM, focusing on the three main PSM categories: terpenoids, alkaloids, and phenolics. We found that features related to genomics and proteomics were the top two categories of features contributing to model performance. Using only these key features, we built a new model in Arabidopsis, which performed better than models built with more features, including those related to transcriptomics and epigenomics. Finally, the built models were validated in maize and tomato, and models tested for maize and trained with data from two other species exhibited either equivalent or superior performance to intraspecies predictions. CONCLUSIONS: Our external validation results in grape and poppy on the one hand implied the applicability of our model to other species, and on the other hand showed enormous potential to improve the prediction of enzymes synthesizing PSM with the inclusion of valid data from a wider range of species.
Affiliation(s)
- Wenhui Bai
  - College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Taiyuan, 030024, China
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, China, 518000, Shenzhen
- Cheng Li
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, China, 518000, Shenzhen
- Wei Li
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, China, 518000, Shenzhen
- Hai Wang
  - National Maize Improvement Center, Key Laboratory of Crop Heterosis and Utilization, Joint Laboratory for International Cooperation in Crop Molecular Breeding, China Agricultural University, Beijing, 100193, China
- Xiaohong Han
  - College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Taiyuan, 030024, China.
- Peipei Wang
  - Kunpeng Institute of Modern Agriculture at Foshan, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518124, China.
- Li Wang
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, China, 518000, Shenzhen.

16
Joe H, Kim HG. Multi-label classification with XGBoost for metabolic pathway prediction. BMC Bioinformatics 2024; 25:52. [PMID: 38297220 PMCID: PMC10832249 DOI: 10.1186/s12859-024-05666-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 07/14/2023] [Accepted: 01/22/2024] [Indexed: 02/02/2024] Open
Abstract
BACKGROUND: Metabolic pathway prediction is one possible approach to address the problem in systems biology of reconstructing an organism's metabolic network from its genome sequence. Recently there have been developments in machine learning-based pathway prediction methods that conclude that machine learning-based approaches are similar in performance to the most used method, PathoLogic, which is a rule-based method. One issue is that previous studies evaluated PathoLogic without taxonomic pruning, which decreases its performance. RESULTS: In this study, we update the evaluation results from previous studies to demonstrate that PathoLogic with taxonomic pruning outperforms previous machine learning-based approaches and that further improvements in performance need to be made for them to be competitive. Furthermore, we introduce mlXGPR, an XGBoost-based metabolic pathway prediction method based on the multi-label classification pathway prediction framework introduced by mlLGPR. We also improve on this multi-label framework by utilizing correlations between labels using classifier chains. We propose a ranking method that determines the order of the chain so that lower performing classifiers are placed later in the chain to utilize the correlations between labels more. We evaluate mlXGPR with and without classifier chains on single-organism and multi-organism benchmarks. Our results indicate that mlXGPR outperforms other previous pathway prediction methods, including PathoLogic with taxonomic pruning, in terms of Hamming loss, precision and F1 score on single-organism benchmarks. CONCLUSIONS: The results from our study indicate that the performance of machine learning-based pathway prediction methods can be substantially improved and can even outperform PathoLogic with taxonomic pruning.
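The classifier-chain idea in this abstract (each label's classifier also sees the predictions made for earlier labels, with weaker classifiers placed later) can be sketched as follows. This is an illustrative toy and not the mlXGPR implementation: the per-label scores, the pathway names, and the plain Python callables standing in for trained XGBoost models are all assumptions, as is the choice of a per-label validation score as the ranking criterion.

```python
def chain_order(label_scores):
    """Order labels so higher-scoring classifiers come first and weaker ones later."""
    return sorted(label_scores, key=lambda label: label_scores[label], reverse=True)

def predict_chain(features, order, classifiers):
    """Predict labels in chain order, appending earlier predictions to the features."""
    predictions = {}
    for label in order:
        earlier = [predictions[done] for done in order if done in predictions]
        predictions[label] = classifiers[label](list(features) + earlier)
    return predictions

# Hypothetical per-label validation scores (for example, F1 of independent classifiers).
label_scores = {"glycolysis": 0.9, "tca_cycle": 0.5, "pentose_phosphate": 0.7}
order = chain_order(label_scores)

# Stand-in classifiers (assumed): each maps a feature list to 0 or 1.
classifiers = {
    # First in the chain: sees only the original feature.
    "glycolysis": lambda f: int(f[0] > 0.5),
    # Second: copies the prediction made for the first label.
    "pentose_phosphate": lambda f: f[-1],
    # Last and weakest: positive only if both earlier labels were positive.
    "tca_cycle": lambda f: int(f[-1] == 1 and f[-2] == 1),
}

print(order)
print(predict_chain([0.8], order, classifiers))
```

Placing the weakest classifier last means it receives the most label context from stronger predecessors, which is the motivation the abstract gives for the ranking step.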
Affiliation(s)
- Hyunwhan Joe
  - Biomedical Knowledge Engineering Lab., Seoul National University, Seoul, Republic of Korea
- Hong-Gee Kim
  - Biomedical Knowledge Engineering Lab., Seoul National University, Seoul, Republic of Korea.
  - School of Dentistry and Dental Research Institute, Seoul National University, Seoul, Republic of Korea.

17
Moseley H. In the AI science boom, beware: your results are only as good as your data. Nature 2024:10.1038/d41586-024-00306-2. [PMID: 38302705 DOI: 10.1038/d41586-024-00306-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2024]

18
Liu Y, Jiang Y, Zhang F, Yang Y. A Novel Multi-Scale Graph Neural Network for Metabolic Pathway Prediction. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:178-187. [PMID: 38127612 DOI: 10.1109/tcbb.2023.3345647] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2023]
Abstract
Predicting the metabolic pathway classes of compounds in the human body is an important problem in drug research and development. For this purpose, we propose a Multi-Scale Graph Neural Network framework, named MSGNN. The framework includes a subgraph encoder, a feature encoder, and a global feature processor, and adopts a graph augmentation strategy. The subgraph encoder extracts the local structural features of the compound, the feature encoder learns the characteristics of the atoms, and the global feature processor processes the information from the pre-training model and the two molecular fingerprints, while the graph augmentation strategy expands the training set through a scientific and reasonable method. The experimental results show that the accuracy, precision, recall and F1 metrics of MSGNN reach 98.17%, 94.18%, 94.43% and 94.30%, respectively, which is superior to comparable models known to us. In addition, the ablation experiment demonstrates the indispensability of the MSGNN modules.
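As a quick consistency check on the figures in this abstract, the F1 score can be recomputed from the reported precision and recall. This sketch assumes the reported F1 is the plain harmonic mean of those two values; the abstract does not state the averaging scheme, so the function name and that assumption are illustrative only.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values quoted in the abstract, expressed as fractions.
reported_precision = 0.9418
reported_recall = 0.9443
f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"{f1:.4f}")
```

Under that assumption the result rounds to about 0.943, in line with the 94.30% stated in the abstract.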

19
Huckvale ED, Powell CD, Jin H, Moseley HNB. Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites. Metabolites 2023; 13:1120. [PMID: 37999216 PMCID: PMC10673125 DOI: 10.3390/metabo13111120] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/10/2023] [Revised: 10/25/2023] [Accepted: 10/30/2023] [Indexed: 11/25/2023] Open
Abstract
Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.
Affiliation(s)
- Erik D. Huckvale
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA
- Christian D. Powell
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA
  - Department of Computer Science (Data Science Program), University of Kentucky, Lexington, KY 40506, USA
- Huan Jin
  - Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA
- Hunter N. B. Moseley
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA
  - Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, KY 40506, USA
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, KY 40506, USA

20
Ryu G, Kim GB, Yu T, Lee SY. Deep learning for metabolic pathway design. Metab Eng 2023; 80:130-141. [PMID: 37734652 DOI: 10.1016/j.ymben.2023.09.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2023] [Revised: 09/17/2023] [Accepted: 09/19/2023] [Indexed: 09/23/2023]
Abstract
The establishment of a bio-based circular economy is imperative in tackling the climate crisis and advancing sustainable development. In this realm, the creation of microbial cell factories is central to generating a variety of chemicals and materials. The design of metabolic pathways is crucial in shaping these microbial cell factories, especially when it comes to producing chemicals with yet-to-be-discovered biosynthetic routes. To aid in navigating the complexities of chemical and metabolic domains, computer-supported tools for metabolic pathway design have emerged. In this paper, we evaluate how digital strategies can be employed for pathway prediction and enzyme discovery. Additionally, we touch upon the recent strides made in using deep learning techniques for metabolic pathway prediction. These computational tools and strategies streamline the design of metabolic pathways, facilitating the development of microbial cell factories. Leveraging the capabilities of deep learning in metabolic pathway design is profoundly promising, potentially hastening the advent of a bio-based circular economy.
Affiliation(s)
- Gahyeon Ryu
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, 34141, Republic of Korea
- Gi Bae Kim
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, 34141, Republic of Korea
- Taeho Yu
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, 34141, Republic of Korea
- Sang Yup Lee
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, 34141, Republic of Korea; BioProcess Engineering Research Center and BioInformatics Research Center, KAIST, Daejeon, 34141, Republic of Korea; Graduate School of Engineering Biology, KAIST, Daejeon, 34141, Republic of Korea.

21
Huckvale ED, Powell CD, Jin H, Moseley HN. Benchmark dataset for training machine learning models to predict the pathway involvement of metabolites. bioRxiv 2023:2023.10.03.560715. [PMID: 37873272 PMCID: PMC10592640 DOI: 10.1101/2023.10.03.560715] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 10/25/2023]
Abstract
Metabolic pathways are a human-defined grouping of life sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from the KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1-score of 0.8180 and Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.
Affiliation(s)
- Erik D. Huckvale
  - Department of Computer Science (Data Science Program), University of Kentucky, Lexington, KY 40506, USA
- Christian D. Powell
  - Department of Computer Science (Data Science Program), University of Kentucky, Lexington, KY 40506, USA
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA
- Huan Jin
  - Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA
- Hunter N.B. Moseley
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, KY 40506, USA
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, KY 40506, USA

22
Liu X, Yang H, Ai C, Ding Y, Guo F, Tang J. MVML-MPI: Multi-View Multi-Label Learning for Metabolic Pathway Inference. Brief Bioinform 2023; 24:bbad393. [PMID: 37930024 DOI: 10.1093/bib/bbad393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 09/20/2023] [Accepted: 10/11/2023] [Indexed: 11/07/2023] Open
Abstract
Development of robust and effective strategies for synthesizing new compounds, drug targeting and constructing GEnome-scale Metabolic models (GEMs) requires a deep understanding of the underlying biological processes. A critical step in achieving this goal is accurately identifying the categories of pathways in which a compound participates. However, current machine learning-based methods often overlook the multifaceted nature of compounds, resulting in inaccurate pathway predictions. Therefore, we present a novel framework for Multi-View Multi-Label Learning for Metabolic Pathway Inference, named MVML-MPI. First, MVML-MPI learns distinct compound representations in parallel with corresponding compound encoders to fully extract features. Subsequently, we propose an attention-based fusion module to complement these multi-view representations. As a result, MVML-MPI accurately represents and effectively captures the complex relationship between compounds and metabolic pathways and distinguishes itself from current machine learning-based methods. In experiments conducted on the Kyoto Encyclopedia of Genes and Genomes pathways dataset, MVML-MPI outperformed state-of-the-art methods, demonstrating its potential for use in metabolic pathway design, where it can aid in optimizing drug-like compounds and facilitate the development of GEMs. The code and data underlying this article are freely available at https://github.com/guofei-tju/MVML-MPI. Contact: jtang@cse.sc.edu, guofei@csu.edu.com or wuxi_dyj@csj.uestc.edu.cn.
Affiliation(s)
- Xiaoyi Liu
  - Computer Science and Engineering, University of South Carolina, Columbia 29208, USA
- Hongpeng Yang
  - Computer Science and Engineering, University of South Carolina, Columbia 29208, USA
- Chengwei Ai
  - Computer Science and Engineering, Central South University, Changsha 410083, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Fei Guo
  - Computer Science and Engineering, Central South University, Changsha 410083, China
- Jijun Tang
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Nanshan 518055, China

23
Bao H, Zhao J, Zhao X, Zhao C, Lu X, Xu G. Prediction of plant secondary metabolic pathways using deep transfer learning. BMC Bioinformatics 2023; 24:348. [PMID: 37726702 PMCID: PMC10507959 DOI: 10.1186/s12859-023-05485-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/20/2023] [Accepted: 09/14/2023] [Indexed: 09/21/2023] Open
Abstract
BACKGROUND: Plant secondary metabolites are highly valued for their applications in pharmaceuticals, nutrition, flavors, and aesthetics. It is of great importance to elucidate plant secondary metabolic pathways due to their crucial roles in biological processes during plant growth and development. However, understanding plant biosynthesis and degradation pathways remains a challenge due to the lack of sufficient information in current databases. To address this issue, we proposed a transfer learning approach using a pre-trained hybrid deep learning architecture that combines Graph Transformer and convolutional neural network (GTC) to predict plant metabolic pathways. RESULTS: GTC provides comprehensive molecular representation by extracting both structural features from the molecular graph and textual information from the SMILES string. GTC is pre-trained on the KEGG datasets to acquire general features, followed by fine-tuning on plant-derived datasets. Four metrics were chosen for model performance evaluation. The results show that GTC outperforms six other models, including three previously reported machine learning models, on the KEGG dataset. GTC yields an accuracy of 96.75%, precision of 85.14%, recall of 83.03%, and F1_score of 84.06%. Furthermore, an ablation study confirms the indispensability of all the components of the hybrid GTC model. Transfer learning is then employed to leverage the shared knowledge acquired from the KEGG metabolic pathways. As a result, the transferred GTC exhibits outstanding accuracy in predicting plant secondary metabolic pathways, with an average accuracy of 98.30% in fivefold cross-validation and 97.82% on the final test. In addition, GTC is employed to classify natural products. It achieves a perfect accuracy score of 100.00% for alkaloids and its lowest accuracy score, 98.42%, for shikimates and phenylpropanoids.
CONCLUSIONS: The proposed GTC effectively captures molecular features, and achieves high performance in classifying KEGG metabolic pathways and predicting plant secondary metabolic pathways via transfer learning. Furthermore, GTC demonstrates its generalization ability by accurately classifying natural products. A user-friendly executable program has been developed, which only requires the input of the SMILES string of the query compound in a graphical interface.
Affiliation(s)
- Han Bao
- CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, People's Republic of China
- University of Chinese Academy of Sciences, Beijing, 100049, People's Republic of China
- Liaoning Province Key Laboratory of Metabolomics, Dalian, 116023, People's Republic of China
| | - Jinhui Zhao
- CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, People's Republic of China
- University of Chinese Academy of Sciences, Beijing, 100049, People's Republic of China
- Liaoning Province Key Laboratory of Metabolomics, Dalian, 116023, People's Republic of China
| | - Xinjie Zhao
- CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, People's Republic of China
- University of Chinese Academy of Sciences, Beijing, 100049, People's Republic of China
- Liaoning Province Key Laboratory of Metabolomics, Dalian, 116023, People's Republic of China
| | - Chunxia Zhao
- CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, People's Republic of China
- University of Chinese Academy of Sciences, Beijing, 100049, People's Republic of China
- Liaoning Province Key Laboratory of Metabolomics, Dalian, 116023, People's Republic of China
| | - Xin Lu
- CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, People's Republic of China.
- University of Chinese Academy of Sciences, Beijing, 100049, People's Republic of China.
- Liaoning Province Key Laboratory of Metabolomics, Dalian, 116023, People's Republic of China.
| | - Guowang Xu
- CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, People's Republic of China.
- University of Chinese Academy of Sciences, Beijing, 100049, People's Republic of China.
- Liaoning Province Key Laboratory of Metabolomics, Dalian, 116023, People's Republic of China.
| |
|
24
|
Bi X, Cheng Y, Xu X, Lv X, Liu Y, Li J, Du G, Chen J, Ledesma-Amaro R, Liu L. etiBsu1209: A comprehensive multiscale metabolic model for Bacillus subtilis. Biotechnol Bioeng 2023; 120:1623-1639. [PMID: 36788025 DOI: 10.1002/bit.28355] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/13/2022] [Revised: 12/08/2022] [Accepted: 02/13/2023] [Indexed: 02/16/2023]
Abstract
Genome-scale metabolic models (GEMs) have been widely used to guide the computational design of microbial cell factories, and to date, seven GEMs have been reported for Bacillus subtilis, a model gram-positive microorganism widely used in bioproduction of functional nutraceuticals and food ingredients. However, none of them are widely used because they often lead to erroneous predictions due to their low predictive power and lack of information on regulatory mechanisms. In this work, we constructed a new version of GEM for B. subtilis (iBsu1209), which contains 1209 genes, 1595 metabolites, and 1948 reactions. We applied machine learning to fill gaps, which formed a relatively complete metabolic network able to predict with high accuracy (89.3%) the growth of 1209 mutants under 12 different culture conditions. In addition, we developed a visualization and code-free software, Model Tool, for multiconstraints model reconstruction and analysis. We used this software to construct etiBsu1209, a multiscale model that integrates enzymatic constraints, thermodynamic constraints, and transcriptional regulatory networks. Furthermore, we used etiBsu1209 to guide a metabolic engineering strategy (knocking out fabI and yfkN genes) for the overproduction of nutraceutical menaquinone-7, and the titer increased to 153.94 mg/L, 2.2-times that of the parental strain. To the best of our knowledge, etiBsu1209 is the first comprehensive multiscale model for B. subtilis and can serve as a solid basis for rational computational design of B. subtilis cell factories for bioproduction.
Affiliation(s)
- Xinyu Bi
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi, China; Science Center for Future Foods, Ministry of Education, Jiangnan University, Wuxi, China
| | - Yang Cheng
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi, China; Science Center for Future Foods, Ministry of Education, Jiangnan University, Wuxi, China
| | - Xianhao Xu
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi, China; Science Center for Future Foods, Ministry of Education, Jiangnan University, Wuxi, China
| | - Xueqin Lv
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi, China; Science Center for Future Foods, Ministry of Education, Jiangnan University, Wuxi, China
| | - Yanfeng Liu
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi, China; Science Center for Future Foods, Ministry of Education, Jiangnan University, Wuxi, China
| | - Jianghua Li
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi, China; Science Center for Future Foods, Ministry of Education, Jiangnan University, Wuxi, China
| | - Guocheng Du
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi, China; Science Center for Future Foods, Ministry of Education, Jiangnan University, Wuxi, China
| | - Jian Chen
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi, China; Science Center for Future Foods, Ministry of Education, Jiangnan University, Wuxi, China
| | | | - Long Liu
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi, China; Science Center for Future Foods, Ministry of Education, Jiangnan University, Wuxi, China
| |
|
25
|
Chou WC, Lin Z. Machine learning and artificial intelligence in physiologically based pharmacokinetic modeling. Toxicol Sci 2023; 191:1-14. [PMID: 36156156 PMCID: PMC9887681 DOI: 10.1093/toxsci/kfac101] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Indexed: 02/03/2023] Open
Abstract
Physiologically based pharmacokinetic (PBPK) models are useful tools in drug development and risk assessment of environmental chemicals. PBPK model development requires the collection of species-specific physiological parameters and chemical-specific absorption, distribution, metabolism, and excretion (ADME) parameters, which can be a time-consuming and expensive process. This raises a need to create computational models capable of predicting input parameter values for PBPK models, especially for new compounds. In this review, we summarize an emerging paradigm for integrating PBPK modeling with machine learning (ML) or artificial intelligence (AI)-based computational methods. This paradigm includes 3 steps: (1) obtain time-concentration PK data and/or ADME parameters from publicly available databases, (2) develop ML/AI-based approaches to predict ADME parameters, and (3) incorporate the ML/AI models into PBPK models to predict PK summary statistics (eg, area under the curve and maximum plasma concentration). We also discuss a neural network architecture, the "neural ordinary differential equation (Neural-ODE)", that is capable of providing better predictive capabilities than other ML methods when used to directly predict time-series PK profiles. In order to support applications of ML/AI methods for PBPK model development, several challenges should be addressed: (1) as more data become available, it is important to expand the training set by including the structural diversity of compounds to improve the prediction accuracy of ML/AI models; (2) due to the black box nature of many ML models, lack of sufficient interpretability is a limitation; (3) Neural-ODE has great potential to be used to generate time-series PK profiles for new compounds with limited ADME information, but its application remains to be explored. Despite existing challenges, ML/AI approaches will continue to facilitate the efficient development of robust PBPK models for a large number of chemicals.
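Step (3) of the paradigm above, feeding predicted ADME parameters into a kinetic model to obtain PK summary statistics, can be illustrated with a minimal sketch. It assumes a one-compartment model with first-order absorption and elimination, which is far simpler than a real PBPK model; the function names and all parameter values are hypothetical stand-ins for ML-predicted inputs, not values from the review.

```python
import math

def concentration(t, dose, ka, ke, vd, f=1.0):
    """Plasma concentration at time t for a one-compartment oral-dose model.

    ka: absorption rate constant, ke: elimination rate constant,
    vd: volume of distribution, f: bioavailable fraction.
    """
    if math.isclose(ka, ke):
        raise ValueError("this closed form requires ka != ke")
    scale = (f * dose * ka) / (vd * (ka - ke))
    return scale * (math.exp(-ke * t) - math.exp(-ka * t))

def pk_summary(dose, ka, ke, vd, f=1.0):
    """Return time of peak, peak concentration (Cmax), and AUC from 0 to infinity."""
    tmax = math.log(ka / ke) / (ka - ke)
    cmax = concentration(tmax, dose, ka, ke, vd, f)
    auc = f * dose / (vd * ke)  # AUC = F * dose / clearance, clearance = ke * vd
    return {"tmax": tmax, "cmax": cmax, "auc": auc}

# Hypothetical values standing in for ML-predicted ADME parameters.
predicted = {"ka": 1.2, "ke": 0.15, "vd": 40.0}
summary = pk_summary(dose=100.0, **predicted)
```

In this toy setting an error in a predicted parameter propagates directly into the summary statistics (for example, AUC is inversely proportional to `ke`), which is one reason the prediction accuracy of the upstream ML model matters.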
Affiliation(s)
- Wei-Chun Chou
- Department of Environmental and Global Health, College of Public Health and Health Professions, University of Florida, Gainesville, FL 32610, USA
- Center for Environmental and Human Toxicology, University of Florida, Gainesville, FL 32608, USA
| | - Zhoumeng Lin
- Department of Environmental and Global Health, College of Public Health and Health Professions, University of Florida, Gainesville, FL 32610, USA
- Center for Environmental and Human Toxicology, University of Florida, Gainesville, FL 32608, USA
| |
|
26
|
Lim PK, Julca I, Mutwil M. Redesigning plant specialized metabolism with supervised machine learning using publicly available reactome data. Comput Struct Biotechnol J 2023; 21:1639-1650. [PMID: 36874159 PMCID: PMC9976193 DOI: 10.1016/j.csbj.2023.01.013] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/27/2022] [Revised: 01/12/2023] [Accepted: 01/12/2023] [Indexed: 01/19/2023] Open
Abstract
The immense structural diversity of products and intermediates of plant specialized metabolism (specialized metabolites) makes them rich sources of therapeutic medicine, nutrients, and other useful materials. With the rapid accumulation of reactome data that can be accessible on biological and chemical databases, along with recent advances in machine learning, this review sets out to outline how supervised machine learning can be used to design new compounds and pathways by exploiting the wealth of said data. We will first examine the various sources from which reactome data can be obtained, followed by explaining the different machine learning encoding methods for reactome data. We then discuss current supervised machine learning developments that can be employed in various aspects to help redesign plant specialized metabolism.
Affiliation(s)
- Peng Ken Lim
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| | - Irene Julca
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| | - Marek Mutwil
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| |
|
27
|
Patra P, B R D, Kundu P, Das M, Ghosh A. Recent advances in machine learning applications in metabolic engineering. Biotechnol Adv 2023; 62:108069. [PMID: 36442697 DOI: 10.1016/j.biotechadv.2022.108069] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 07/17/2022] [Revised: 10/18/2022] [Accepted: 11/22/2022] [Indexed: 11/27/2022]
Abstract
Metabolic engineering encompasses several widely-used strategies, which currently hold a high seat in the field of biotechnology when its potential is manifesting through a plethora of research and commercial products with a strong societal impact. The genomic revolution that occurred almost three decades ago has initiated the generation of large omics-datasets which has helped in gaining a better understanding of cellular behavior. The itinerary of metabolic engineering that has occurred based on these large datasets has allowed researchers to gain detailed insights and a reasonable understanding of the intricacies of biosystems. However, the existing trial-and-error approaches for metabolic engineering are laborious and time-intensive when it comes to the production of target compounds with high yields through genetic manipulations in host organisms. Machine learning (ML) coupled with the available metabolic engineering test instances and omics data brings a comprehensive and multidisciplinary approach that enables scientists to evaluate various parameters for effective strain design. This vast amount of biological data should be standardized through knowledge engineering to train different ML models for providing accurate predictions in gene circuit design, modification of proteins, optimization of bioprocess parameters for scaling up, and screening of hyper-producing robust cell factories. This review briefs on the premise of ML, followed by mentioning various ML methods and algorithms alongside the numerous omics datasets available to train ML models for predicting metabolic outcomes with high accuracy. The combinative interplay between the ML algorithms and biological datasets through knowledge engineering has guided the recent advancements in applications such as CRISPR/Cas systems, gene circuits, protein engineering, metabolic pathway reconstruction, and bioprocess engineering.
Finally, this review addresses the probable challenges of applying ML in metabolic engineering which will guide the researchers toward novel techniques to overcome the limitations.
Affiliation(s)
- Pradipta Patra
- School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
| | - Disha B R
- B.M.S College of Engineering, Basavanagudi, Bengaluru, Karnataka 560019, India
| | - Pritam Kundu
- School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
| | - Manali Das
- School of Bioscience, Indian Institute of Technology Kharagpur, West Bengal 721302, India
| | - Amit Ghosh
- School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India; P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India.
| |
|
28
|
Bonetta Valentino R, Ebejer JP, Valentino G. Machine Learning Using Neural Networks for Metabolomic Pathway Analyses. Methods Mol Biol 2023; 2553:395-415. [PMID: 36227552 DOI: 10.1007/978-1-0716-2617-7_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Elucidating the mechanisms of metabolic pathways helps us understand the cascade of enzyme-catalyzed reactions that lead to the conversion of substances into final products. This has implications for predicting how newly synthesized compounds will affect a person's metabolism and, hence, the development of novel treatments to improve one's health. The study of metabolomic pathways, together with protein engineering, may also aid in the extraction, at scale, of natural products to be used as drugs and drug precursors. Several approaches have been used to correlate protein annotations to metabolic pathways in order to derive pathways directly related to specific organisms. These could range from association rule-mining techniques to machine learning methods such as decision trees, naïve Bayes, logistic regression, and ensemble methods. In this chapter, we will be reviewing the use of machine learning for metabolic pathway analyses, with a step-by-step focus on the use of deep learning to predict the association of compounds (metabolites) to their respective metabolomic pathway classes. This prediction could help explain interactions of small molecules in organisms. Inspired by the work of Baranwal et al. (2019), we demonstrate how to build and train a deep learning neural network model to perform a multi-label prediction. We considered two different types of fingerprints as features (inputs to the model). The output of the model is the set of metabolic pathway classes (from the KEGG dataset) in which the input molecule participates. We will walk through the various steps of this process, including data collection, feature engineering, model selection, training, and evaluation. This model-building and evaluation process may be easily transferred to other domains of interest. All the source code used in this chapter is made publicly available at https://github.com/jp-um/machine_learning_for_metabolomic_pathway_analyses.
Affiliation(s)
- Rosalin Bonetta Valentino
- Barts and the London School of Medicine and Dentistry, Queen Mary University of London, Victoria, Malta.
| | - Jean-Paul Ebejer
- Centre for Molecular Medicine and Biobanking, University of Malta, Msida, Malta
| | - Gianluca Valentino
- Department of Communications and Computer Engineering, University of Malta, Msida, Malta
| |
|
29
|
Li H, Wang D, Zhou X, Ding S, Guo W, Zhang S, Li Z, Huang T, Cai YD. Characterization of spleen and lymph node cell types via CITE-seq and machine learning methods. Front Mol Neurosci 2022; 15:1033159. [PMID: 36311013 PMCID: PMC9608858 DOI: 10.3389/fnmol.2022.1033159] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Accepted: 09/26/2022] [Indexed: 11/13/2022] Open
Abstract
The spleen and lymph nodes are important functional organs of the human immune system. Identifying the cell types of the spleen and lymph nodes is helpful for understanding the mechanisms of the immune system. However, the cell types of the spleen and lymph nodes are highly diverse in the human body. Therefore, in this study, we employed a series of machine learning algorithms to computationally analyze the cell types of the spleen and lymph nodes based on single-cell CITE-seq sequencing data. A total of 28,211 cells (training vs. test = 14,435 vs. 13,776) involving 24 cell types were collected for this study. The training dataset was analyzed by Boruta and then by minimum redundancy maximum relevance (mRMR), resulting in an mRMR feature list. This list was fed into the incremental feature selection (IFS) method, incorporating four classification algorithms (deep forest, random forest, K-nearest neighbor, and decision tree). Some essential features were discovered, and the deep forest with its optimal features achieved the best performance. A group of related proteins (CD4, TCRb, CD103, CD43, and CD23) and genes (Nkg7 and Thy1) contributing to the classification of spleen and lymph node cell types were analyzed. Furthermore, the classification rules yielded by the decision tree were also provided and analyzed. These findings may provide helpful information for deepening our understanding of the diversity of cell types.
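The incremental feature selection (IFS) step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the ranked list is supplied by hand rather than produced by Boruta and mRMR, a nearest-centroid classifier stands in for deep forest and the other classifiers, and a single held-out split replaces whatever evaluation scheme the study used. All names and data are hypothetical.

```python
def nearest_centroid_accuracy(train, test, feature_idx):
    """Accuracy of a nearest-centroid classifier restricted to the given features."""
    sums, counts = {}, {}
    for x, label in train:
        vec = [x[i] for i in feature_idx]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    correct = 0
    for x, label in test:
        vec = [x[i] for i in feature_idx]
        guess = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
        )
        correct += guess == label
    return correct / len(test)

def incremental_feature_selection(ranked_features, train, test):
    """Evaluate the top-k ranked features for k = 1..n and keep the best subset."""
    best_k, best_score = 0, -1.0
    for k in range(1, len(ranked_features) + 1):
        score = nearest_centroid_accuracy(train, test, ranked_features[:k])
        if score > best_score:  # ties keep the smaller subset
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score

# Toy data: feature 0 separates the classes, feature 1 is constant, feature 2 is noise.
train = [([0.0, 5.0, 9.0], "A"), ([1.0, 5.0, -9.0], "A"),
         ([4.0, 5.0, -8.0], "B"), ([5.0, 5.0, 10.0], "B")]
test = [([1.0, 5.0, 10.0], "A"), ([4.0, 5.0, -10.0], "B")]
selected, score = incremental_feature_selection([0, 1, 2], train, test)
```

On this toy data the search keeps only feature 0, because adding the noisy third feature degrades the held-out accuracy; breaking ties in favor of the smaller subset is a design choice of this sketch.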
Affiliation(s)
- Hao Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Deling Wang
- State Key Laboratory of Oncology in South China, Department of Radiology, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
| | - Xianchao Zhou
- Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Shijian Ding
- School of Life Sciences, Shanghai University, Shanghai, China
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Institutes for Biological Sciences (SIBS), Shanghai Jiao Tong University School of Medicine (SJTUSM), Chinese Academy of Sciences (CAS), Shanghai, China
| | - Shiqi Zhang
- Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
| | - Zhandong Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Tao Huang
- CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- *Correspondence: Tao Huang
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
- *Correspondence: Yu-Dong Cai
| |
|
30
|
Yang Z, Liu J, Shah HA, Feng J. A novel hybrid framework for metabolic pathways prediction based on the graph attention network. BMC Bioinformatics 2022; 23:329. [PMID: 36171550 PMCID: PMC9520805 DOI: 10.1186/s12859-022-04856-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/24/2022] [Accepted: 07/25/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Clarifying which metabolic pathways a drug compound is involved in can help researchers understand how the drug is absorbed, distributed, metabolized, and excreted. The characteristics of a compound, such as its structure and composition, directly determine the metabolic pathways it participates in. METHODS We developed a novel hybrid framework based on the graph attention network (GAT), named HFGAT, to predict the metabolic pathway classes that a compound is involved in by making use of its global and local characteristics. The framework mainly consists of a two-branch feature extracting layer and a fully connected (FC) layer. In the two-branch feature extracting layer, one branch is responsible for extracting global features of the compound, and the other branch introduces a GAT consisting of two graph attention layers to extract local structural features of the compound. Both the global and the local features of the compound are then integrated into the FC layer, which outputs the predicted metabolic pathway categories that the compound belongs to. RESULTS We compared the multi-class classification performance of HFGAT with six other representative methods, including five classic machine learning methods and one graph convolutional network (GCN) based deep learning method, on the benchmark dataset containing 6999 compounds belonging to 11 pathway categories. The results showed that the deep learning-based methods (HFGAT, GCN-based method) outperformed the traditional machine learning methods in the prediction of metabolic pathways and our proposed HFGAT method performed better than the GCN-based method. Moreover, HFGAT achieved higher [Formula: see text] scores on 8 of 11 classes than the GCN-based method. CONCLUSIONS Our proposed HFGAT makes use of both the global and local information of the compounds to predict their metabolic pathway categories and has achieved strong performance.
Compared with the GCN model, the introduction of the GAT can help our model pay more attention to substructures of the compound that are useful for the prediction task. The study provided a potential method for drug discovery with all types of metabolic reactions that may be involved in the decomposition and synthesis of pharmaceutical compounds in the organism.
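How a graph attention layer weights neighboring atoms can be sketched in a few lines. The code below is a single-head layer in the standard GAT formulation (score each neighbor with a LeakyReLU over a learned vector applied to the concatenated transformed features, then normalize with a softmax); it is an illustration under stated assumptions, not the HFGAT implementation, which stacks two such layers and adds a global-feature branch. The toy graph, weight matrix `W`, and attention vector `a` are hypothetical and hand-picked.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def graph_attention_layer(features, neighbors, W, a):
    """Single-head graph attention layer.

    features: node -> feature vector
    neighbors: node -> list of neighbor nodes (including the node itself)
    Returns (updated node features, attention weights per node).
    """
    transformed = {n: matvec(W, h) for n, h in features.items()}
    outputs, attention = {}, {}
    for i, nbrs in neighbors.items():
        # Unnormalized score for each neighbor j: LeakyReLU(a . [Wh_i || Wh_j])
        logits = [
            leaky_relu(sum(x * y for x, y in zip(a, transformed[i] + transformed[j])))
            for j in nbrs
        ]
        peak = max(logits)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        alphas = [e / total for e in exps]
        attention[i] = dict(zip(nbrs, alphas))
        dim = len(transformed[i])
        outputs[i] = [
            sum(al * transformed[j][d] for al, j in zip(alphas, nbrs))
            for d in range(dim)
        ]
    return outputs, attention

# Toy three-atom chain; node 2 carries a distinct feature (hypothetical encoding).
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 0.0, 1.0]  # rewards neighbors whose second feature is large
outputs, attention = graph_attention_layer(features, neighbors, W, a)
```

With this choice of `a`, node 1 assigns more weight to node 2 than to node 0, which is the kind of substructure-specific emphasis the abstract attributes to the GAT. Setting `a` to all zeros makes the weights uniform, so the layer reduces to plain neighbor averaging.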
Affiliation(s)
- Zhihui Yang
- School of Computer Science, Wuhan University, Luojia Hill Street, Wuhan, 430072, China
| | - Juan Liu
- School of Computer Science, Wuhan University, Luojia Hill Street, Wuhan, 430072, China; Institute of Artificial Intelligence, Wuhan University, Luojia Hill Street, Wuhan, 430072, China; National Engineering Research Center for Multimedia Software, Luojia Hill Street, Wuhan, 430072, China.
| | - Hayat Ali Shah
- School of Computer Science, Wuhan University, Luojia Hill Street, Wuhan, 430072, China
| | - Jing Feng
- School of Computer Science, Wuhan University, Luojia Hill Street, Wuhan, 430072, China; Institute of Artificial Intelligence, Wuhan University, Luojia Hill Street, Wuhan, 430072, China; National Engineering Research Center for Multimedia Software, Luojia Hill Street, Wuhan, 430072, China
| |
|
31
|
Baranwal M, Magner A, Saldinger J, Turali-Emre ES, Elvati P, Kozarekar S, VanEpps JS, Kotov NA, Violi A, Hero AO. Struct2Graph: a graph attention network for structure based predictions of protein-protein interactions. BMC Bioinformatics 2022; 23:370. [PMID: 36088285 PMCID: PMC9464414 DOI: 10.1186/s12859-022-04910-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 06/09/2022] [Accepted: 08/26/2022] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND Development of new methods for analysis of protein-protein interactions (PPIs) at molecular and nanometer scales gives insights into intracellular signaling pathways and will improve understanding of protein functions, as well as other nanoscale structures of biological and abiological origins. Recent advances in computational tools, particularly the ones involving modern deep learning algorithms, have been shown to complement experimental approaches for describing and rationalizing PPIs. However, most of the existing works on PPI predictions use protein-sequence information, and thus have difficulties in accounting for the three-dimensional organization of the protein chains. RESULTS In this study, we address this problem and describe a PPI analysis based on a graph attention network, named Struct2Graph, for identifying PPIs directly from the structural data of folded protein globules. Our method is capable of predicting the PPI with an accuracy of 98.89% on the balanced set consisting of an equal number of positive and negative pairs. On the unbalanced set with the ratio of 1:10 between positive and negative pairs, Struct2Graph achieves a fivefold cross validation average accuracy of 99.42%. Moreover, Struct2Graph can potentially identify residues that likely contribute to the formation of the protein-protein complex. The identification of important residues is tested for two different interaction types: (a) proteins with multiple ligands competing for the same binding area, (b) dynamic protein-protein adhesion interaction. Struct2Graph identifies interacting residues with 30% sensitivity, 89% specificity, and 87% accuracy. CONCLUSIONS In this manuscript, we address the problem of prediction of PPIs using a first-of-its-kind, 3D-structure-based graph attention network (code available at https://github.com/baranwa2/Struct2Graph).
Furthermore, the novel mutual attention mechanism provides insights into likely interaction sites through its unsupervised knowledge selection process. This study demonstrates that a relatively low-dimensional feature embedding learned from graph structures of individual proteins outperforms other modern machine learning classifiers based on global protein features. In addition, through the analysis of single amino acid variations, the attention mechanism shows preference for disease-causing residue variations over benign polymorphisms, demonstrating that it is not limited to interface residues.
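When reading accuracy figures on an unbalanced set such as the 1:10 positive-to-negative split mentioned above, it helps to compare them against a trivial baseline. The sketch below computes accuracy, sensitivity, and specificity from confusion-matrix counts; the counts are hypothetical and only encode the 1:10 ratio, they are not taken from the paper.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall on positives), and specificity (recall on negatives)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Hypothetical 1:10 set: 100 interacting pairs, 1000 non-interacting pairs.
# A classifier that always answers "no interaction" finds none of the positives.
always_negative = confusion_metrics(tp=0, fp=0, tn=1000, fn=100)
```

The always-negative baseline already reaches about 90.9% accuracy on such a set while its sensitivity is zero, so an accuracy reported on unbalanced data is best judged relative to that floor and alongside sensitivity and specificity.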
Affiliation(s)
- Mayank Baranwal
- Division of Data and Decision Sciences, Tata Consultancy Services Research, Mumbai, India
- Systems and Control Engineering Group, Indian Institute of Technology, Bombay, India
| | - Abram Magner
- Department of Computer Science, University of Albany, SUNY, Albany, USA
| | - Jacob Saldinger
- Department of Chemical Engineering, University of Michigan, Ann Arbor, USA
| | | | - Paolo Elvati
- Department of Mechanical Engineering, University of Michigan, Ann Arbor, USA
| | - Shivani Kozarekar
- Department of Chemical Engineering, University of Michigan, Ann Arbor, USA
| | - J. Scott VanEpps
- Department of Biomedical Engineering, University of Michigan, Ann Arbor, USA
- Department of Emergency Medicine, University of Michigan, Ann Arbor, USA
- Biointerfaces Institute, University of Michigan, Ann Arbor, USA
| | - Nicholas A. Kotov
- Department of Chemical Engineering, University of Michigan, Ann Arbor, USA
- Department of Biomedical Engineering, University of Michigan, Ann Arbor, USA
- Biointerfaces Institute, University of Michigan, Ann Arbor, USA
- Department of Materials Science and Engineering, University of Michigan, Ann Arbor, USA
| | - Angela Violi
- Department of Chemical Engineering, University of Michigan, Ann Arbor, USA
- Department of Mechanical Engineering, University of Michigan, Ann Arbor, USA
- Biophysics Program, University of Michigan, Ann Arbor, USA
| | - Alfred O. Hero
- Department of Biomedical Engineering, University of Michigan, Ann Arbor, USA
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA
- Department of Statistics, University of Michigan, Ann Arbor, USA
- Program in Applied Interdisciplinary Mathematics, University of Michigan, Ann Arbor, USA
- Program in Bioinformatics, University of Michigan, Ann Arbor, USA
| |
|
32
|
Li H, Huang F, Liao H, Li Z, Feng K, Huang T, Cai YD. Identification of COVID-19-Specific Immune Markers Using a Machine Learning Method. Front Mol Biosci 2022; 9:952626. [PMID: 35928229 PMCID: PMC9344575 DOI: 10.3389/fmolb.2022.952626] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 05/25/2022] [Accepted: 06/21/2022] [Indexed: 01/08/2023] Open
Abstract
Notably, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a tight relationship with the immune system. Human resistance to COVID-19 infection comprises two stages. The first stage is immune defense, while the second stage is extensive inflammation. This process is further divided into innate and adaptive immunity during the immune defense phase. These two stages involve various immune cells, including CD4+ T cells, CD8+ T cells, monocytes, dendritic cells, B cells, and natural killer cells. Various immune cells are involved and make up the complex and unique immune system response to COVID-19, providing characteristics that set it apart from other respiratory infectious diseases. In the present study, we identified cell markers for differentiating COVID-19 from common inflammatory responses, non-COVID-19 severe respiratory diseases, and healthy populations based on single-cell profiling of the gene expression of six immune cell types by using Boruta and mRMR feature selection methods. Some features such as IFI44L in B cells, S100A8 in monocytes, and NCR2 in natural killer cells are involved in the innate immune response of COVID-19. Other features such as ZFP36L2 in CD4+ T cells can regulate the inflammatory process of COVID-19. Subsequently, the IFS method was used to determine the best feature subsets and classifiers in the six immune cell types for two classification algorithms. Furthermore, we established the quantitative rules used to distinguish the disease status. The results of this study can provide theoretical support for a more in-depth investigation of COVID-19 pathogenesis and intervention strategies.
Affiliation(s)
- Hao Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Feiming Huang
- School of Life Sciences, Shanghai University, Shanghai, China
| | - Huiping Liao
- Ophthalmology and Optometry Medical School, Shandong University of Traditional Chinese Medicine, Jinan, China
| | - Zhandong Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Kaiyan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- *Correspondence: Tao Huang, ; Yu-Dong Cai,
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
- *Correspondence: Tao Huang, ; Yu-Dong Cai,
| |
Collapse
|
33
|
Du BX, Zhao PC, Zhu B, Yiu SM, Nyamabo AK, Yu H, Shi JY. MLGL-MP: a Multi-Label Graph Learning framework enhanced by pathway interdependence for Metabolic Pathway prediction. Bioinformatics 2022; 38:i325-i332. [PMID: 35758801 PMCID: PMC9235472 DOI: 10.1093/bioinformatics/btac222] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/20/2022]
Abstract
Motivation: During lead compound optimization, it is crucial to identify pathways where a drug-like compound is metabolized. Recently, machine learning-based methods have achieved inspiring progress to predict potential metabolic pathways for drug-like compounds. However, they neglect the knowledge that metabolic pathways are dependent on each other. Moreover, they are inadequate to elucidate why compounds participate in specific pathways.
Results: To address these issues, we propose a novel Multi-Label Graph Learning framework of Metabolic Pathway prediction boosted by pathway interdependence, called MLGL-MP, which contains a compound encoder, a pathway encoder and a multi-label predictor. The compound encoder learns compound embedding representations by graph neural networks. After constructing a pathway dependence graph by re-trained word embeddings and pathway co-occurrences, the pathway encoder learns pathway embeddings by graph convolutional networks. Moreover, after adapting the compound embedding space into the pathway embedding space, the multi-label predictor measures the proximity of two spaces to discriminate which pathways a compound participates in. The comparison with state-of-the-art methods on KEGG pathways demonstrates the superiority of our MLGL-MP. Also, the ablation studies reveal how its three components contribute to the model, including the pathway dependence, the adapter between compound embeddings and pathway embeddings, as well as the pre-training strategy. Furthermore, a case study illustrates the interpretability of MLGL-MP by indicating crucial substructures in a compound, which are significantly associated with the attending metabolic pathways. It is anticipated that this work can boost metabolic pathway predictions in drug discovery.
Availability and implementation: The code and data underlying this article are freely available at https://github.com/dubingxue/MLGL-MP.
Affiliation(s)
- Bing-Xue Du
  - School of Life Sciences, Northwestern Polytechnical University, Xi'an 710072, China
- Peng-Cheng Zhao
  - School of Life Sciences, Northwestern Polytechnical University, Xi'an 710072, China
- Bei Zhu
  - School of Life Sciences, Northwestern Polytechnical University, Xi'an 710072, China
- Siu-Ming Yiu
  - Department of Computer Science, The University of Hong Kong, Hong Kong 999077, China
- Arnold K Nyamabo
  - School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
- Hui Yu
  - School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
- Jian-Yu Shi
  - School of Life Sciences, Northwestern Polytechnical University, Xi'an 710072, China

34
Shi Z, Liu P, Liao X, Mao Z, Zhang J, Wang Q, Sun J, Ma H, Ma Y. Data-Driven Synthetic Cell Factories Development for Industrial Biomanufacturing. BioDesign Research 2022; 2022:9898461. [PMID: 37850146 PMCID: PMC10521697 DOI: 10.34133/2022/9898461] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/09/2022] [Accepted: 05/26/2022] [Indexed: 10/19/2023]
Abstract
Revolutionary breakthroughs in artificial intelligence (AI) and machine learning (ML) have had a profound impact on a wide range of scientific disciplines, including the development of artificial cell factories for biomanufacturing. In this paper, we review the latest studies on the application of data-driven methods for the design of new proteins, pathways, and strains. We first briefly introduce the various types of data and databases relevant to industrial biomanufacturing, which are the basis for data-driven research. Different types of algorithms, including traditional ML and more recent deep learning methods, are also presented. We then demonstrate how these data-based approaches can be applied to address various issues in cell factory development using examples from recent studies, including the prediction of protein function, improvement of metabolic models, and estimation of missing kinetic parameters, design of non-natural biosynthesis pathways, and pathway optimization. In the last section, we discuss the current limitations of these data-driven approaches and propose that data-driven methods should be integrated with mechanistic models to complement each other and facilitate the development of synthetic strains for industrial biomanufacturing.
Affiliation(s)
- Zhenkun Shi
  - Key Laboratory of Systems Microbial Technology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Pi Liu
  - Key Laboratory of Systems Microbial Technology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Xiaoping Liao
  - Key Laboratory of Systems Microbial Technology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Zhitao Mao
  - Key Laboratory of Systems Microbial Technology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Jianqi Zhang
  - Key Laboratory of Systems Microbial Technology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Qinhong Wang
  - Key Laboratory of Systems Microbial Technology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Jibin Sun
  - Key Laboratory of Systems Microbial Technology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Hongwu Ma
  - Key Laboratory of Systems Microbial Technology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Yanhe Ma
  - Key Laboratory of Systems Microbial Technology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China

35
Lo-Thong-Viramoutou O, Charton P, Cadet XF, Grondin-Perez B, Saavedra E, Damour C, Cadet F. Non-linearity of Metabolic Pathways Critically Influences the Choice of Machine Learning Model. Front Artif Intell 2022; 5:744755. [PMID: 35757298 PMCID: PMC9226554 DOI: 10.3389/frai.2022.744755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2021] [Accepted: 04/29/2022] [Indexed: 11/13/2022]
Abstract
The use of machine learning (ML) in life sciences has gained wide interest over the past years, as it speeds up the development of high performing models. Important modeling tools in biology have proven their worth for pathway design, such as mechanistic models and metabolic networks, as they allow better understanding of mechanisms involved in the functioning of organisms. However, little has been done on the use of ML to model metabolic pathways, and the degree of non-linearity associated with them is not clear. Here, we report the construction of different metabolic pathways with several linear and non-linear ML models. Different types of data are used; they lead to the prediction of important biological data, such as pathway flux and final product concentration. A comparison reveals that the data features impact model performance and highlight the effectiveness of non-linear models (e.g., QRF: RMSE = 0.021 nmol·min⁻¹ and R² = 1 vs. Bayesian GLM: RMSE = 1.379 nmol·min⁻¹ and R² = 0.823). It turns out that the greater the degree of non-linearity of the pathway, the better suited a non-linear model will be. Therefore, a decision-making support for pathway modeling is established. These findings generally support the hypothesis that non-linear aspects predominate within the metabolic pathways. This must be taken into account when devising possible applications of these pathways for the identification of biomarkers of diseases (e.g., infections, cancer, neurodegenerative diseases) or the optimization of industrial production processes.
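The two figures of merit quoted in this abstract, RMSE and R², can be computed as in the minimal sketch below. It is an illustration only: the function names and the toy values are hypothetical, and nothing here reproduces the authors' models or data.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical flux values (arbitrary units), not taken from the paper.
observed = [1.0, 2.0, 3.0, 4.0]
perfect_fit = [1.0, 2.0, 3.0, 4.0]   # a model that matches every point
mean_only = [2.5, 2.5, 2.5, 2.5]     # a model that always predicts the mean

print(rmse(observed, perfect_fit), r_squared(observed, perfect_fit))
print(rmse(observed, mean_only), r_squared(observed, mean_only))
```

A model that only ever predicts the mean of the observations scores R² = 0, which is why R² is often read as improvement over that baseline.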
Affiliation(s)
- Ophélie Lo-Thong-Viramoutou
  - University of Paris, BIGR (Biologie Intégrée du Globule Rouge), Inserm, UMR_S1134, Paris, France
  - Laboratory of Excellence GR-Ex, Paris, France
  - Laboratory DSIMB, UMR_S1134, BIGR, Inserm, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis, France
- Philippe Charton
  - University of Paris, BIGR (Biologie Intégrée du Globule Rouge), Inserm, UMR_S1134, Paris, France
  - Laboratory of Excellence GR-Ex, Paris, France
  - Laboratory DSIMB, UMR_S1134, BIGR, Inserm, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis, France
- Brigitte Grondin-Perez
  - EnergyLab, EA 4079, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis, France
- Emma Saavedra
  - Departamento de Bioquímica, Instituto Nacional de Cardiología Ignacio Chávez, Mexico City, Mexico
- Cédric Damour
  - EnergyLab, EA 4079, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis, France
- Frédéric Cadet
  - University of Paris, BIGR (Biologie Intégrée du Globule Rouge), Inserm, UMR_S1134, Paris, France
  - Laboratory of Excellence GR-Ex, Paris, France
  - Laboratory DSIMB, UMR_S1134, BIGR, Inserm, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis, France

36
Liao X, Ma H, Tang YJ. Artificial intelligence: a solution to involution of design–build–test–learn cycle. Curr Opin Biotechnol 2022; 75:102712. [DOI: 10.1016/j.copbio.2022.102712] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/22/2021] [Revised: 02/05/2022] [Accepted: 03/01/2022] [Indexed: 01/08/2023]

37
Huang F, Chen L, Guo W, Zhou X, Feng K, Huang T, Cai Y. Identifying COVID-19 Severity-Related SARS-CoV-2 Mutation Using a Machine Learning Method. Life (Basel) 2022; 12:806. [PMID: 35743837 PMCID: PMC9225528 DOI: 10.3390/life12060806] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 04/12/2022] [Revised: 05/22/2022] [Accepted: 05/25/2022] [Indexed: 12/22/2022]
Abstract
SARS-CoV-2 shows great evolutionary capacity through a high frequency of genomic variation during transmission. Evolved SARS-CoV-2 often demonstrates resistance to previous vaccines and can cause poor clinical status in patients. Mutations in the SARS-CoV-2 genome involve mutations in structural and nonstructural proteins, and some of these proteins such as spike proteins have been shown to be directly associated with the clinical status of patients with severe COVID-19 pneumonia. In this study, we collected genome-wide mutation information of virulent strains and the severity of COVID-19 pneumonia in patients varying depending on their clinical status. Important protein mutations and untranslated region mutations were extracted using machine learning methods. First, through Boruta and four ranking algorithms (least absolute shrinkage and selection operator, light gradient boosting machine, max-relevance and min-redundancy, and Monte Carlo feature selection), mutations that were highly correlated with the clinical status of the patients were screened out and sorted in four feature lists. Some mutations such as D614G and V1176F were shown to be associated with viral infectivity. Moreover, previously unreported mutations such as A320V of nsp14 and I164ILV of nsp14 were also identified, which suggests their potential roles. We then applied the incremental feature selection method to each feature list to construct efficient classifiers, which can be directly used to distinguish the clinical status of COVID-19 patients. Meanwhile, four sets of quantitative rules were set up, which can help us to more intuitively understand the role of each mutation in differentiating the clinical status of COVID-19 patients. Identified key mutations linked to virologic properties will help better understand the mechanisms of infection and will aid in the development of antiviral treatments.
Affiliation(s)
- Feiming Huang
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Wei Guo
  - Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) and Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai 200025, China
- Xianchao Zhou
  - Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine (SJTUSM), Shanghai 200025, China
- Kaiyan Feng
  - Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510060, China
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Yudong Cai
  - School of Life Sciences, Shanghai University, Shanghai 200444, China

38
Ran B, Chen L, Li M, Han Y, Dai Q. Drug-Drug Interactions Prediction Using Fingerprint Only. Computational and Mathematical Methods in Medicine 2022; 2022:7818480. [PMID: 35586666 PMCID: PMC9110191 DOI: 10.1155/2022/7818480] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 03/16/2022] [Accepted: 04/21/2022] [Indexed: 12/27/2022]
Abstract
Combination drug therapy is an efficient way to treat complicated diseases. Drug-drug interaction (DDI) is an important research topic in this therapy as patient safety is a problem when two or more drugs are taken at the same time. Traditionally, in vitro experiments and clinical trials are common ways to determine DDIs. However, these methods cannot meet the requirements of large-scale tests. It is an alternative way to develop computational methods for predicting DDIs. Although several previous methods have been proposed, they always need several types of drug information, limiting their applications. In this study, we proposed a simple computational method to predict DDIs. In this method, drugs were represented by their fingerprint features, which are most widely used in investigating drug-related problems. These features were refined by three models, including addition, subtraction, and Hadamard models, to generate the representation of DDIs. The powerful classification algorithm, random forest, was picked up to build the classifier. The results of two types of tenfold cross-validation on the classifier indicated good performance for discovering novel DDIs among known drugs and acceptable performance for identifying DDIs between known drugs and unknown drugs or among unknown drugs. Although the classifier adopted a sample scheme to represent DDIs, it was still superior to other methods, which adopted features generated by some advanced computer algorithms. Furthermore, a user-friendly web-server, named DDIPF (http://106.14.164.77:5004/DDIPF/), was developed to implement the classifier.
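The addition, subtraction, and Hadamard models named in this abstract combine two per-drug fingerprint vectors into one vector for the drug pair. A minimal sketch follows. The function name and the toy 4-bit fingerprints are hypothetical, and taking the absolute difference for the subtraction model is an assumption made here so that the pair representation does not depend on drug order; the abstract does not specify that detail.

```python
def pair_features(fp_a, fp_b, model):
    """Combine two equal-length fingerprint vectors into one pair vector.

    model: "addition", "subtraction", or "hadamard".
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    if model == "addition":
        return [a + b for a, b in zip(fp_a, fp_b)]
    if model == "subtraction":
        # Assumption: absolute difference, so (A, B) and (B, A) coincide.
        return [abs(a - b) for a, b in zip(fp_a, fp_b)]
    if model == "hadamard":
        # Element-wise product: 1 only where both drugs have the bit set.
        return [a * b for a, b in zip(fp_a, fp_b)]
    raise ValueError(f"unknown model: {model}")


# Toy binary fingerprints (hypothetical, far shorter than real ones).
drug_a = [1, 0, 1, 1]
drug_b = [1, 1, 0, 1]

for name in ("addition", "subtraction", "hadamard"):
    print(name, pair_features(drug_a, drug_b, name))
```

Vectors built this way could then be passed to any classifier, such as the random forest mentioned in the abstract.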
Affiliation(s)
- Bing Ran
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Meijing Li
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Yujuan Han
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Qi Dai
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China

39
Li Z, Guo W, Ding S, Feng K, Lu L, Huang T, Cai Y. Detecting Blood Methylation Signatures in Response to Childhood Cancer Radiotherapy via Machine Learning Methods. Biology 2022; 11:607. [PMID: 35453806 PMCID: PMC9030135 DOI: 10.3390/biology11040607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2022] [Revised: 04/09/2022] [Accepted: 04/14/2022] [Indexed: 11/16/2022]
Abstract
Radiotherapy is a helpful treatment for cancer, but it can also potentially cause changes in many molecules, resulting in adverse effects. Among these changes, the occurrence of abnormal DNA methylation patterns has alarmed scientists. To explore the influence of region-specific radiotherapy on blood DNA methylation, we designed a computational workflow by using machine learning methods that can identify crucial methylation alterations related to treatment exposure. Irrelevant methylation features from the DNA methylation profiles of 2052 childhood cancer survivors were excluded via the Boruta method, and the remaining features were ranked using the minimum redundancy maximum relevance method to generate feature lists. These feature lists were then fed into the incremental feature selection method, which uses a combination of deep forest, k-nearest neighbor, random forest, and decision tree to find the most important methylation signatures and build the best classifiers and classification rules. Several methylation signatures and rules have been discovered and confirmed, allowing for a better understanding of methylation patterns in response to different treatment exposures.
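Incremental feature selection (IFS), as described in this abstract, takes an already ranked feature list, evaluates a classifier on the top-1, top-2, ... top-k features, and keeps the best-scoring subset. The sketch below illustrates that loop under stated assumptions: the helper names, the feature names `g1` and `g2`, and the toy data are hypothetical, and a leave-one-out 1-nearest-neighbor classifier stands in for the classifiers used in the study.

```python
def incremental_feature_selection(ranked_features, evaluate, step=1):
    """Score growing prefixes of a ranked feature list; return the best one."""
    best_subset, best_score = [], float("-inf")
    for k in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:k]
        score = evaluate(subset)
        if score > best_score:  # ties keep the smaller subset
            best_subset, best_score = subset, score
    return best_subset, best_score


def loo_1nn_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of 1-nearest-neighbor on the chosen features."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in subset),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


# Toy data: g1 separates the classes, g2 is noise (hypothetical values).
rows = [
    {"g1": 0.0, "g2": 5.0},
    {"g1": 0.1, "g2": 1.0},
    {"g1": 1.0, "g2": 5.1},
    {"g1": 1.1, "g2": 1.1},
]
labels = [0, 0, 1, 1]
ranked = ["g1", "g2"]  # assumed output of a prior ranking step

best_subset, best_score = incremental_feature_selection(
    ranked, lambda subset: loo_1nn_accuracy(rows, labels, subset)
)
print(best_subset, best_score)
```

Passing the scoring function in as a callback keeps the selection loop independent of the classifier, so the same loop can be reused with different models.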
Affiliation(s)
- Zhandong Li
  - College of Food Engineering, Jilin Engineering Normal University, Changchun 130052, China
- Wei Guo
  - Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai 200025, China
- Shijian Ding
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
- Kaiyan Feng
  - Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510507, China
- Lin Lu
  - Department of Radiology, Columbia University Medical Center, New York, NY 10032, USA
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Yudong Cai
  - School of Life Sciences, Shanghai University, Shanghai 200444, China

Correspondence: L.L., T.H., or Y.C.; Tel.: +86-21-54923269 (T.H.); +86-21-66136132 (Y.C.)

40
Wang P, Schumacher AM, Shiu SH. Computational prediction of plant metabolic pathways. Current Opinion in Plant Biology 2022; 66:102171. [PMID: 35078130 DOI: 10.1016/j.pbi.2021.102171] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 11/02/2021] [Revised: 12/07/2021] [Accepted: 12/18/2021] [Indexed: 06/14/2023]
Abstract
Uncovering genes encoding enzymes responsible for the biosynthesis of diverse plant metabolites is essential for metabolic engineering and production of plant metabolite-derived medicine. With the availability of multi-omics data for an ever-increasing number of plant species and the development of computational approaches, the metabolic pathways of many important plant compounds can be predicted, complementing a more traditional genetic and/or biochemical approach. Here, we summarize recent progress in predicting plant metabolic pathways using genome, transcriptome, proteome, interactome, and/or metabolome data, and the utility of integrating these data with machine learning to further improve metabolic pathway predictions.
Affiliation(s)
- Peipei Wang
  - Department of Plant Biology, Michigan State University, East Lansing, MI, 48824, USA
- Ally M Schumacher
  - Department of Plant Biology, Michigan State University, East Lansing, MI, 48824, USA
- Shin-Han Shiu
  - Department of Plant Biology, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI, 48824, USA

41
Similarity-Based Method with Multiple-Feature Sampling for Predicting Drug Side Effects. Computational and Mathematical Methods in Medicine 2022; 2022:9547317. [PMID: 35401786 PMCID: PMC8993545 DOI: 10.1155/2022/9547317] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 07/20/2021] [Revised: 09/18/2021] [Accepted: 03/15/2022] [Indexed: 12/23/2022]
Abstract
Drugs can treat different diseases but also bring side effects. Undetected and unaccepted side effects for approved drugs can greatly harm the human body and bring huge risks for pharmaceutical companies. Traditional experimental methods used to determine the side effects have several drawbacks, such as low efficiency and high cost. One alternative to achieve this purpose is to design computational methods. Previous studies modeled a binary classification problem by pairing drugs and side effects; however, their classifiers can only extract one feature from each type of drug association. The present work proposed a novel multiple-feature sampling scheme that can extract several features from one type of drug association. Thirteen classification algorithms were employed to construct classifiers with features yielded by such scheme. Their performance was greatly improved compared with that of the classifiers that use the features yielded by the original scheme. Best performance was observed for the classifier based on random forest with MCC of 0.8661, AUROC of 0.969, and AUPR of 0.977. Finally, one key parameter in the multiple-feature sampling scheme was analyzed.
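The Matthews correlation coefficient (MCC) reported in this abstract summarizes a binary confusion matrix in a single value between -1 and 1. A minimal sketch of the standard formula follows. The function name and the example counts are hypothetical and unrelated to the study's data, and returning 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an empty row or column gives MCC = 0.
        return 0.0
    return numerator / denominator


# Hypothetical confusion-matrix counts for drug/side-effect pairs.
print(matthews_corrcoef(tp=6, tn=3, fp=1, fn=2))
```

Unlike plain accuracy, MCC stays near 0 for a classifier that labels every pair with the majority class, which makes it a useful summary when positive and negative pairs are imbalanced.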

42
Probst D, Manica M, Nana Teukam YG, Castrogiovanni A, Paratore F, Laino T. Biocatalysed synthesis planning using data-driven learning. Nat Commun 2022; 13:964. [PMID: 35181654 PMCID: PMC8857209 DOI: 10.1038/s41467-022-28536-w] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 08/30/2021] [Accepted: 01/25/2022] [Indexed: 01/30/2023]
Abstract
Enzyme catalysts are an integral part of green chemistry strategies towards a more sustainable and resource-efficient chemical synthesis. However, the use of biocatalysed reactions in retrosynthetic planning clashes with the difficulties in predicting the enzymatic activity on unreported substrates and enzyme-specific stereo- and regioselectivity. As of now, only rule-based systems support retrosynthetic planning using biocatalysis, while initial data-driven approaches are limited to forward predictions. Here, we extend the data-driven forward reaction as well as retrosynthetic pathway prediction models based on the Molecular Transformer architecture to biocatalysis. The enzymatic knowledge is learned from an extensive data set of publicly available biochemical reactions with the aid of a new class token scheme based on the enzyme commission classification number, which captures catalysis patterns among different enzymes belonging to the same hierarchy. The forward reaction prediction model (top-1 accuracy of 49.6%), the retrosynthetic pathway (top-1 single-step round-trip accuracy of 39.6%) and the curated data set are made publicly available to facilitate the adoption of enzymatic catalysis in the design of greener chemistry processes.
Affiliation(s)
- Daniel Probst
  - IBM Research Europe, CH-8803, Rüschlikon, Switzerland
  - National Center for Competence in Research-Catalysis (NCCR-Catalysis), Rüschlikon, Switzerland
- Matteo Manica
  - IBM Research Europe, CH-8803, Rüschlikon, Switzerland
- Alessandro Castrogiovanni
  - IBM Research Europe, CH-8803, Rüschlikon, Switzerland
  - National Center for Competence in Research-Catalysis (NCCR-Catalysis), Rüschlikon, Switzerland
- Teodoro Laino
  - IBM Research Europe, CH-8803, Rüschlikon, Switzerland
  - National Center for Competence in Research-Catalysis (NCCR-Catalysis), Rüschlikon, Switzerland

43
Dugé de Bernonville T, Amor Stander E, Dugé de Bernonville G, Besseau S, Courdavault V. Predicting Monoterpene Indole Alkaloid-Related Genes from Expression Data with Artificial Neural Networks. Methods Mol Biol 2022; 2505:131-140. [PMID: 35732942 DOI: 10.1007/978-1-0716-2349-7_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Elucidation of biological pathways leading to specialized metabolites remains a complex task. It is however a mandatory step to allow bioproduction into heterologous hosts. Many steps have already been identified using conventional approaches, enlarging the space of known possible chemical steps. In the recent past years, identification of missing steps has been fueled by the generation of genomic and transcriptomic data for nonmodel species. The analysis of gene expression profiles has revealed that in many cases, genes encoding enzymes involved in the same biosynthetic pathways are coexpressed across different tissue types and environmental conditions. Hence, coexpressed studies, either in the form of differential gene expression, gene coexpression network, or unsupervised clustering methods, have helped deciphering missing steps to complete knowledge on biosynthetic pathways. Already identified biosynthetic steps can be used as baits to capture the remaining unknown steps. The present protocol shows how supervised machine learning in the form of artificial neural networks (ANNs) can efficiently classify genes as specialized metabolism related or not according to their expression levels. Using Catharanthus roseus as an example, we show that ANN trained on a minimal set of bait genes results in many true positives (correctly predicted genes) while keeping false positives low (containing possible candidate genes).
Affiliation(s)
- Emily Amor Stander
  - EA2106 Biomolécules et Biotechnologies Végétales, Université de Tours, Tours, France
- Sébastien Besseau
  - EA2106 Biomolécules et Biotechnologies Végétales, Université de Tours, Tours, France
- Vincent Courdavault
  - EA2106 Biomolécules et Biotechnologies Végétales, Université de Tours, Tours, France

44
Ma X, Ma L, Huo YX. Reconstructing the transcription regulatory network to optimize resource allocation for robust biosynthesis. Trends Biotechnol 2021; 40:735-751. [PMID: 34895933 DOI: 10.1016/j.tibtech.2021.11.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/21/2021] [Revised: 11/07/2021] [Accepted: 11/08/2021] [Indexed: 11/16/2022]
Abstract
An ideal microbial cell factory (MCF) should deliver maximal resources to production, which conflicts with the microbe's native growth-oriented resource allocation strategy and can therefore lead to early termination of the high-yield period. Reallocating resources from growth to production has become a critical factor in constructing robust MCFs. Instead of strengthening specific biosynthetic pathways, emerging endeavors are focused on rearranging the gene regulatory network to fundamentally reprogram the resource allocation pattern. Combining this idea with transcriptional regulation within the hierarchical regulatory network, this review discusses recent engineering strategies targeting the transcription machinery, module networks, regulatory edges, and bottom network layer. This global view will help to construct a production-oriented phenotype that fully harnesses the potential of MCFs.
Collapse
Affiliation(s)
- Xiaoyan Ma
- Key Laboratory of Molecular Medicine and Biotherapy, School of Life Science, Beijing Institute of Technology, 5 South Zhongguancun Street, Haidian District, Beijing 100081, People's Republic of China
| | - Lianjie Ma
- Key Laboratory of Molecular Medicine and Biotherapy, School of Life Science, Beijing Institute of Technology, 5 South Zhongguancun Street, Haidian District, Beijing 100081, People's Republic of China
| | - Yi-Xin Huo
- Key Laboratory of Molecular Medicine and Biotherapy, School of Life Science, Beijing Institute of Technology, 5 South Zhongguancun Street, Haidian District, Beijing 100081, People's Republic of China; Tobacco Research Institute, Chinese Academy of Agricultural Sciences, Qingdao 266101, People's Republic of China.
45
Tinte MM, Chele KH, van der Hooft JJJ, Tugizimana F. Metabolomics-Guided Elucidation of Plant Abiotic Stress Responses in the 4IR Era: An Overview. Metabolites 2021; 11:445. [PMID: 34357339 PMCID: PMC8305945 DOI: 10.3390/metabo11070445] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 05/25/2021] [Revised: 06/30/2021] [Accepted: 07/03/2021] [Indexed: 12/27/2022] [Open Access]
Abstract
Plants are constantly challenged by changing environmental conditions that include abiotic stresses. These are limiting their development and productivity and are subsequently threatening our food security, especially when considering the pressure of the increasing global population. Thus, there is an urgent need for the next generation of crops with high productivity and resilience to climate change. The dawn of a new era characterized by the emergence of fourth industrial revolution (4IR) technologies has redefined the ideological boundaries of research and applications in plant sciences. Recent technological advances and machine learning (ML)-based computational tools and omics data analysis approaches are allowing scientists to derive comprehensive metabolic descriptions and models for the target plant species under specific conditions. Such accurate metabolic descriptions are imperatively essential for devising a roadmap for the next generation of crops that are resilient to environmental deterioration. By synthesizing the recent literature and collating data on metabolomics studies on plant responses to abiotic stresses, in the context of the 4IR era, we point out the opportunities and challenges offered by omics science, analytical intelligence, computational tools and big data analytics. Specifically, we highlight technological advancements in (plant) metabolomics workflows and the use of machine learning and computational tools to decipher the dynamics in the chemical space that define plant responses to abiotic stress conditions.
Affiliation(s)
- Morena M. Tinte
- Department of Biochemistry, University of Johannesburg, Auckland Park, Johannesburg 2006, South Africa
- Kekeletso H. Chele
- Department of Biochemistry, University of Johannesburg, Auckland Park, Johannesburg 2006, South Africa
- Fidele Tugizimana
- Department of Biochemistry, University of Johannesburg, Auckland Park, Johannesburg 2006, South Africa
- International Research and Development Division, Omnia Group, Ltd., Johannesburg 2021, South Africa
46
Shah HA, Liu J, Yang Z, Feng J. Review of Machine Learning Methods for the Prediction and Reconstruction of Metabolic Pathways. Front Mol Biosci 2021; 8:634141. [PMID: 34222327 PMCID: PMC8247443 DOI: 10.3389/fmolb.2021.634141] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/27/2020] [Accepted: 06/01/2021] [Indexed: 11/13/2022] [Open Access]
Abstract
Prediction and reconstruction of metabolic pathways play significant roles in many fields, such as genetic engineering, metabolic engineering, and drug discovery, and are becoming the most active research topics in synthetic biology. With the increase of related data and the development of machine learning techniques, many machine learning-based methods have been proposed for the prediction or reconstruction of metabolic pathways. Machine learning techniques are showing state-of-the-art performance in handling the rapidly increasing volume of data in synthetic biology. To support researchers in this field, we briefly review the research progress of metabolic pathway reconstruction and prediction based on machine learning. Some challenging issues in the reconstruction of metabolic pathways are also discussed in this paper.
Affiliation(s)
- Hayat Ali Shah
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China
- Juan Liu
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China
- Zhihui Yang
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China
- Jing Feng
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan, China
47
Lopez-Ibañez J, Pazos F, Chagoyen M. Predicting biological pathways of chemical compounds with a profile-inspired approach. BMC Bioinformatics 2021; 22:320. [PMID: 34118870 PMCID: PMC8199418 DOI: 10.1186/s12859-021-04252-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/15/2021] [Accepted: 06/09/2021] [Indexed: 01/18/2023] [Open Access]
Abstract
BACKGROUND: Assignment of chemical compounds to biological pathways is a crucial step to understand the relationship between the chemical repertory of an organism and its biology. Protein sequence profiles are very successful in capturing the main structural and functional features of a protein family, and can be used to assign new members to it based on matching of their sequences against these profiles. In this work, we extend this idea to chemical compounds, constructing a profile-inspired model for a set of related metabolites (those in the same biological pathway), based on a fragment-based vectorial representation of their chemical structures. RESULTS: We use this representation to predict the biological pathway of a chemical compound with good overall accuracy (AUC 0.74-0.90 depending on the database tested), and analyze some factors that affect performance. The approach, which is compared with equivalent methods, can in addition detect those molecular fragments characteristic of a pathway. CONCLUSIONS: The method is available as a graphical interactive web server (http://csbg.cnb.csic.es/iFragMent).
Affiliation(s)
- Javier Lopez-Ibañez
- Computational Systems Biology Group, National Center for Biotechnology (CNB-CSIC), Darwin 3, 28049, Madrid, Spain
- Florencio Pazos
- Computational Systems Biology Group, National Center for Biotechnology (CNB-CSIC), Darwin 3, 28049, Madrid, Spain
- Monica Chagoyen
- Computational Systems Biology Group, National Center for Biotechnology (CNB-CSIC), Darwin 3, 28049, Madrid, Spain.
48
Muzio G, O’Bray L, Borgwardt K. Biological network analysis with deep learning. Brief Bioinform 2021; 22:1515-1530. [PMID: 33169146 PMCID: PMC7986589 DOI: 10.1093/bib/bbaa257] [Citation(s) in RCA: 101] [Impact Index Per Article: 25.3] [Received: 04/30/2020] [Revised: 08/26/2020] [Accepted: 09/11/2020] [Indexed: 12/17/2022] [Open Access]
Abstract
Recent advancements in experimental high-throughput technologies have expanded the availability and quantity of molecular data in biology. Given the importance of interactions in biological processes, such as the interactions between proteins or the bonds within a chemical compound, this data is often represented in the form of a biological network. The rise of this data has created a need for new computational tools to analyze networks. One major trend in the field is to use deep learning for this goal and, more specifically, to use methods that work with networks, the so-called graph neural networks (GNNs). In this article, we describe biological networks and review the principles and underlying algorithms of GNNs. We then discuss domains in bioinformatics in which graph neural networks are frequently being applied at the moment, such as protein function prediction, protein-protein interaction prediction and in silico drug discovery and development. Finally, we highlight application areas such as gene regulatory networks and disease diagnosis where deep learning is emerging as a new tool to answer classic questions like gene interaction prediction and automatic disease prediction from data.
Affiliation(s)
- Giulia Muzio
- Machine Learning and Computational Biology Lab at ETH Zürich
- Leslie O’Bray
- Machine Learning and Computational Biology Lab at ETH Zürich
49
Auslander N, Gussow AB, Koonin EV. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int J Mol Sci 2021; 22:2903. [PMID: 33809353 PMCID: PMC8000113 DOI: 10.3390/ijms22062903] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 02/15/2021] [Revised: 03/08/2021] [Accepted: 03/10/2021] [Indexed: 12/23/2022] [Open Access]
Abstract
The exponential growth of biomedical data in recent years has urged the application of numerous machine learning techniques to address emerging problems in biology and clinical research. By enabling the automatic feature extraction, selection, and generation of predictive models, these methods can be used to efficiently study complex biological systems. Machine learning techniques are frequently integrated with bioinformatic methods, as well as curated databases and biological networks, to enhance training and validation, identify the best interpretable features, and enable feature and model investigation. Here, we review recently developed methods that incorporate machine learning within the same framework with techniques from molecular evolution, protein structure analysis, systems biology, and disease genomics. We outline the challenges posed for machine learning, and, in particular, deep learning in biomedicine, and suggest unique opportunities for machine learning techniques integrated with established bioinformatics approaches to overcome some of these challenges.
Affiliation(s)
- Eugene V. Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
50
Yuan F, Li Z, Chen L, Zeng T, Zhang YH, Ding S, Huang T, Cai YD. Identifying the Signatures and Rules of Circulating Extracellular MicroRNA for Distinguishing Cancer Subtypes. Front Genet 2021; 12:651610. [PMID: 33767734 PMCID: PMC7985347 DOI: 10.3389/fgene.2021.651610] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/10/2021] [Accepted: 02/10/2021] [Indexed: 12/24/2022] [Open Access]
Abstract
Cancer is one of the most threatening diseases to humans. It can invade multiple significant organs, including lung, liver, stomach, pancreas, and even brain. The identification of cancer biomarkers is one of the most significant components of cancer studies as the foundation of clinical cancer diagnosis and related drug development. During the large-scale screening for cancer prevention and early diagnosis, obtaining cancer-related tissues is impossible. Thus, the identification of cancer-associated circulating biomarkers from liquid biopsy targeting has been proposed and has become the most important direction for research on clinical cancer diagnosis. Here, we analyzed pan-cancer extracellular microRNA profiles by using multiple machine-learning models. The extracellular microRNA profiles on 11 cancer types and non-cancer were first analyzed by Boruta to extract important microRNAs. Selected microRNAs were then evaluated by the Max-Relevance and Min-Redundancy feature selection method, resulting in a feature list, which was fed into the incremental feature selection method to identify candidate circulating extracellular microRNAs for cancer recognition and classification. A series of quantitative classification rules was also established for such cancer classification, thereby providing a solid research foundation for further biomarker exploration and functional analyses of tumorigenesis at the level of circulating extracellular microRNA.
Affiliation(s)
- Fei Yuan
- School of Life Sciences, Shanghai University, Shanghai, China
- Department of Science and Technology, Binzhou Medical University Hospital, Binzhou, China
- Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, China
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
- Tao Zeng
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Hang Zhang
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, United States
- Shijian Ding
- School of Life Sciences, Shanghai University, Shanghai, China
- Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China