1
Yuan GH, Li J, Yang Z, Chen YQ, Yuan Z, Chen T, Ouyang W, Dong N, Yang L. Deep generative model for protein subcellular localization prediction. Brief Bioinform 2025; 26:bbaf152. [PMID: 40211979 PMCID: PMC11986326 DOI: 10.1093/bib/bbaf152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2024] [Revised: 03/18/2025] [Accepted: 03/19/2025] [Indexed: 04/14/2025] Open
Abstract
A protein's sequence not only determines its structure but also provides important clues to its subcellular localization. Although a series of artificial intelligence models have been reported to predict protein subcellular localization, most of them provide only textual outputs. Here, we present deepGPS, a deep generative model for protein subcellular localization prediction. After training with protein primary sequences and fluorescence images, deepGPS shows the ability to predict cytoplasmic and nuclear localizations by reporting both textual labels and generative images as outputs. In addition, cell-type-specific deepGPS models can be developed by using distinct image datasets from different cell lines for comparative analyses. Moreover, deepGPS shows potential to be further extended for other specific organelles, such as vesicles and endoplasmic reticulum, even with limited volumes of training data. Finally, the openGPS website (https://bits.fudan.edu.cn/opengps) is constructed to provide a publicly accessible and user-friendly platform for studying protein subcellular localization and function.
Affiliation(s)
- Guo-Hua Yuan
  - Center for Molecular Medicine, Children’s Hospital of Fudan University and Shanghai Key Laboratory of Medical Epigenetics, International Laboratory of Medical Epigenetics and Metabolism, Ministry of Science and Technology, Institutes of Biomedical Sciences, Fudan University, 131 Dongan Road, Xuhui District, Shanghai 200032, China
- Jinzhe Li
  - Shanghai Artificial Intelligence Laboratory, 129 Longwen Road, Xuhui District, Shanghai 200232, China
  - School of Information Science and Technology, Fudan University, 2005 Songhu Road, Yangpu District, Shanghai 200433, China
- Zejun Yang
  - Shanghai Artificial Intelligence Laboratory, 129 Longwen Road, Xuhui District, Shanghai 200232, China
- Yao-Qi Chen
  - Center for Molecular Medicine, Children’s Hospital of Fudan University and Shanghai Key Laboratory of Medical Epigenetics, International Laboratory of Medical Epigenetics and Metabolism, Ministry of Science and Technology, Institutes of Biomedical Sciences, Fudan University, 131 Dongan Road, Xuhui District, Shanghai 200032, China
- Zhonghang Yuan
  - Shanghai Artificial Intelligence Laboratory, 129 Longwen Road, Xuhui District, Shanghai 200232, China
- Tao Chen
  - School of Information Science and Technology, Fudan University, 2005 Songhu Road, Yangpu District, Shanghai 200433, China
- Wanli Ouyang
  - Shanghai Artificial Intelligence Laboratory, 129 Longwen Road, Xuhui District, Shanghai 200232, China
- Nanqing Dong
  - Shanghai Artificial Intelligence Laboratory, 129 Longwen Road, Xuhui District, Shanghai 200232, China
  - Shanghai Innovation Institute, 699 Huafa Road, Xuhui District, Shanghai 200231, China
- Li Yang
  - Center for Molecular Medicine, Children’s Hospital of Fudan University and Shanghai Key Laboratory of Medical Epigenetics, International Laboratory of Medical Epigenetics and Metabolism, Ministry of Science and Technology, Institutes of Biomedical Sciences, Fudan University, 131 Dongan Road, Xuhui District, Shanghai 200032, China
2
Li M, Dalton K, Hekstra D. SFCalculator: connecting deep generative models and crystallography. bioRxiv 2025:2025.01.12.632630. [PMID: 39868231 PMCID: PMC11760793 DOI: 10.1101/2025.01.12.632630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2025]
Abstract
Proteins drive biochemical transformations by transitioning through distinct conformational states. Understanding these states is essential for modulating protein function. Although X-ray crystallography has enabled revolutionary advances in protein structure prediction by machine learning, this connection was made at the level of atomic models, not the underlying data. This lack of connection to crystallographic data limits the potential for further advances in both the accuracy of protein structure prediction and the application of machine learning to experimental structure determination. Here, we present SFCalculator, a differentiable pipeline that generates crystallographic observables from atomistic molecular structures with bulk solvent correction, bridging crystallographic data and neural network-based molecular modeling. We validate SFCalculator against conventional methods and demonstrate its utility by establishing three important proof-of-concept applications. First, SFCalculator enables accurate placement of molecular models relative to crystal lattices (known as phasing). Second, SFCalculator enables the search of the latent space of generative models for conformations that fit crystallographic data and are, therefore, also implicitly constrained by the information encoded by the model. Finally, SFCalculator enables the use of crystallographic data during training of generative models, enabling these models to generate an ensemble of conformations consistent with crystallographic data. SFCalculator, therefore, enables a new generation of analytical paradigms integrating crystallographic data and machine learning.
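As a rough illustration of what "generating crystallographic observables from atomistic structures" involves, the textbook structure factor sum F(hkl) = Σ_j f_j exp[2πi(h x_j + k y_j + l z_j)] can be sketched as below. This is not SFCalculator's API: the function name and input layout are hypothetical, scattering factors are treated as constants, and the symmetry handling, B-factors, bulk solvent correction, and differentiability described in the abstract are all omitted.

```python
import cmath
import math


def structure_factor(hkl, atoms):
    """Textbook structure factor for one reflection (hypothetical helper).

    hkl   -- Miller indices (h, k, l)
    atoms -- iterable of (f, (x, y, z)): a constant scattering factor and
             fractional coordinates; both are simplifying assumptions.
    """
    h, k, l = hkl
    total = 0j
    for f, (x, y, z) in atoms:
        phase = 2.0 * math.pi * (h * x + k * y + l * z)
        total += f * cmath.exp(1j * phase)
    return total


# Two identical atoms half a cell apart along a: the (1 0 0) reflection
# cancels, while the (2 0 0) reflection adds constructively.
atoms = [(1.0, (0.0, 0.0, 0.0)), (1.0, (0.5, 0.0, 0.0))]
amp_100 = abs(structure_factor((1, 0, 0), atoms))
amp_200 = abs(structure_factor((2, 0, 0), atoms))
```

A differentiable pipeline of the kind the abstract describes would express the same sum in an autodiff framework so that gradients of the amplitudes with respect to atomic coordinates are available.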
Affiliation(s)
- Minhuan Li
  - John A. Paulson School of Engineering & Applied Sciences, Harvard University
- Kevin Dalton
  - Department of Molecular & Cellular Biology, Harvard University
  - LCLS Data Systems, SLAC National Accelerator Laboratory
- Doeke Hekstra
  - John A. Paulson School of Engineering & Applied Sciences, Harvard University
  - Department of Molecular & Cellular Biology, Harvard University
3
Zhang K, Yang X, Wang Y, Yu Y, Huang N, Li G, Li X, Wu JC, Yang S. Artificial intelligence in drug development. Nat Med 2025; 31:45-59. [PMID: 39833407 DOI: 10.1038/s41591-024-03434-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2024] [Accepted: 11/25/2024] [Indexed: 01/22/2025]
Abstract
Drug development is a complex and time-consuming endeavor that traditionally relies on the experience of drug developers and trial-and-error experimentation. The advent of artificial intelligence (AI) technologies, particularly emerging large language models and generative AI, is poised to redefine this paradigm. The integration of AI-driven methodologies into the drug development pipeline has already heralded subtle yet meaningful enhancements in both the efficiency and effectiveness of this process. Here we present an overview of recent advancements in AI applications across the entire drug development workflow, encompassing the identification of disease targets, drug discovery, preclinical and clinical studies, and post-market surveillance. Lastly, we critically examine the prevailing challenges to highlight promising future research directions in AI-augmented drug development.
Affiliation(s)
- Kang Zhang
  - Eye Hospital and Institute for Advanced Study on Eye Health and Diseases, Institute for clinical Data Science, Wenzhou Medical University, Wenzhou, China
  - State Key Laboratory of Macromolecular Drugs and Large-Scale Preparation, Wenzhou Medical University, Wenzhou, China
- Xin Yang
  - Department of Biotherapy, Cancer Center and State Key Laboratory of Biotherapy, West China Hospital, Sichuan University, Chengdu, China
- Yifei Wang
  - Department of Biotherapy, Cancer Center and State Key Laboratory of Biotherapy, West China Hospital, Sichuan University, Chengdu, China
- Yunfang Yu
  - Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, China
  - Institute for AI in Medicine and faculty of Medicine, Macau University of Science and Technology, Macau, China
  - Guangzhou National Laboratory, Guangzhou, China
- Niu Huang
  - National Institute of Biological Sciences, Beijing, China
- Gen Li
  - Eye Hospital and Institute for Advanced Study on Eye Health and Diseases, Institute for clinical Data Science, Wenzhou Medical University, Wenzhou, China
  - Guangzhou National Laboratory, Guangzhou, China
  - Eye and Vision Innovation Center, Eye Valley, Wenzhou, China
- Xiaokun Li
  - State Key Laboratory of Macromolecular Drugs and Large-Scale Preparation, Wenzhou Medical University, Wenzhou, China
- Joseph C Wu
  - Cardiovascular Research Institute, Stanford University, Stanford, CA, USA
- Shengyong Yang
  - Department of Biotherapy, Cancer Center and State Key Laboratory of Biotherapy, West China Hospital, Sichuan University, Chengdu, China
4
Raimondi D, Passemiers A, Verplaetse N, Corso M, Ferrero-Serrano Á, Nazzicari N, Biscarini F, Fariselli P, Moreau Y. Biologically meaningful genome interpretation models to address data underdetermination for the leaf and seed ionome prediction in Arabidopsis thaliana. Sci Rep 2024; 14:13188. [PMID: 38851759 PMCID: PMC11162433 DOI: 10.1038/s41598-024-63855-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Accepted: 06/03/2024] [Indexed: 06/10/2024] Open
Abstract
Genome interpretation (GI) encompasses the computational attempts to model the relationship between genotype and phenotype with the goal of understanding how the first leads to the second. While traditional approaches have focused on sub-problems such as predicting the effect of single nucleotide variants or finding genetic associations, recent advances in neural networks (NNs) have made it possible to develop end-to-end GI models that take genomic data as input and predict phenotypes as output. However, technical and modeling issues still need to be fixed for these models to be effective, including the widespread underdetermination of genomic datasets, which makes them unsuitable for training large, overfitting-prone NNs. Here we propose novel GI models to address this issue, exploring the use of two types of transfer learning approaches and proposing a novel Biologically Meaningful Sparse NN layer specifically designed for end-to-end GI. Our models predict the leaf and seed ionome in A. thaliana, obtaining comparable results to our previous over-parameterized model while reducing the number of parameters 8.8-fold. We also investigate how population stratification influences the evaluation of performance, highlighting how it leads to (1) an instance of Simpson's paradox, and (2) model generalization limitations.
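The Simpson's paradox mentioned above can be made concrete with a toy example: when data are pooled across subpopulations, an aggregate correlation can have the opposite sign to the correlation inside every subpopulation. The numbers below are invented for illustration only and have no connection to the paper's ionome data; the helper name is likewise hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Two made-up subpopulations: within each, prediction and target rise together.
pop_a_x, pop_a_y = [1, 2, 3], [10, 11, 12]
pop_b_x, pop_b_y = [7, 8, 9], [1, 2, 3]

r_a = pearson(pop_a_x, pop_a_y)  # positive within subpopulation A
r_b = pearson(pop_b_x, pop_b_y)  # positive within subpopulation B
# Pooling the two groups reverses the sign of the correlation.
r_pooled = pearson(pop_a_x + pop_b_x, pop_a_y + pop_b_y)
```

The practical consequence is that a performance metric computed on a stratified population should be checked per stratum as well as in aggregate.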
Affiliation(s)
- Massimiliano Corso
  - Université Paris-Saclay, INRAE, AgroParisTech, Institute Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France
- Ángel Ferrero-Serrano
  - Department of Biology, Pennsylvania State University, University Park, PA, 16802, USA
- Piero Fariselli
  - Department of Medical Sciences, University of Torino, 10123, Turin, Italy
- Yves Moreau
  - ESAT-STADIUS, KU Leuven, 3001, Leuven, Belgium
5
Bitencourt-Ferreira G, Villarreal MA, Quiroga R, Biziukova N, Poroikov V, Tarasova O, de Azevedo Junior WF. Exploring Scoring Function Space: Developing Computational Models for Drug Discovery. Curr Med Chem 2024; 31:2361-2377. [PMID: 36944627 DOI: 10.2174/0929867330666230321103731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Revised: 12/15/2022] [Accepted: 12/29/2022] [Indexed: 03/23/2023]
Abstract
BACKGROUND: The idea of scoring function space established a systems-level approach to address the development of models to predict the affinity of drug molecules by those interested in drug discovery. OBJECTIVE: Our goal here is to review the concept of scoring function space and how to explore it to develop machine learning models to address protein-ligand binding affinity. METHODS: We searched the articles available in PubMed related to the scoring function space. We also utilized crystallographic structures found in the Protein Data Bank (PDB) to represent the protein space. RESULTS: The application of systems-level approaches to address receptor-drug interactions allows us to have a holistic view of the process of drug discovery. The scoring function space adds flexibility to the process since it makes it possible to see drug discovery as a relationship involving mathematical spaces. CONCLUSION: The application of the concept of scoring function space has provided us with an integrated view of drug discovery methods. This concept is useful during drug discovery, where we see the process as a computational search of the scoring function space to find an adequate model to predict receptor-drug binding affinity.
Affiliation(s)
- Marcos A Villarreal
  - CONICET-Departamento de Matemática y Física, Instituto de Investigaciones en Fisicoquímica de Córdoba (INFIQC), Facultad de Ciencias Químicas, Universidad Nacional de Córdoba, Ciudad Universitaria, Córdoba, Argentina
- Rodrigo Quiroga
  - CONICET-Departamento de Matemática y Física, Instituto de Investigaciones en Fisicoquímica de Córdoba (INFIQC), Facultad de Ciencias Químicas, Universidad Nacional de Córdoba, Ciudad Universitaria, Córdoba, Argentina
- Nadezhda Biziukova
  - Institute of Biomedical Chemistry, Pogodinskaya Str., 10/8, Moscow, 119121, Russia
- Vladimir Poroikov
  - Institute of Biomedical Chemistry, Pogodinskaya Str., 10/8, Moscow, 119121, Russia
- Olga Tarasova
  - Institute of Biomedical Chemistry, Pogodinskaya Str., 10/8, Moscow, 119121, Russia
- Walter F de Azevedo Junior
  - Pontifical Catholic University of Rio Grande do Sul - PUCRS, Porto Alegre-RS, Brazil
  - Specialization Program in Bioinformatics, The Pontifical Catholic University of Rio Grande do Sul (PUCRS), Av. Ipiranga, 6681 Porto Alegre / RS 90619-900, Brazil
6
Yang Z, Wang Y, Ni X, Yang S. DeepDRP: Prediction of intrinsically disordered regions based on integrated view deep learning architecture from transformer-enhanced and protein information. Int J Biol Macromol 2023; 253:127390. [PMID: 37827403 DOI: 10.1016/j.ijbiomac.2023.127390] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/19/2023] [Revised: 09/20/2023] [Accepted: 10/09/2023] [Indexed: 10/14/2023]
Abstract
Intrinsic disorder in proteins, a widely distributed phenomenon in nature, is related to many crucial biological processes and various diseases. Traditional determination methods tend to be costly and labor-intensive; it is therefore desirable to find an accurate method for identifying intrinsically disordered proteins (IDPs). In this paper, we propose a novel Deep learning model for Intrinsically Disordered Regions in Proteins named DeepDRP. DeepDRP employs an innovative TimeDistributed strategy and a Bi-LSTM architecture to predict IDPs and is driven by integrated view features of PSSM, Energy-based encoding, AAindex, and transformer-enhanced embeddings including DR-BERT, OntoProtein, Prot-T5, and ESM-2. The comparison of different feature combinations indicates that the transformer-enhanced features contribute far more than traditional features to IDP prediction and that ESM-2 accounts for a larger contribution in the pre-trained fusion vectors. The ablation test verified that the TimeDistributed strategy increased model performance and is an efficient approach to IDP prediction. Compared with eight state-of-the-art methods on the DISORDER723, S1, and DisProt832 datasets, DeepDRP achieved a significantly higher Matthews correlation coefficient, outperforming competing methods by 4.90% to 36.20%, 11.80% to 26.33%, and 4.82% to 13.55%, respectively. In brief, DeepDRP is a reliable model for IDP prediction and is freely available at https://github.com/ZX-COLA/DeepDRP.
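For readers unfamiliar with the metric reported above, the Matthews correlation coefficient (MCC) for a binary task is MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below computes it from confusion-matrix counts. It is a generic implementation, not DeepDRP's evaluation code; treating disordered residues as the positive class and the example counts are assumptions made for illustration.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical per-residue counts (disordered = positive class).
example_mcc = matthews_corrcoef(tp=40, tn=45, fp=5, fn=10)
```

Unlike accuracy, MCC stays informative when the classes are imbalanced, which is typical for disorder prediction where ordered residues usually dominate.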
Affiliation(s)
- Zexi Yang
  - School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China
- Yan Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Xinye Ni
  - The Affiliated Changzhou No. 2 People's Hospital of Nanjing Medical University, Changzhou 213164, China
- Sen Yang
  - School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China; The Affiliated Changzhou No. 2 People's Hospital of Nanjing Medical University, Changzhou 213164, China
7
Raimondi D, Chizari H, Verplaetse N, Löscher BS, Franke A, Moreau Y. Genome interpretation in a federated learning context allows the multi-center exome-based risk prediction of Crohn's disease patients. Sci Rep 2023; 13:19449. [PMID: 37945674 PMCID: PMC10636050 DOI: 10.1038/s41598-023-46887-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Accepted: 11/06/2023] [Indexed: 11/12/2023] Open
Abstract
High-throughput sequencing allowed the discovery of many disease variants, but nowadays it is becoming clear that the abundance of genomics data mostly just moved the bottleneck in Genetics and Precision Medicine from a data availability issue to a data interpretation issue. To solve this impasse, it would be beneficial to apply the latest Deep Learning (DL) methods to the Genome Interpretation (GI) problem, similarly to what AlphaFold did for Structural Biology. Unfortunately, DL requires large datasets to be viable, and aggregating genomics datasets poses several legal, ethical and infrastructural complications. Federated Learning (FL) is a Machine Learning (ML) paradigm designed to tackle these issues. It allows ML methods to be collaboratively trained and tested on collections of physically separate datasets, without requiring the actual centralization of sensitive data. FL could thus be key to enabling DL applications to GI on sufficiently large genomics data. We propose FedCrohn, an FL GI Neural Network model for exome-based Crohn's Disease risk prediction, providing a proof-of-concept that FL is a viable paradigm to build novel ML GI approaches. We benchmark it in several realistic scenarios, showing that FL can indeed provide performance similar to conventional ML on centralized data, and that collaborating in FL initiatives is likely beneficial for most of the medical centers participating in them.
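The idea of training collaboratively without centralizing data can be sketched with federated averaging (FedAvg), a widely used aggregation rule in which each center trains locally and a coordinator averages the resulting parameters, weighted by local sample count. The abstract does not state which aggregation rule FedCrohn uses, so FedAvg is an assumption here, and the function name, parameter vectors, and cohort sizes are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Sample-size-weighted average of per-client parameter vectors.

    client_weights -- list of equal-length lists of floats, one per center
    client_sizes   -- number of local training samples at each center
    Only parameters leave each center; the raw data never do.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical medical centers with different cohort sizes.
center_params = [[1.0, 2.0], [3.0, 4.0]]
center_sizes = [1, 3]
global_params = federated_average(center_params, center_sizes)
```

In a full training loop this aggregation step would alternate with rounds of local training at each center, starting each round from the latest global parameters.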
Affiliation(s)
- Britt-Sabina Löscher
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany
  - University Medical Center Schleswig-Holstein, Kiel, Germany
- Andre Franke
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany
  - University Medical Center Schleswig-Holstein, Kiel, Germany
- Yves Moreau
  - ESAT-STADIUS, KU Leuven, 3001, Leuven, Belgium