1
Mishra VP, Singh YN, Khan F, Dutta MK. SeqDPI: A 1D-CNN approach for predicting binding affinity of kinase inhibitors. J Comput Chem 2025; 46:e27518. [PMID: 39644133 DOI: 10.1002/jcc.27518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Revised: 08/26/2024] [Accepted: 10/13/2024] [Indexed: 12/09/2024]
Abstract
Predicting drug-target binding affinity is highly relevant to modern drug discovery and drug repositioning, which help doctors arrive at new drugs or reuse existing drugs against new target proteins. In silico models that use advanced deep learning techniques can further assist these prediction tasks by identifying the most prominent drug-target pairs. With this in mind, this study develops a deep-learning-based algorithmic framework to support drug-target interaction prediction. The proposed SeqDPI model extracts the relevant drug and protein features from the one-dimensional sequential representation of the dataset using optimized CNNs that apply convolutions over amino acid subsequences of varying length to capture hidden patterns. The convolved drug-protein features are then fed to an L2-penalized feed-forward neural network, which matches the local residue patterns in protein classes with the molecular fingerprints of drugs to predict the binding strength for all drug-target pairs. The proposed model reduces the convolution strain typically encountered in existing in silico models that rely on complex 3D structures of drug-protein datasets. The results show that SeqDPI achieves a mean squared error (MSE) of 0.167 across cross-validation folds, outperforming baseline models such as KronRLS (0.406), SimBoost (0.226), and DeepPS (0.214). Additionally, SeqDPI attains a high concordance index (CI) of 0.9114 on the benchmark KIBA dataset, demonstrating its statistical significance and computational efficiency compared with existing methods. These results indicate the relevance and effectiveness of SeqDPI in accurately predicting binding affinities while working with simpler one-dimensional data, making it a robust and computationally cost-effective solution for drug-target interaction prediction.
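The two metrics reported in this abstract, MSE and CI, can be sketched in a few lines. This is only an illustrative sketch: it assumes CI denotes the concordance index as commonly used in binding-affinity benchmarks, and the function names and toy numbers are hypothetical rather than taken from the paper.

```python
from itertools import combinations


def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted affinities."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true affinity are skipped; tied predictions count as 0.5.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    concordant = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # no defined ordering for this pair
        comparable += 1
        if p_i == p_j:
            concordant += 0.5
        elif (p_i > p_j) == (t_i > t_j):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy affinities, purely illustrative (not KIBA data).
    measured = [1.0, 2.0, 3.0]
    predicted = [1.0, 3.0, 2.0]
    print("MSE:", mean_squared_error(measured, predicted))
    print("CI:", concordance_index(measured, predicted))
```

A lower MSE means smaller numerical error, while a CI closer to 1 means the model ranks drug-target pairs in nearly the same order as the measured affinities; the two can disagree, which is presumably why both are reported.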
Affiliation(s)
- Vinay Priy Mishra
- Centre for Advanced Studies, Dr. A.P.J. Abdul Kalam Technical University, Lucknow, India
- Yogendra Narain Singh
- Department of Computer Science & Engineering, Institute of Engineering and Technology, Lucknow, India
- Feroz Khan
- Technology Dissemination & Computational Biology Division, CSIR-Central Institute of Medicinal and Aromatic Plants, Lucknow, India
2
Wu Y, Xie L, Liu Y, Xie L. Semi-supervised meta-learning elucidates understudied molecular interactions. Commun Biol 2024; 7:1104. [PMID: 39251833 PMCID: PMC11383949 DOI: 10.1038/s42003-024-06797-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2024] [Accepted: 08/28/2024] [Indexed: 09/11/2024] Open
Abstract
Many biological problems are understudied due to experimental limitations and human biases. Although deep learning is promising in accelerating scientific discovery, its power is compromised when applied to problems with scarcely labeled data and data distribution shifts. We develop a deep learning framework, Meta Model Agnostic Pseudo Label Learning (MMAPLE), to address these challenges by effectively exploring out-of-distribution (OOD) unlabeled data when conventional transfer learning fails. MMAPLE is unique in integrating the concepts of meta-learning, transfer learning, and semi-supervised learning into a unified framework. The power of MMAPLE is demonstrated in three applications in an OOD setting where chemicals or proteins in unseen data are dramatically different from those in training data: predicting drug-target interactions, hidden human metabolite-enzyme interactions, and understudied interspecies microbiome metabolite-human receptor interactions. MMAPLE achieves 11% to 242% improvement in the prediction-recall on multiple OOD benchmarks over various base models. Using MMAPLE, we reveal novel interspecies metabolite-protein interactions that are validated by activity assays and fill in missing links in microbiome-human interactions. MMAPLE is a general framework to explore previously unrecognized biological domains beyond the reach of present experimental and computational techniques.
Affiliation(s)
- You Wu
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, NY, USA
- Li Xie
- Department of Computer Science, Hunter College, The City University of New York, New York, NY, USA
- Yang Liu
- Department of Computer Science, Hunter College, The City University of New York, New York, NY, USA
- Lei Xie
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, NY, USA.
- Department of Computer Science, Hunter College, The City University of New York, New York, NY, USA.
- Helen & Robert Appel Alzheimer's Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, NY, USA.
3
Lavecchia A. Advancing drug discovery with deep attention neural networks. Drug Discov Today 2024; 29:104067. [PMID: 38925473 DOI: 10.1016/j.drudis.2024.104067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Revised: 06/10/2024] [Accepted: 06/19/2024] [Indexed: 06/28/2024]
Abstract
In the dynamic field of drug discovery, deep attention neural networks are revolutionizing our approach to complex data. This review explores the attention mechanism and its extended architectures, including graph attention networks (GATs), transformers, bidirectional encoder representations from transformers (BERT), generative pre-trained transformers (GPTs) and bidirectional and auto-regressive transformers (BART). Delving into their core principles and multifaceted applications, we uncover their pivotal roles in catalyzing de novo drug design, predicting intricate molecular properties and deciphering elusive drug-target interactions. Despite challenges, these attention-based architectures hold unparalleled promise to drive transformative breakthroughs and accelerate progress in pharmaceutical research.
Affiliation(s)
- Antonio Lavecchia
- Drug Discovery Laboratory, Department of Pharmacy, University of Napoli Federico II, I-80131 Naples, Italy.
4
Wang K, Kim N, Bagherian M, Li K, Chou E, Colacino JA, Dolinoy DC, Sartor MA. Gene Target Prediction of Environmental Chemicals Using Coupled Matrix-Matrix Completion. Environ Sci Technol 2024; 58:5889-5898. [PMID: 38501580 PMCID: PMC11131040 DOI: 10.1021/acs.est.4c00458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2024]
Abstract
Human exposure to toxic chemicals presents a huge health burden. Key to understanding chemical toxicity is knowledge of the molecular target(s) of the chemicals. Because a comprehensive safety assessment for all chemicals is infeasible due to limited resources, a robust computational method for discovering targets of environmental exposures is a promising direction for public health research. In this study, we implemented a novel matrix completion algorithm named coupled matrix-matrix completion (CMMC) for predicting direct and indirect exposome-target interactions, which exploits the vast amount of accumulated data regarding chemical exposures and their molecular targets. Our approach achieved an AUC of 0.89 on a benchmark data set generated using data from the Comparative Toxicogenomics Database. Our case studies with bisphenol A and its analogues, PFAS, dioxins, PCBs, and VOCs show that CMMC can be used to accurately predict molecular targets of novel chemicals without any prior bioactivity knowledge. Our results demonstrate the feasibility and promise of computationally predicting environmental chemical-target interactions to efficiently prioritize chemicals in hazard identification and risk assessment.
Affiliation(s)
- Kai Wang
- Department of Computational Medicine and Bioinformatics, School of Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Nicole Kim
- Department of Computational Medicine and Bioinformatics, School of Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Maryam Bagherian
- Department of Computational Medicine and Bioinformatics, School of Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Michigan Institute for Data Science (MIDAS), University of Michigan, Ann Arbor, MI 48109, USA
- Kai Li
- Department of Computational Medicine and Bioinformatics, School of Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Elysia Chou
- Department of Computational Medicine and Bioinformatics, School of Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Justin A. Colacino
- Department of Environmental Health Sciences, School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
- Department of Nutritional Sciences, School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
- Dana C. Dolinoy
- Department of Environmental Health Sciences, School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
- Department of Nutritional Sciences, School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
- Maureen A. Sartor
- Department of Computational Medicine and Bioinformatics, School of Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
5
Wu Y, Xie L, Liu Y, Xie L. Model Agnostic Semi-Supervised Meta-Learning Elucidates Understudied Out-of-distribution Molecular Interactions. bioRxiv 2024:2023.05.17.541172. [PMID: 37292680 PMCID: PMC10245663 DOI: 10.1101/2023.05.17.541172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Many biological problems are understudied due to experimental limitations and human biases. Although deep learning is promising in accelerating scientific discovery, its power is compromised when applied to problems with scarcely labeled data and data distribution shifts. We developed a semi-supervised meta-learning framework, Meta Model Agnostic Pseudo Label Learning (MMAPLE), to address these challenges by effectively exploring out-of-distribution (OOD) unlabeled data when transfer learning fails. The power of MMAPLE is demonstrated in multiple applications: predicting OOD drug-target interactions, hidden human metabolite-enzyme interactions, and understudied interspecies microbiome metabolite-human receptor interactions, where chemicals or proteins in unseen data are dramatically different from those in training data. MMAPLE achieves 11% to 242% improvement in the prediction-recall on multiple OOD benchmarks over baseline models. Using MMAPLE, we reveal novel interspecies metabolite-protein interactions that are validated by bioactivity assays and fill in missing links in microbiome-human interactions. MMAPLE is a general framework to explore previously unrecognized biological domains beyond the reach of present experimental and computational techniques.
Affiliation(s)
- You Wu
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, New York, USA
- Li Xie
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, USA
- Yang Liu
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, USA
- Lei Xie
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, New York, USA
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, USA
- Helen & Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, New York, USA
6
Jobe A, Vijayan R. Orphan G protein-coupled receptors: the ongoing search for a home. Front Pharmacol 2024; 15:1349097. [PMID: 38495099 PMCID: PMC10941346 DOI: 10.3389/fphar.2024.1349097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2023] [Accepted: 02/15/2024] [Indexed: 03/19/2024] Open
Abstract
G protein-coupled receptors (GPCRs) make up the largest receptor superfamily, accounting for 4% of protein-coding genes. Despite the prevalence of such transmembrane receptors, a significant number remain orphans, lacking identified endogenous ligands. Since their conception, the reverse pharmacology approach has been used to characterize such receptors. However, the multifaceted and nuanced nature of GPCR signaling poses a great challenge to their pharmacological elucidation. Considering their therapeutic relevance, the search for native orphan GPCR ligands continues. Despite limited structural input in terms of 3D crystallized structures, with advances in machine-learning approaches, there has been great progress with respect to accurate ligand prediction. Though such an approach proves valuable given that ligand scarcity is the greatest hurdle to orphan GPCR deorphanization, the future pairings of the remaining orphan GPCRs may not necessarily take a one-size-fits-all approach but should be more comprehensive in accounting for numerous nuanced possibilities to cover the full spectrum of GPCR signaling.
Affiliation(s)
- Amie Jobe
- Department of Biology, College of Science, United Arab Emirates University, Al Ain, United Arab Emirates
- Ranjit Vijayan
- Department of Biology, College of Science, United Arab Emirates University, Al Ain, United Arab Emirates
- The Big Data Analytics Center, United Arab Emirates University, Al Ain, United Arab Emirates
- Zayed Bin Sultan Center for Health Sciences, United Arab Emirates University, Al Ain, United Arab Emirates
7
Zhang Y, Liu C, Liu M, Liu T, Lin H, Huang CB, Ning L. Attention is all you need: utilizing attention in AI-enabled drug discovery. Brief Bioinform 2023; 25:bbad467. [PMID: 38189543 PMCID: PMC10772984 DOI: 10.1093/bib/bbad467] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 09/30/2023] [Revised: 11/03/2023] [Accepted: 11/25/2023] [Indexed: 01/09/2024] Open
Abstract
Recently, the attention mechanism and its derived models have gained significant traction in drug development due to their outstanding performance and interpretability in handling complex data structures. This review offers an in-depth exploration of the principles underlying attention-based models and their advantages in drug discovery. We further elaborate on their applications in various aspects of drug development, from molecular screening and target binding to property prediction and molecule generation. Finally, we discuss the current challenges faced in the application of attention mechanisms and artificial intelligence technologies, including data quality, model interpretability and computational resource constraints, along with future directions for research. Given the accelerating pace of technological advancement, we believe that attention-based models will have an increasingly prominent role in future drug discovery. We anticipate that these models will usher in revolutionary breakthroughs in the pharmaceutical domain, significantly accelerating the pace of drug development.
Affiliation(s)
- Yang Zhang
- Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Caiqi Liu
- Department of Gastrointestinal Medical Oncology, Harbin Medical University Cancer Hospital, No.150 Haping Road, Nangang District, Harbin, Heilongjiang 150081, China
- Key Laboratory of Molecular Oncology of Heilongjiang Province, No.150 Haping Road, Nangang District, Harbin, Heilongjiang 150081, China
- Mujiexin Liu
- Chongqing Key Laboratory of Sichuan-Chongqing Co-construction for Diagnosis and Treatment of Infectious Diseases Integrated Traditional Chinese and Western Medicine, College of Medical Technology, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Tianyuan Liu
- Graduate School of Science and Technology, University of Tsukuba, Tsukuba, Japan
- Hao Lin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Cheng-Bing Huang
- School of Computer Science and Technology, Aba Teachers University, Aba, China
- Lin Ning
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
8
Braun M, Gruber CC, Krassnigg A, Kummer A, Lutz S, Oberdorfer G, Siirola E, Snajdrova R. Accelerating Biocatalysis Discovery with Machine Learning: A Paradigm Shift in Enzyme Engineering, Discovery, and Design. ACS Catal 2023; 13:14454-14469. [PMID: 37942268 PMCID: PMC10629211 DOI: 10.1021/acscatal.3c03417] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 07/25/2023] [Revised: 09/29/2023] [Accepted: 10/03/2023] [Indexed: 11/10/2023]
Abstract
Emerging computational tools promise to revolutionize protein engineering for biocatalytic applications and accelerate the development timelines previously needed to optimize an enzyme to its more efficient variant. For over a decade, the benefits of predictive algorithms have helped scientists and engineers navigate the complexity of functional protein sequence space. More recently, spurred by dramatic advances in underlying computational tools, the promise of faster, cheaper, and more accurate enzyme identification, characterization, and engineering has catapulted terms such as artificial intelligence and machine learning into the must-have vocabulary of the field. This Perspective aims to showcase the current status of applications in the pharmaceutical industry and also to discuss and celebrate the innovative approaches in protein science by highlighting their potential in selected recent developments and offering thoughts on future opportunities for biocatalysis. It also critically assesses the technology's limitations, unanswered questions, and unmet challenges.
Affiliation(s)
- Markus Braun
- Department of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010 Graz, Austria
- Christian C. Gruber
- Enzyme and Drug Discovery, Innophore, 1700 Montgomery Street, San Francisco, California 94111, United States
- Andreas Krassnigg
- Enzyme and Drug Discovery, Innophore, 1700 Montgomery Street, San Francisco, California 94111, United States
- Arkadij Kummer
- Moderna, Inc., 200 Technology Square, Cambridge, Massachusetts 02139, United States
- Stefan Lutz
- Codexis Inc., 200 Penobscot Drive, Redwood City, California 94063, United States
- Gustav Oberdorfer
- Department of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010 Graz, Austria
- Elina Siirola
- Novartis Institute for Biomedical Research, Global Discovery Chemistry, Basel CH-4108, Switzerland
- Radka Snajdrova
- Novartis Institute for Biomedical Research, Global Discovery Chemistry, Basel CH-4108, Switzerland
9
Chen L, Fan Z, Chang J, Yang R, Hou H, Guo H, Zhang Y, Yang T, Zhou C, Sui Q, Chen Z, Zheng C, Hao X, Zhang K, Cui R, Zhang Z, Ma H, Ding Y, Zhang N, Lu X, Luo X, Jiang H, Zhang S, Zheng M. Sequence-based drug design as a concept in computational drug design. Nat Commun 2023; 14:4217. [PMID: 37452028 PMCID: PMC10349078 DOI: 10.1038/s41467-023-39856-w] [Citation(s) in RCA: 40] [Impact Index Per Article: 20.0] [Received: 07/18/2022] [Accepted: 06/27/2023] [Indexed: 07/18/2023] Open
Abstract
Drug development based on target proteins has been a successful approach in recent decades. However, the conventional structure-based drug design (SBDD) pipeline is a complex, human-engineered process with multiple independently optimized steps. Here, we propose a sequence-to-drug concept for computational drug design based on protein sequence information by end-to-end differentiable learning. We validate this concept in three stages. First, we design TransformerCPI2.0 as a core tool for the concept, which demonstrates generalization ability across proteins and compounds. Second, we interpret the binding knowledge that TransformerCPI2.0 learned. Finally, we use TransformerCPI2.0 to discover new hits for challenging drug targets, and identify a new target for an existing drug based on an inverse application of the concept. Overall, this proof-of-concept study shows that the sequence-to-drug concept adds a perspective on drug design. It can serve as an alternative method to SBDD, particularly for proteins that do not yet have high-quality 3D structures available.
Affiliation(s)
- Lifan Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Zisheng Fan
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, 138 Xianlin Road, Jiangsu, Nanjing, 210023, China
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, No. 393 Huaxia Middle Road, Shanghai, 200031, China
- Jie Chang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, 138 Xianlin Road, Jiangsu, Nanjing, 210023, China
- Ruirui Yang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, No. 393 Huaxia Middle Road, Shanghai, 200031, China
- Hui Hou
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Hao Guo
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Yinghui Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Tianbiao Yang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Chenmao Zhou
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, 138 Xianlin Road, Jiangsu, Nanjing, 210023, China
- Qibang Sui
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Zhengyang Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Chen Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Xinyue Hao
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, 138 Xianlin Road, Jiangsu, Nanjing, 210023, China
- Keke Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, 138 Xianlin Road, Jiangsu, Nanjing, 210023, China
- Rongrong Cui
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Zehong Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Hudson Ma
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Yiluan Ding
- Department of Analytical Chemistry, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Naixia Zhang
- Department of Analytical Chemistry, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Xiaojie Lu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Xiaomin Luo
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Hualiang Jiang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, 138 Xianlin Road, Jiangsu, Nanjing, 210023, China
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, No. 393 Huaxia Middle Road, Shanghai, 200031, China
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, 1 Sub-lane Xiangshan, Hangzhou, 310024, China
- Sulin Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China.
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China.
- Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China.
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China.
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, 138 Xianlin Road, Jiangsu, Nanjing, 210023, China.
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, No. 393 Huaxia Middle Road, Shanghai, 200031, China.
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, 1 Sub-lane Xiangshan, Hangzhou, 310024, China.
10
Dou B, Zhu Z, Merkurjev E, Ke L, Chen L, Jiang J, Zhu Y, Liu J, Zhang B, Wei GW. Machine Learning Methods for Small Data Challenges in Molecular Science. Chem Rev 2023; 123:8736-8780. [PMID: 37384816 PMCID: PMC10999174 DOI: 10.1021/acs.chemrev.3c00189] [Citation(s) in RCA: 79] [Impact Index Per Article: 39.5] [Indexed: 07/01/2023]
Abstract
Small data are often used in scientific and engineering research due to the presence of various constraints, such as time, cost, ethics, privacy, security, and technical limitations in data acquisition. However, because big data have been the focus for the past decade, small data and their challenges have received little attention, even though they are technically more severe in machine learning (ML) and deep learning (DL) studies. Overall, the small data challenge is often compounded by issues such as data diversity, imputation, noise, imbalance, and high dimensionality. Fortunately, the current big data era is characterized by technological breakthroughs in ML, DL, and artificial intelligence (AI), which enable data-driven scientific discovery, and many advanced ML and DL technologies developed for big data have inadvertently provided solutions for small data problems. As a result, significant progress has been made in ML and DL for small data challenges in the past decade. In this review, we summarize and analyze several emerging potential solutions to small data challenges in molecular science, including chemical and biological sciences. We review both basic machine learning algorithms, such as linear regression, logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), kernel learning (KL), random forest (RF), and gradient boosting trees (GBT), and more advanced techniques, including artificial neural network (ANN), convolutional neural network (CNN), U-Net, graph neural network (GNN), generative adversarial network (GAN), long short-term memory (LSTM), autoencoder, transformer, transfer learning, active learning, graph-based semi-supervised learning, combining deep learning with traditional machine learning, and physical model-based data augmentation. We also briefly discuss the latest advances in these methods. Finally, we conclude the survey with a discussion of promising trends in small data challenges in molecular science.
Affiliation(s)
- Bozheng Dou
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Zailiang Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Ekaterina Merkurjev
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Lu Ke
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Long Chen
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Jian Jiang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Yueying Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Jie Liu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Bengong Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
11
Chandra A, Tünnermann L, Löfstedt T, Gratz R. Transformer-based deep learning for predicting protein properties in the life sciences. eLife 2023; 12:e82819. [PMID: 36651724 PMCID: PMC9848389 DOI: 10.7554/elife.82819] [Citation(s) in RCA: 43] [Impact Index Per Article: 21.5] [Received: 08/22/2022] [Accepted: 01/06/2023] [Indexed: 01/19/2023] Open
Abstract
Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and proteins with known properties based on lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field of natural language processing is growing quickly because of developments in a class of models based on a particular model, the Transformer model. We review recent developments and the use of large-scale Transformer models in applications for predicting protein characteristics and how such models can be used to predict, for example, post-translational modifications. We review shortcomings of other deep learning models and explain how the Transformer models have quickly proven to be a very promising way to unravel information hidden in the sequences of amino acids.
Affiliation(s)
- Abel Chandra
- Department of Computing Science, Umeå University, Umeå, Sweden
- Laura Tünnermann
- Umeå Plant Science Centre (UPSC), Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
- Tommy Löfstedt
- Department of Computing Science, Umeå University, Umeå, Sweden
- Regina Gratz
- Umeå Plant Science Centre (UPSC), Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
- Department of Forest Ecology and Management, Swedish University of Agricultural Sciences, Umeå, Sweden
12
Cai T, Xie L, Zhang S, Chen M, He D, Badkul A, Liu Y, Namballa HK, Dorogan M, Harding WW, Mura C, Bourne PE, Xie L. End-to-end sequence-structure-function meta-learning predicts genome-wide chemical-protein interactions for dark proteins. PLoS Comput Biol 2023; 19:e1010851. [PMID: 36652496 PMCID: PMC9886305 DOI: 10.1371/journal.pcbi.1010851] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/28/2022] [Revised: 01/30/2023] [Accepted: 01/05/2023] [Indexed: 01/19/2023] Open
Abstract
Systematically discovering protein-ligand interactions across the entire human and pathogen genomes is critical in chemical genomics, protein function prediction, drug discovery, and many other areas. However, more than 90% of gene families remain "dark", i.e., their small-molecule ligands are undiscovered due to experimental limitations or human/historical biases. Existing computational approaches typically fail when the dark protein differs from those with known ligands. To address this challenge, we have developed a deep learning framework, called PortalCG, which consists of four novel components: (i) a 3-dimensional ligand binding site enhanced sequence pre-training strategy to encode the evolutionary links between ligand-binding sites across gene families; (ii) an end-to-end pretraining-fine-tuning strategy to reduce the impact of inaccuracy of predicted structures on function predictions by recognizing the sequence-structure-function paradigm; (iii) a new out-of-cluster meta-learning algorithm that extracts and accumulates information learned from predicting ligands of distinct gene families (meta-data) and applies the meta-data to a dark gene family; and (iv) a stress model selection step, using different gene families in the test data from those in the training and development data sets to facilitate model deployment in a real-world scenario. In extensive and rigorous benchmark experiments, PortalCG considerably outperformed state-of-the-art techniques of machine learning and protein-ligand docking when applied to dark gene families, and demonstrated its generalization power for target identifications and compound screenings under out-of-distribution (OOD) scenarios. Furthermore, in an external validation for the multi-target compound screening, the performance of PortalCG surpassed the rational design from medicinal chemists.
Our results also suggest that a differentiable sequence-structure-function deep learning framework, where protein structural information serves as an intermediate layer, could be superior to conventional methodology where predicted protein structures were used for the compound screening. We applied PortalCG to two case studies to exemplify its potential in drug discovery: designing selective dual-antagonists of dopamine receptors for the treatment of opioid use disorder (OUD), and illuminating the understudied human genome for target diseases that do not yet have effective and safe therapeutics. Our results suggested that PortalCG is a viable solution to the OOD problem in exploring understudied regions of protein functional space.
Affiliation(s)
- Tian Cai
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, New York, United States of America
- Li Xie
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, United States of America
- Shuo Zhang
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, New York, United States of America
- Muge Chen
- Master Program in Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, New York, United States of America
- Di He
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, New York, United States of America
- Amitesh Badkul
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, United States of America
- Yang Liu
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, United States of America
- Hari Krishna Namballa
- Department of Chemistry, Hunter College, The City University of New York, New York, New York, United States of America
- Michael Dorogan
- Department of Chemistry, Hunter College, The City University of New York, New York, New York, United States of America
- Wayne W. Harding
- Department of Chemistry, Hunter College, The City University of New York, New York, New York, United States of America
- Cameron Mura
- School of Data Science & Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia, United States of America
- Philip E. Bourne
- School of Data Science & Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia, United States of America
- Lei Xie
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, New York, United States of America
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, United States of America
- Helen and Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, New York, United States of America
13
Lin S, Shi C, Chen J. GeneralizedDTA: combining pre-training and multi-task learning to predict drug-target binding affinity for unknown drug discovery. BMC Bioinformatics 2022; 23:367. [PMID: 36071406 PMCID: PMC9449940 DOI: 10.1186/s12859-022-04905-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 01/26/2022] [Accepted: 08/23/2022] [Indexed: 12/04/2022] Open
Abstract
Background: Accurately predicting drug-target binding affinity (DTA) in silico plays an important role in drug discovery. Most of the computational methods developed for predicting DTA use machine learning models, especially deep neural networks, and depend on large-scale labelled data. However, it is difficult to learn enough feature representation from tens of millions of compounds and hundreds of thousands of proteins only based on relatively limited labelled drug-target data. There are a large number of unknown drugs, which never appear in the labelled drug-target data. This is a kind of out-of-distribution problem in biomedicine. Some recent studies adopted self-supervised pre-training tasks to learn structural information of amino acid sequences for enhancing the feature representation of proteins. However, the task gap between pre-training and DTA prediction brings the catastrophic forgetting problem, which hinders the full application of feature representation in DTA prediction and seriously affects the generalization capability of models for unknown drug discovery. Results: To address these problems, we propose GeneralizedDTA, which is a new DTA prediction model oriented to unknown drug discovery, by combining pre-training and multi-task learning. We introduce self-supervised protein and drug pre-training tasks to learn richer structural information from amino acid sequences of proteins and molecular graphs of drug compounds, in order to alleviate the problem of high variance caused by encoding based on deep neural networks and accelerate the convergence of the prediction model on small-scale labelled data. We also develop a multi-task learning framework with a dual adaptation mechanism to narrow the task gap between pre-training and prediction for preventing overfitting and improving the generalization capability of the DTA prediction model on unknown drug discovery.
To validate the effectiveness of our model, we construct an unknown drug data set to simulate the scenario of unknown drug discovery. Compared with existing DTA prediction models, the experimental results show that our model has higher generalization capability in the DTA prediction of unknown drugs. Conclusions: The advantages of our model are mainly attributed to two kinds of pre-training tasks and the multi-task learning framework, which can learn richer structural information of proteins and drugs from large-scale unlabeled data, and then effectively integrate it into the downstream prediction task for obtaining a high-quality DTA prediction in unknown drug discovery.
Affiliation(s)
- Shaofu Lin
- Faculty of Information Technology, Beijing University of Technology, No. 100, Pingleyuan, Chaoyang District, Beijing, 100124, China
- Chengyu Shi
- Faculty of Information Technology, Beijing University of Technology, No. 100, Pingleyuan, Chaoyang District, Beijing, 100124, China
- Jianhui Chen
- Faculty of Information Technology, Beijing University of Technology, No. 100, Pingleyuan, Chaoyang District, Beijing, 100124, China
- Beijing International Collaboration Base on Brain Informatics and Wisdom Services, Beijing University of Technology, No. 100, Pingleyuan, Chaoyang District, Beijing, 100124, China
- Beijing Key Laboratory of MRI and Brain Informatics, Beijing University of Technology, No. 100, Pingleyuan, Chaoyang District, Beijing, 100124, China
14
Tan RK, Liu Y, Xie L. Reinforcement learning for systems pharmacology-oriented and personalized drug design. Expert Opin Drug Discov 2022; 17:849-863. [PMID: 35510835 PMCID: PMC9824901 DOI: 10.1080/17460441.2022.2072288] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 01/11/2023]
Abstract
Introduction: Many multi-genic systemic diseases such as neurological disorders, inflammatory diseases, and the majority of cancers do not have effective treatments yet. Reinforcement learning powered systems pharmacology is a potentially effective approach to designing personalized therapies for untreatable complex diseases. Areas covered: In this survey, state-of-the-art reinforcement learning methods and their latest applications to drug design are reviewed. The challenges of harnessing reinforcement learning for systems pharmacology and personalized medicine are discussed. Potential solutions to overcome the challenges are proposed. Expert opinion: In spite of the successful application of advanced reinforcement learning techniques to target-based drug discovery, new reinforcement learning strategies are needed to address systems pharmacology-oriented personalized de novo drug design.
Affiliation(s)
- Ryan K. Tan
- Department of Computer Science, Hunter College, The City University of New York
- Yang Liu
- Department of Computer Science, Hunter College, The City University of New York
- Lei Xie
- Department of Computer Science, Hunter College, The City University of New York
- Ph.D. Program in Computer Science, Biology & Biochemistry, The Graduate Center, The City University of New York
- Helen and Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University
- Correspondence should be addressed to Lei Xie -
15
Cai T, Abbu KA, Liu Y, Xie L. DeepREAL: A Deep Learning Powered Multi-scale Modeling Framework for Predicting Out-of-distribution Ligand-induced GPCR Activity. Bioinformatics 2022; 38:2561-2570. [PMID: 35274689 PMCID: PMC9048666 DOI: 10.1093/bioinformatics/btac154] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/25/2021] [Revised: 02/18/2022] [Accepted: 03/10/2022] [Indexed: 11/20/2022] Open
Abstract
Motivation: Drug discovery has witnessed intensive exploration of predictive modeling of drug–target physical interactions over two decades. However, a critical knowledge gap needs to be filled for correlating drug–target interactions with clinical outcomes: predicting genome-wide receptor activities or function selectivity, especially agonist versus antagonist, induced by novel chemicals. Two major obstacles compound the difficulty of this task: known data of receptor activity is far too scarce to train a robust model in light of genome-scale applications, and real-world applications need to deploy a model on data from various shifted distributions. Results: To address these challenges, we have developed an end-to-end deep learning framework, DeepREAL, for multi-scale modeling of genome-wide ligand-induced receptor activities. DeepREAL utilizes self-supervised learning on tens of millions of protein sequences and pre-trained binary interaction classification to solve the data distribution shift and data scarcity problems. Extensive benchmark studies on G-protein coupled receptors (GPCRs), which simulate real-world scenarios, demonstrate that DeepREAL achieves state-of-the-art performances in out-of-distribution settings. DeepREAL can be extended to other gene families beyond GPCRs. Availability and implementation: All data used are downloaded from Pfam (Mistry et al., 2020), GLASS (Chan et al., 2015) and IUPHAR/BPS and the data from reference (Sakamuru et al., 2021). Readers are directed to their official websites for original data. Code is available on GitHub at https://github.com/XieResearchGroup/DeepREAL. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tian Cai
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, 10016, USA
- Kyra Alyssa Abbu
- Department of Computer Science, Hunter College, The City University of New York, New York, 10065, USA
- Yang Liu
- Department of Computer Science, Hunter College, The City University of New York, New York, 10065, USA
- Lei Xie
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, 10016, USA
- Department of Computer Science, Hunter College, The City University of New York, New York, 10065, USA
- Helen and Robert Appel Alzheimer's Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, 10021, USA
16
Lee I, Nam H. Sequence-based prediction of protein binding regions and drug-target interactions. J Cheminform 2022; 14:5. [PMID: 35135622 PMCID: PMC8822694 DOI: 10.1186/s13321-022-00584-w] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 07/04/2021] [Accepted: 01/20/2022] [Indexed: 12/19/2022] Open
Abstract
Identifying drug-target interactions (DTIs) is important for drug discovery. However, searching all drug-target spaces poses a major bottleneck. Therefore, recently many deep learning models have been proposed to address this problem. However, the developers of these deep learning models have neglected interpretability in model construction, which is closely related to a model's performance. We hypothesized that training a model to predict important regions on a protein sequence would increase DTI prediction performance and provide a more interpretable model. Consequently, we constructed a deep learning model, named Highlights on Target Sequences (HoTS), which predicts binding regions (BRs) between a protein sequence and a drug ligand, as well as DTIs between them. To train the model, we collected complexes of protein-ligand interactions and protein sequences of binding sites and pretrained the model to predict BRs for a given protein sequence-ligand pair via object detection employing transformers. After pretraining the BR prediction, we trained the model to predict DTIs from a compound token designed to assign attention to BRs. We confirmed that training the BRs prediction model indeed improved the DTI prediction performance. The proposed HoTS model showed good performance in BR prediction on independent test datasets even though it does not use 3D structure information in its prediction. Furthermore, the HoTS model achieved the best performance in DTI prediction on test datasets. Additional analysis confirmed the appropriate attention for BRs and the importance of transformers in BR and DTI prediction. The source code is available on GitHub ( https://github.com/GIST-CSBL/HoTS ).
Affiliation(s)
- Ingoo Lee
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Buk-ku, Gwangju, 61005, Republic of Korea
- Hojung Nam
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Buk-ku, Gwangju, 61005, Republic of Korea
17
Cai T, Xie L, Chen M, Liu Y, He D, Zhang S, Mura C, Bourne PE, Xie L. Exploration of Dark Chemical Genomics Space via Portal Learning: Applied to Targeting the Undruggable Genome and COVID-19 Anti-Infective Polypharmacology. RESEARCH SQUARE 2021:rs.3.rs-1109318. [PMID: 34873596 PMCID: PMC8647653 DOI: 10.21203/rs.3.rs-1109318/v1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/15/2022]
Abstract
Advances in biomedicine are largely fueled by exploring uncharted territories of human biology. Machine learning can both enable and accelerate discovery, but faces a fundamental hurdle when applied to unseen data with distributions that differ from previously observed ones, a common dilemma in scientific inquiry. We have developed a new deep learning framework, called Portal Learning, to explore dark chemical and biological space. Three key, novel components of our approach include: (i) end-to-end, step-wise transfer learning, in recognition of biology's sequence-structure-function paradigm, (ii) out-of-cluster meta-learning, and (iii) stress model selection. Portal Learning provides a practical solution to the out-of-distribution (OOD) problem in statistical machine learning. Here, we have implemented Portal Learning to predict chemical-protein interactions on a genome-wide scale. Systematic studies demonstrate that Portal Learning can effectively assign ligands to unexplored gene families (unknown functions), versus existing state-of-the-art methods. Compared with AlphaFold2-based protein-ligand docking, Portal Learning significantly improved the performance by 79% in PR-AUC and 27% in ROC-AUC, respectively. The superior performance of Portal Learning allowed us to target previously "undruggable" proteins and design novel polypharmacological agents for disrupting interactions between SARS-CoV-2 and human proteins. Portal Learning is general-purpose and can be further applied to other areas of scientific inquiry.
Affiliation(s)
- Tian Cai
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, 10016, USA
- Li Xie
- Department of Computer Science, Hunter College, The City University of New York, New York, 10065, USA
- Muge Chen
- Master Program in Computer Science, Courant Institute of Mathematical Sciences, New York University
- Yang Liu
- Department of Computer Science, Hunter College, The City University of New York, New York, 10065, USA
- Di He
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, 10016, USA
- Shuo Zhang
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, 10016, USA
- Cameron Mura
- School of Data Science & Department of Biomedical Engineering, University of Virginia, Virginia, 22903, USA
- Philip E. Bourne
- School of Data Science & Department of Biomedical Engineering, University of Virginia, Virginia, 22903, USA
- Lei Xie
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, 10016, USA
- Department of Computer Science, Hunter College, The City University of New York, New York, 10065, USA
- Helen and Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, 10021, USA