1
Joeres R, Blumenthal DB, Kalinina OV. Data splitting to avoid information leakage with DataSAIL. Nat Commun 2025; 16:3337. [PMID: 40199913 PMCID: PMC11978981 DOI: 10.1038/s41467-025-58606-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2024] [Accepted: 03/28/2025] [Indexed: 04/10/2025] Open
Abstract
Information leakage is an increasingly important topic in machine learning research for biomedical applications. When information leakage happens during a model's training, it risks memorizing the training data instead of learning generalizable properties. This can lead to inflated performance metrics that do not reflect the actual performance at inference time. We present DataSAIL, a versatile Python package to facilitate leakage-reduced data splitting to enable realistic evaluation of machine learning models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL is based on formulating the problem to find leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. Finally, we empirically demonstrate DataSAIL's impact on evaluating biomedical machine learning models.
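The idea of leakage-reduced splitting described in this abstract can be illustrated with a minimal Python sketch. It is not DataSAIL's API and not its integer linear programming formulation: it assumes cluster labels have already been computed by some similarity clustering, and it uses a simple greedy assignment as a stand-in for the optimization step. The function name `cluster_split` and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def cluster_split(cluster_of, fractions):
    """Assign whole clusters to splits so that no cluster straddles two splits.

    cluster_of: dict mapping item id -> cluster id (assumed precomputed).
    fractions:  dict mapping split name -> target fraction of all items.
    Greedy stand-in for an exact optimizer: largest clusters are placed first,
    each into the split that is currently furthest below its target size.
    """
    clusters = defaultdict(list)
    for item, cid in cluster_of.items():
        clusters[cid].append(item)

    total = len(cluster_of)
    targets = {name: frac * total for name, frac in fractions.items()}
    splits = {name: [] for name in fractions}

    # Largest cluster first; ties broken by cluster id for determinism.
    for cid in sorted(clusters, key=lambda c: (-len(clusters[c]), str(c))):
        name = max(splits, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(clusters[cid])
    return splits


# Toy data: ten hypothetical proteins grouped into four similarity clusters.
toy_clusters = {
    "p1": "A", "p2": "A", "p3": "A", "p4": "A",
    "p5": "B", "p6": "B", "p7": "B",
    "p8": "C", "p9": "C",
    "p10": "D",
}
splits = cluster_split(toy_clusters, {"train": 0.8, "test": 0.2})
```

Because clusters are moved as units, similar items never end up on both sides of the split. The price is that the achieved split sizes only approximate the requested fractions.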
Affiliation(s)
- Roman Joeres
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany.
  - Center for Bioinformatics, Saarland University, Saarbrücken, Germany.
  - Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg, Sweden.
  - Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Gothenburg, Sweden.
- David B Blumenthal
  - Department Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Olga V Kalinina
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany
  - Center for Bioinformatics, Saarland University, Saarbrücken, Germany
  - Medical Faculty, Saarland University, Homburg, Germany
2
Howladar N, Kabir MWU, Hoque F, Katebi A, Hoque MT. PPILS: Protein-protein interaction prediction with language of biological coding. Comput Biol Med 2025; 186:109678. [PMID: 39832439 DOI: 10.1016/j.compbiomed.2025.109678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 01/03/2025] [Accepted: 01/12/2025] [Indexed: 01/22/2025]
Abstract
Protein-protein interactions within a cell are essential for various fundamental biological processes. Computational techniques have arisen in bioinformatics due to the challenging and resource-intensive nature of experimental protein pair interaction studies. This research seeks to create a cutting-edge machine learning method for predicting protein pair interactions using carefully chosen input features and leveraging evolutionary data. PPILS leverages evolutionary knowledge from the protein language model. It develops an encoder-decoder architecture with light attention. The trained model obtains protein embeddings from a language model and employs a light attention-based encoder, where a single convolution operation generates attention. A subsequent convolution is applied to input features, creating a representative construct for the protein interaction prediction. These encoded representations are then channeled into the decoder to predict protein interactions. Our findings indicated that PPILS outperformed existing methods in PPI prediction. The proposed method could be essential in protein-protein interaction prediction, further accelerating the discovery of protein-based drugs.
Affiliation(s)
- Nayan Howladar
  - Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
- Md Wasi Ul Kabir
  - Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
- Foyzul Hoque
  - Department of Computer Science & Engineering, Independent University, Bangladesh.
- Ataur Katebi
  - Department of Bioengineering, Northeastern University, Boston, MA, USA; Center for Theoretical Biological Physics, Northeastern University, Boston, MA, USA.
- Md Tamjidul Hoque
  - Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
3
Heinzinger M, Rost B. Artificial Intelligence Learns Protein Prediction. Cold Spring Harb Perspect Biol 2024; 16:a041458. [PMID: 38858069 PMCID: PMC11368192 DOI: 10.1101/cshperspect.a041458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2024]
Abstract
From AlphaGO over StableDiffusion to ChatGPT, the recent decade of exponential advances in artificial intelligence (AI) has been altering life. In parallel, advances in computational biology are beginning to decode the language of life: AlphaFold2 leaped forward in protein structure prediction, and protein language models (pLMs) replaced expertise and evolutionary information from multiple sequence alignments with information learned from reoccurring patterns in databases of billions of proteins without experimental annotations other than the amino acid sequences. None of those tools could have been developed 10 years ago; all will increase the wealth of experimental data and speed up the cycle from idea to proof. AI is affecting molecular and medical biology at giant steps, and the most important might be the leap toward more powerful protein design.
Affiliation(s)
- Michael Heinzinger
  - Technical University of Munich (TUM) School of Computation, Information and Technology (CIT), Bioinformatics and Computational Biology - i12, 85748 Garching/Munich, Germany
- Burkhard Rost
  - Technical University of Munich (TUM) School of Computation, Information and Technology (CIT), Bioinformatics and Computational Biology - i12, 85748 Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), 85748 Garching/Munich, Germany
  - TUM School of Life Sciences Weihenstephan (WZW), 85354 Freising, Germany
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
4
Zitnik M, Li MM, Wells A, Glass K, Morselli Gysi D, Krishnan A, Murali TM, Radivojac P, Roy S, Baudot A, Bozdag S, Chen DZ, Cowen L, Devkota K, Gitter A, Gosline SJC, Gu P, Guzzi PH, Huang H, Jiang M, Kesimoglu ZN, Koyuturk M, Ma J, Pico AR, Pržulj N, Przytycka TM, Raphael BJ, Ritz A, Sharan R, Shen Y, Singh M, Slonim DK, Tong H, Yang XH, Yoon BJ, Yu H, Milenković T. Current and future directions in network biology. Bioinformatics Advances 2024; 4:vbae099. [PMID: 39143982 PMCID: PMC11321866 DOI: 10.1093/bioadv/vbae099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2023] [Revised: 05/31/2024] [Accepted: 07/08/2024] [Indexed: 08/16/2024]
Abstract
Summary: Network biology is an interdisciplinary field bridging computational and biological sciences that has proved pivotal in advancing the understanding of cellular functions and diseases across biological systems and scales. Although the field has been around for two decades, it remains nascent. It has witnessed rapid evolution, accompanied by emerging challenges. These stem from various factors, notably the growing complexity and volume of data together with the increased diversity of data types describing different tiers of biological organization. We discuss prevailing research directions in network biology, focusing on molecular/cellular networks but also on other biological network types such as biomedical knowledge graphs, patient similarity networks, brain networks, and social/contact networks relevant to disease spread. In more detail, we highlight areas of inference and comparison of biological networks, multimodal data integration and heterogeneous networks, higher-order network analysis, machine learning on networks, and network-based personalized medicine. Following the overview of recent breakthroughs across these five areas, we offer a perspective on future directions of network biology. Additionally, we discuss scientific communities, educational initiatives, and the importance of fostering diversity within the field. This article establishes a roadmap for an immediate and long-term vision for network biology.
Availability and implementation: Not applicable.
Affiliation(s)
- Marinka Zitnik
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
- Michelle M Li
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
- Aydin Wells
  - Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
  - Lucy Family Institute for Data and Society, University of Notre Dame, Notre Dame, IN 46556, United States
  - Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, United States
- Kimberly Glass
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, United States
- Deisy Morselli Gysi
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, United States
  - Department of Statistics, Federal University of Paraná, Curitiba, Paraná 81530-015, Brazil
  - Department of Physics, Northeastern University, Boston, MA 02115, United States
- Arjun Krishnan
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, United States
- T M Murali
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States
- Predrag Radivojac
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, United States
- Sushmita Roy
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53715, United States
  - Wisconsin Institute for Discovery, Madison, WI 53715, United States
- Anaïs Baudot
  - Aix Marseille Université, INSERM, MMG, Marseille, France
- Serdar Bozdag
  - Department of Computer Science and Engineering, University of North Texas, Denton, TX 76203, United States
  - Department of Mathematics, University of North Texas, Denton, TX 76203, United States
- Danny Z Chen
  - Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
- Lenore Cowen
  - Department of Computer Science, Tufts University, Medford, MA 02155, United States
- Kapil Devkota
  - Department of Computer Science, Tufts University, Medford, MA 02155, United States
- Anthony Gitter
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53715, United States
  - Morgridge Institute for Research, Madison, WI 53715, United States
- Sara J C Gosline
  - Biological Sciences Division, Pacific Northwest National Laboratory, Seattle, WA 98109, United States
- Pengfei Gu
  - Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
- Pietro H Guzzi
  - Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Catanzaro, 88100, Italy
- Heng Huang
  - Department of Computer Science, University of Maryland College Park, College Park, MD 20742, United States
- Meng Jiang
  - Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
- Ziynet Nesibe Kesimoglu
  - Department of Computer Science and Engineering, University of North Texas, Denton, TX 76203, United States
  - National Center of Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20814, United States
- Mehmet Koyuturk
  - Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, United States
- Jian Ma
  - Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Alexander R Pico
  - Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA 94158, United States
- Nataša Pržulj
  - Department of Computer Science, University College London, London, WC1E 6BT, England
  - ICREA, Catalan Institution for Research and Advanced Studies, Barcelona, 08010, Spain
  - Barcelona Supercomputing Center (BSC), Barcelona, 08034, Spain
- Teresa M Przytycka
  - National Center of Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20814, United States
- Benjamin J Raphael
  - Department of Computer Science, Princeton University, Princeton, NJ 08544, United States
- Anna Ritz
  - Department of Biology, Reed College, Portland, OR 97202, United States
- Roded Sharan
  - School of Computer Science, Tel Aviv University, Tel Aviv, 69978, Israel
- Yang Shen
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
- Mona Singh
  - Department of Computer Science, Princeton University, Princeton, NJ 08544, United States
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, United States
- Donna K Slonim
  - Department of Computer Science, Tufts University, Medford, MA 02155, United States
- Hanghang Tong
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Xinan Holly Yang
  - Department of Pediatrics, University of Chicago, Chicago, IL 60637, United States
- Byung-Jun Yoon
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
  - Computational Science Initiative, Brookhaven National Laboratory, Upton, NY 11973, United States
- Haiyuan Yu
  - Department of Computational Biology, Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, United States
- Tijana Milenković
  - Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
  - Lucy Family Institute for Data and Society, University of Notre Dame, Notre Dame, IN 46556, United States
  - Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, United States
5
Szymborski J, Emad A. INTREPPPID-an orthologue-informed quintuplet network for cross-species prediction of protein-protein interaction. Brief Bioinform 2024; 25:bbae405. [PMID: 39171984 PMCID: PMC11339867 DOI: 10.1093/bib/bbae405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2024] [Revised: 07/25/2024] [Accepted: 07/31/2024] [Indexed: 08/23/2024] Open
Abstract
An overwhelming majority of protein-protein interaction (PPI) studies are conducted in a select few model organisms largely due to constraints in time and cost of the associated 'wet lab' experiments. In silico PPI inference methods are ideal tools to overcome these limitations, but often struggle with cross-species predictions. We present INTREPPPID, a method that incorporates orthology data using a new 'quintuplet' neural network, which is constructed with five parallel encoders with shared parameters. INTREPPPID incorporates both a PPI classification task and an orthologous locality task. The latter learns embeddings of orthologues that have small Euclidean distances between them and large distances between embeddings of all other proteins. INTREPPPID outperforms all other leading PPI inference methods tested on both the intraspecies and cross-species tasks using strict evaluation datasets. We show that INTREPPPID's orthologous locality loss increases performance because of the biological relevance of the orthologue data and not due to some other specious aspect of the architecture. Finally, we introduce PPI.bio and PPI Origami, a web server interface for INTREPPPID and a software tool for creating strict evaluation datasets, respectively. Together, these two initiatives aim to make both the use and development of PPI inference tools more accessible to the community.
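The orthologous locality objective described in this abstract (small Euclidean distances between orthologues, large distances to other proteins) resembles a margin-based distance loss. The sketch below uses a standard triplet margin loss as a stand-in; it is an assumption for illustration, not the loss defined in the INTREPPPID paper, and the function names and the `margin` parameter are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, orthologue, other, margin=1.0):
    """Zero when the orthologue is closer to the anchor than the unrelated
    protein by at least `margin`; positive (and growing) otherwise."""
    d_pos = euclidean(anchor, orthologue)
    d_neg = euclidean(anchor, other)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings: the orthologue is near the anchor, the other protein is far.
loss_good = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Roles swapped: the unrelated protein is now the closer one, so the loss is positive.
loss_bad = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Minimizing a loss of this shape pulls orthologue embeddings together and pushes unrelated embeddings apart, which is the behavior the abstract attributes to the locality task.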
Affiliation(s)
- Joseph Szymborski
  - Department of Electrical and Computer Engineering, McGill University, 845 Sherbrooke Street West, Montréal, QC H3A 0G4, Canada
  - Mila, Quebec AI Institute, 6666 St-Urbain Street #200, Montréal, QC H2S 3H1, Canada
- Amin Emad
  - Department of Electrical and Computer Engineering, McGill University, 845 Sherbrooke Street West, Montréal, QC H3A 0G4, Canada
  - Mila, Quebec AI Institute, 6666 St-Urbain Street #200, Montréal, QC H2S 3H1, Canada
  - The Rosalind and Morris Goodman Cancer Institute, 1160 Pine Avenue, Montréal, QC H3A 1A3, Canada
6
Huang C, Zhang L, Tang T, Wang H, Jiang Y, Ren H, Zhang Y, Fang J, Zhang W, Jia X, You S, Qin B. Application of Directed Evolution and Machine Learning to Enhance the Diastereoselectivity of Ketoreductase for Dihydrotetrabenazine Synthesis. JACS Au 2024; 4:2547-2556. [PMID: 39055154 PMCID: PMC11267543 DOI: 10.1021/jacsau.4c00284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2024] [Revised: 06/13/2024] [Accepted: 06/20/2024] [Indexed: 07/27/2024]
Abstract
Biocatalysis is an effective approach for producing chiral drug intermediates that are often difficult to synthesize using traditional chemical methods. A time-efficient strategy is required to accelerate the directed evolution process to achieve the desired enzyme function. In this research, we evaluated machine learning-assisted directed evolution as a potential approach for enzyme engineering, using a moderately diastereoselective ketoreductase library as a model system. Machine learning-assisted directed evolution and traditional directed evolution methods were compared for reducing (±)-tetrabenazine to dihydrotetrabenazine via kinetic resolution facilitated by BsSDR10, a short-chain dehydrogenase/reductase from Bacillus subtilis. Both methods successfully identified variants with significantly improved diastereoselectivity for each isomer of dihydrotetrabenazine. Furthermore, the preparation of (2S,3S,11bS)-dihydrotetrabenazine has been successfully scaled up, with an isolated yield of 40.7% and a diastereoselectivity of 91.3%.
Affiliation(s)
- Chenming Huang
  - Wuya College of Innovation, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
- Li Zhang
  - Wuya College of Innovation, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
- Tong Tang
  - Wuya College of Innovation, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
- Haijiao Wang
  - Wuya College of Innovation, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
- Yingqian Jiang
  - Wuya College of Innovation, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
- Hanwen Ren
  - Wuya College of Innovation, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
- Yitian Zhang
  - Wuya College of Innovation, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
- Jiali Fang
  - Wuya College of Innovation, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
- Wenhe Zhang
  - School of Life Sciences and Biopharmaceutical Sciences, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
- Xian Jia
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
- Song You
  - School of Life Sciences and Biopharmaceutical Sciences, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
- Bin Qin
  - Wuya College of Innovation, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe, Shenyang 110016, People’s Republic of China
7
Bernett J, Blumenthal DB, List M. Cracking the black box of deep sequence-based protein-protein interaction prediction. Brief Bioinform 2024; 25:bbae076. [PMID: 38446741 PMCID: PMC10939362 DOI: 10.1093/bib/bbae076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 01/09/2024] [Indexed: 03/08/2024] Open
Abstract
Identifying protein-protein interactions (PPIs) is crucial for deciphering biological pathways. Numerous prediction methods have been developed as cheap alternatives to biological experiments, reporting surprisingly high accuracy estimates. We systematically investigated how much reproducible deep learning models depend on data leakage, sequence similarities and node degree information, and compared them with basic machine learning models. We found that overlaps between training and test sets resulting from random splitting lead to strongly overestimated performances. In this setting, models learn solely from sequence similarities and node degrees. When data leakage is avoided by minimizing sequence similarities between training and test set, performances become random. Moreover, baseline models directly leveraging sequence similarity and network topology show good performances at a fraction of the computational cost. Thus, we advocate that any improvements should be reported relative to baseline methods in the future. Our findings suggest that predicting PPIs remains an unsolved task for proteins showing little sequence similarity to previously studied proteins, highlighting that further experimental research into the 'dark' protein interactome and better computational methods are needed.
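The train/test overlap that this abstract identifies as a source of overestimated performance can be checked with a few lines of Python. The sketch below is an illustrative diagnostic, not the authors' evaluation code: it only detects identical protein identifiers shared between the two sets, whereas the abstract also concerns sequence similarity, which would require an alignment or clustering tool. The function name `protein_overlap` and the toy pairs are hypothetical.

```python
def protein_overlap(train_pairs, test_pairs):
    """Fraction of test-set proteins that also occur in the training set.

    Each pair is a tuple of two protein identifiers. A value near 1.0 means
    almost every test protein was already seen during training.
    """
    train_proteins = {p for pair in train_pairs for p in pair}
    test_proteins = {p for pair in test_pairs for p in pair}
    if not test_proteins:
        return 0.0
    return len(train_proteins & test_proteins) / len(test_proteins)


# Toy interaction pairs: protein "A" appears on both sides of the split.
train = [("A", "B"), ("B", "C")]
test = [("A", "D")]
overlap = protein_overlap(train, test)
```

A random split of interaction pairs typically leaves many proteins on both sides, so a check of this kind gives a quick first signal before any similarity-aware splitting is attempted.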
Affiliation(s)
- Judith Bernett
  - Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Maximus-von-Imhof Forum 3, 85354, Freising, Germany
- David B Blumenthal
  - Biomedical Network Science Lab, Department Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-Universität Erlangen-Nürnberg, Werner-von-Siemens-Str. 61, 91052, Erlangen, Germany
- Markus List
  - Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Maximus-von-Imhof Forum 3, 85354, Freising, Germany
8
Avalos-Pacheco A, Ventz S, Arfè A, Alexander BM, Rahman R, Wen PY, Trippa L. Validation of Predictive Analyses for Interim Decisions in Clinical Trials. JCO Precis Oncol 2023; 7:e2200606. [PMID: 36848613 PMCID: PMC10166373 DOI: 10.1200/po.22.00606] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/31/2022] [Revised: 12/20/2022] [Accepted: 01/12/2023] [Indexed: 03/01/2023] Open
Abstract
PURPOSE: Adaptive clinical trials use algorithms to predict, during the study, patient outcomes and final study results. These predictions trigger interim decisions, such as early discontinuation of the trial, and can change the course of the study. Poor selection of the Prediction Analyses and Interim Decisions (PAID) plan in an adaptive clinical trial can have negative consequences, including the risk of exposing patients to ineffective or toxic treatments.
METHODS: We present an approach that leverages data sets from completed trials to evaluate and compare candidate PAIDs using interpretable validation metrics. The goal is to determine whether and how to incorporate predictions into major interim decisions in a clinical trial. Candidate PAIDs can differ in several aspects, such as the prediction models used, timing of interim analyses, and potential use of external data sets. To illustrate our approach, we considered a randomized clinical trial in glioblastoma. The study design includes interim futility analyses on the basis of the predictive probability that the final analysis, at the completion of the study, will provide significant evidence of treatment effects. We examined various PAIDs with different levels of complexity to investigate if the use of biomarkers, external data, or novel algorithms improved interim decisions in the glioblastoma clinical trial.
RESULTS: Validation analyses on the basis of completed trials and electronic health records support the selection of algorithms, predictive models, and other aspects of PAIDs for use in adaptive clinical trials. By contrast, PAID evaluations on the basis of arbitrarily defined ad hoc simulation scenarios, which are not tailored to previous clinical data and experience, tend to overvalue complex prediction procedures and produce poor estimates of trial operating characteristics such as power and the number of enrolled patients.
CONCLUSION: Validation analyses on the basis of completed trials and real world data support the selection of predictive models, interim analysis rules, and other aspects of PAIDs in future clinical trials.
Affiliation(s)
- Alejandra Avalos-Pacheco
  - Applied Statistics Research Unit, Faculty of Mathematics and Geoinformation, TU Wien, Vienna, Austria
  - Harvard-MIT Center for Regulatory Science, Harvard Medical School, Boston, MA
- Steffen Ventz
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN
- Andrea Arfè
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY
- Brian M. Alexander
  - Dana-Farber Cancer Institute, Boston, MA
  - Foundation Medicine, Cambridge, MA
- Rifaquat Rahman
  - Dana-Farber Cancer Institute, Boston, MA
  - Harvard Medical School, Boston, MA
- Patrick Y. Wen
  - Center for Neuro-Oncology, Dana-Farber Cancer Institute, Boston, MA
- Lorenzo Trippa
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA
  - Harvard T.H. Chan School of Public Health, Boston, MA
9
Rogers JR, Nikolényi G, AlQuraishi M. Growing ecosystem of deep learning methods for modeling protein-protein interactions. Protein Eng Des Sel 2023; 36:gzad023. [PMID: 38102755 DOI: 10.1093/protein/gzad023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 12/06/2023] [Accepted: 12/07/2023] [Indexed: 12/17/2023] Open
Abstract
Numerous cellular functions rely on protein-protein interactions. Efforts to comprehensively characterize them remain challenged however by the diversity of molecular recognition mechanisms employed within the proteome. Deep learning has emerged as a promising approach for tackling this problem by exploiting both experimental data and basic biophysical knowledge about protein interactions. Here, we review the growing ecosystem of deep learning methods for modeling protein interactions, highlighting the diversity of these biophysically informed models and their respective trade-offs. We discuss recent successes in using representation learning to capture complex features pertinent to predicting protein interactions and interaction sites, geometric deep learning to reason over protein structures and predict complex structures, and generative modeling to design de novo protein assemblies. We also outline some of the outstanding challenges and promising new directions. Opportunities abound to discover novel interactions, elucidate their physical mechanisms, and engineer binders to modulate their functions using deep learning and, ultimately, unravel how protein interactions orchestrate complex cellular behaviors.
Affiliation(s)
- Julia R Rogers
  - Department of Systems Biology, Columbia University, New York, NY 10032, USA
- Gergő Nikolényi
  - Department of Systems Biology, Columbia University, New York, NY 10032, USA
10
Kang Y, Xu Y, Wang X, Pu B, Yang X, Rao Y, Chen J. HN-PPISP: a hybrid network based on MLP-Mixer for protein-protein interaction site prediction. Brief Bioinform 2023; 24:6833645. [PMID: 36403092 DOI: 10.1093/bib/bbac480] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 07/05/2022] [Revised: 09/16/2022] [Accepted: 10/09/2022] [Indexed: 11/21/2022] Open
Abstract
MOTIVATION: Biological experimental approaches to protein-protein interaction (PPI) site prediction are critical for understanding the mechanisms of biochemical processes but are time-consuming and laborious. With the development of Deep Learning (DL) techniques, the most popular Convolutional Neural Networks (CNN)-based methods have been proposed to address these problems. Although significant progress has been made, these methods still have limitations in encoding the characteristics of each amino acid in protein sequences. Current methods cannot efficiently explore the nature of Position Specific Scoring Matrix (PSSM), secondary structure and raw protein sequences by processing them all together. For PPI site prediction, how to effectively model the PPI context with attention to prediction remains an open problem. In addition, the long-distance dependencies of PPI features are important, which is very challenging for many CNN-based methods because the innate ability of CNN is difficult to outperform auto-regressive models like Transformers.
RESULTS: To effectively mine the properties of PPI features, a novel hybrid neural network named HN-PPISP is proposed, which integrates a Multi-layer Perceptron Mixer (MLP-Mixer) module for local feature extraction and a two-stage multi-branch module for global feature capture. The model merits Transformer, TextCNN and Bi-LSTM as a powerful alternative for PPI site prediction. On the one hand, this is the first application of an advanced Transformer (i.e. MLP-Mixer) with a hybrid network for sequence-based PPI prediction. On the other hand, unlike existing methods that treat global features altogether, the proposed two-stage multi-branch hybrid module firstly assigns different attention scores to the input features and then encodes the feature through different branch modules. In the first stage, different improved attention modules are hybridized to extract features from the raw protein sequences, secondary structure and PSSM, respectively. In the second stage, a multi-branch network is designed to aggregate information from both branches in parallel. The two branches encode the features and extract dependencies through several operations such as TextCNN, Bi-LSTM and different activation functions. Experimental results on real-world public datasets show that our model consistently achieves state-of-the-art performance over seven remarkable baselines.
AVAILABILITY: The source code of HN-PPISP model is available at https://github.com/ylxu05/HN-PPISP.
Affiliation(s)
- Yan Kang
- National Pilot School of Software, Yunnan University, Kunming, 650091, P.R. China
| | - Yulong Xu
- National Pilot School of Software, Yunnan University, Kunming, 650091, P.R. China
| | - Xinchao Wang
- National Pilot School of Software, Yunnan University, Kunming, 650091, P.R. China
| | - Bin Pu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, P.R. China
| | - Xuekun Yang
- National Pilot School of Software, Yunnan University, Kunming, 650091, P.R. China
| | - Yulong Rao
- National Pilot School of Software, Yunnan University, Kunming, 650091, P.R. China
| | - Jianguo Chen
- School of Software Engineering, Sun Yat-Sen University, Zhuhai, 519082, P.R. China
|
11
|
Karpuzcu BA, Türk E, Ibrahim AH, Karabulut OC, Süzek BE. Machine Learning Methods for Virus-Host Protein-Protein Interaction Prediction. Methods Mol Biol 2023; 2690:401-417. [PMID: 37450162 DOI: 10.1007/978-1-0716-3327-4_31] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/18/2023]
Abstract
The attachment of a virion to its respective cellular receptor on the host organism, which occurs through virus-host protein-protein interactions (PPIs), is a decisive step for viral pathogenicity and infectivity. Therefore, a vast number of wet-lab experimental techniques are used to study virus-host PPIs. Given the great number and enormous variety of virus-host PPIs, as well as the cost and labor of laboratory work, however, computational approaches to analyzing the available interaction data and predicting previously unidentified interactions have been on the rise. Among them, machine-learning-based models are receiving increasing attention, with a great body of resources and tools proposed recently. In this chapter, we first provide the methodology with major steps toward the development of a virus-host PPI prediction tool. Next, we discuss the challenges involved and evaluate several existing machine-learning-based virus-host PPI prediction tools. Finally, we describe our experience with several ensemble techniques as utilized on available prediction results retrieved from individual PPI prediction tools. Overall, based on our experience, we recognize there is still room for the development of new individual and/or ensemble virus-host PPI prediction tools that leverage existing tools.
Affiliation(s)
- Betül Asiye Karpuzcu
- Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
| | - Erdem Türk
- Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
- Department of Computer Engineering, Faculty of Engineering, Muğla Sıtkı Koçman University, Muğla, Turkey
| | - Ahmad Hassan Ibrahim
- Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
| | - Onur Can Karabulut
- Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey
| | - Barış Ethem Süzek
- Bioinformatics Graduate Program, Graduate School of Natural and Applied Sciences, Muğla Sıtkı Koçman University, Muğla, Turkey.
- Department of Computer Engineering, Faculty of Engineering, Muğla Sıtkı Koçman University, Muğla, Turkey.
|
12
|
Guo Z, Yamaguchi R. Machine learning methods for protein-protein binding affinity prediction in protein design. Frontiers in Bioinformatics 2022; 2:1065703. [PMID: 36591334 PMCID: PMC9800603 DOI: 10.3389/fbinf.2022.1065703] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/10/2022] [Accepted: 12/01/2022] [Indexed: 12/23/2022] Open
Abstract
Protein-protein interactions govern a wide range of biological activity. A proper estimation of the protein-protein binding affinity is vital to design proteins with high specificity and binding affinity toward a target protein, which has a variety of applications including antibody design in immunotherapy, enzyme engineering for reaction optimization, and construction of biosensors. However, experimental and theoretical modelling methods are time-consuming, hinder the exploration of the entire protein space, and deter the identification of optimal proteins that meet the requirements of practical applications. In recent years, the rapid development in machine learning methods for protein-protein binding affinity prediction has revealed the potential of a paradigm shift in protein design. Here, we review the prediction methods and associated datasets and discuss the requirements and construction methods of binding affinity prediction models for protein design.
Affiliation(s)
- Zhongliang Guo
- Division of Cancer Systems Biology, Aichi Cancer Center Research Institute, Nagoya, Aichi, Japan
| | - Rui Yamaguchi
- Division of Cancer Systems Biology, Aichi Cancer Center Research Institute, Nagoya, Aichi, Japan; Division of Cancer Informatics, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan; *Correspondence: Rui Yamaguchi
|
13
|
Ilzhöfer D, Heinzinger M, Rost B. SETH predicts nuances of residue disorder from protein embeddings. Frontiers in Bioinformatics 2022; 2:1019597. [PMID: 36304335 PMCID: PMC9580958 DOI: 10.3389/fbinf.2022.1019597] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/15/2022] [Accepted: 09/20/2022] [Indexed: 11/07/2022] Open
Abstract
Predictions for millions of protein three-dimensional structures are only a few clicks away since the release of AlphaFold2 results for UniProt. However, many proteins have so-called intrinsically disordered regions (IDRs) that do not adopt unique structures in isolation. These IDRs are associated with several diseases, including Alzheimer's Disease. We showed that three recent disorder measures of AlphaFold2 predictions (pLDDT, "experimentally resolved" prediction and "relative solvent accessibility") correlated to some extent with IDRs. However, expert methods predict IDRs more reliably by combining complex machine learning models with expert-crafted input features and evolutionary information from multiple sequence alignments (MSAs). MSAs are not always available, especially for IDRs, and are computationally expensive to generate, limiting the scalability of the associated tools. Here, we present the novel method SETH that predicts residue disorder from embeddings generated by the protein Language Model ProtT5, which explicitly only uses single sequences as input. Thereby, our method, relying on a relatively shallow convolutional neural network, outperformed much more complex solutions while being much faster, making it possible to create predictions for the human proteome in about 1 hour on a consumer-grade PC with one NVIDIA GeForce RTX 3060. Trained on a continuous disorder scale (CheZOD scores), our method captured subtle variations in disorder, thereby providing important information beyond the binary classification of most methods. High performance paired with speed revealed that SETH's nuanced disorder predictions for entire proteomes capture aspects of the evolution of organisms. Additionally, SETH could also be used to filter out regions or proteins with probable low-quality AlphaFold2 3D structures to prioritize running the compute-intensive predictions for large data sets. SETH is freely publicly available at: https://github.com/Rostlab/SETH.
Affiliation(s)
- Dagmar Ilzhöfer
- Faculty of Informatics, TUM (Technical University of Munich), Munich, Germany
| | - Michael Heinzinger
- Faculty of Informatics, TUM (Technical University of Munich), Munich, Germany
- Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), TUM Graduate School, Garching, Germany
| | - Burkhard Rost
- Faculty of Informatics, TUM (Technical University of Munich), Munich, Germany
- Institute for Advanced Study (TUM-IAS), TUM (Technical University of Munich), Garching, Germany
- TUM School of Life Sciences Weihenstephan (WZW), TUM (Technical University of Munich), Freising, Germany
|
14
|
Canzler S, Fischer M, Ulbricht D, Ristic N, Hildebrand PW, Staritzbichler R. ProteinPrompt: a webserver for predicting protein-protein interactions. Bioinformatics Advances 2022; 2:vbac059. [PMID: 36699419 PMCID: PMC9710678 DOI: 10.1093/bioadv/vbac059] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/21/2021] [Revised: 07/19/2022] [Accepted: 08/14/2022] [Indexed: 01/28/2023]
Abstract
Motivation Protein-protein interactions (PPIs) play an essential role in a great variety of cellular processes and are therefore of significant interest for the design of new therapeutic compounds as well as the identification of side effects due to unexpected binding. Here, we present ProteinPrompt, a webserver that uses machine learning algorithms to calculate specific, currently unknown PPIs. Our tool is designed to quickly and reliably predict contact propensities based on an input sequence in order to scan large sequence libraries for potential binding partners, with the goal of accelerating and assuring the quality of the laborious process of drug target identification. Results We collected and thoroughly filtered a comprehensive database of known binders from several sources, which is available for download. ProteinPrompt provides two complementary search methods of similar accuracy for comparison and consensus building. The default method is a random forest (RF) algorithm that uses the auto-correlations of seven amino acid scales. Alternatively, a graph neural network (GNN) implementation can be selected. Additionally, a consensus prediction is available. For each query sequence, potential binding partners are identified from a protein sequence database. The proteomes of several organisms are available and can be searched for binders. To evaluate the predictive power of the algorithms, we prepared a test dataset that was rigorously filtered for redundancy. No sequence pairs similar to the ones used for training were included in this dataset. With this challenging dataset, the RF method achieved an accuracy rate of 0.88 and an area under the curve of 0.95. The GNN achieved an accuracy rate of 0.86 using the same dataset. Since the underlying learning approaches are unrelated, comparing the results of RF and GNNs reduces the likelihood of errors. The consensus reached an accuracy of 0.89.
Availability and implementation ProteinPrompt is available online at: http://proteinformatics.org/ProteinPrompt, where training and test data used to optimize the methods are also available. The server makes it possible to scan the human proteome for potential binding partners of an input sequence within minutes. For local offline usage, we furthermore created a ProteinPrompt Docker image which allows for batch submission: https://gitlab.hzdr.de/proteinprompt/ProteinPrompt. In conclusion, we offer a fast, accurate, easy-to-use online service for predicting binding partners from an input sequence.
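The redundancy filtering of the test set described in the Results can be illustrated with a minimal sketch. It is not the ProteinPrompt pipeline: the similarity measure (`difflib.SequenceMatcher`), the threshold of 0.4, the function names and the toy sequences are all assumptions chosen for illustration, and a real workflow would rely on a proper sequence alignment or clustering tool.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (assumption: real pipelines use alignment tools)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pair_is_redundant(test_pair, train_pairs, threshold=0.4):
    """A test pair is redundant if both partners resemble the partners of one training pair."""
    a, b = test_pair
    for c, d in train_pairs:
        same_order = similarity(a, c) >= threshold and similarity(b, d) >= threshold
        swapped = similarity(a, d) >= threshold and similarity(b, c) >= threshold
        if same_order or swapped:
            return True
    return False


def filter_test_pairs(test_pairs, train_pairs, threshold=0.4):
    """Keep only test pairs that are not similar to any training pair."""
    return [p for p in test_pairs if not pair_is_redundant(p, train_pairs, threshold)]


# Toy data (hypothetical sequences, not taken from the paper).
train_pairs = [("MKTAYIAKQR", "GSHMLEDPVA")]
test_pairs = [
    ("MKTAYIAKQR", "GSHMLEDPVA"),  # identical to a training pair
    ("GSHMLEDPVA", "MKTAYIAKQK"),  # near-identical, partners swapped
    ("WWWWWWWWWW", "CCCCCCCCCC"),  # unrelated to the training pair
]
kept = filter_test_pairs(test_pairs, train_pairs)
print(kept)
```

Checking both partner orders matters because a PPI pair is unordered; only the unrelated pair survives the filter in this toy example.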
Affiliation(s)
| | | | - David Ulbricht
- Institute of Medical Physics and Biophysics, University of Leipzig, 04107 Leipzig, Germany
| | - Nikola Ristic
- Institute of Medical Physics and Biophysics, University of Leipzig, 04107 Leipzig, Germany
| | - Peter W Hildebrand
- Institute of Medical Physics and Biophysics, University of Leipzig, 04107 Leipzig, Germany; Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Medical Physics and Biophysics, 10117 Berlin, Germany; Berlin Institute of Health at Charité – Universitätsmedizin Berlin, 10117 Berlin, Germany
|
15
|
Li S, Wu S, Wang L, Li F, Jiang H, Bai F. Recent advances in predicting protein-protein interactions with the aid of artificial intelligence algorithms. Curr Opin Struct Biol 2022; 73:102344. [PMID: 35219216 DOI: 10.1016/j.sbi.2022.102344] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 08/20/2021] [Revised: 01/02/2022] [Accepted: 01/17/2022] [Indexed: 12/15/2022]
Abstract
Protein-protein interactions (PPIs) are essential in the regulation of biological functions and cell events; understanding PPIs has therefore become key to understanding molecular mechanisms and to drug design. Here we highlight the major developments in computational methods for predicting PPIs using different types of artificial intelligence algorithms. The first part introduces the sources of experimental PPI data. The second part is devoted to PPI prediction methods based on sequence information. The third part covers representative methods using structural information as the input feature. The last part covers methods designed by combining different types of features. For each part, the state-of-the-art computational PPI prediction methods are reviewed in an inclusive view. Finally, we discuss the flaws existing in this area and future directions of next-generation algorithms.
Affiliation(s)
- Shiwei Li
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China
| | - Sanan Wu
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China
| | - Lin Wang
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China
| | - Fenglei Li
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China; School of Information Science and Technology, ShanghaiTech University, Shanghai, China
| | - Hualiang Jiang
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China; Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Pudong, Shanghai, 201203, China
| | - Fang Bai
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, China; School of Information Science and Technology, ShanghaiTech University, Shanghai, China.
|
16
|
Mahbub S, Bayzid MS. EGRET: edge aggregated graph attention networks and transfer learning improve protein-protein interaction site prediction. Brief Bioinform 2022; 23:6518045. [PMID: 35106547 DOI: 10.1093/bib/bbab578] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 06/13/2021] [Revised: 11/25/2021] [Accepted: 12/16/2021] [Indexed: 12/18/2022] Open
Abstract
MOTIVATION Protein-protein interactions (PPIs) are central to most biological processes. However, reliable identification of PPI sites using conventional experimental methods is slow and expensive. Therefore, great efforts are being put into computational methods to identify PPI sites. RESULTS We present Edge Aggregated GRaph Attention NETwork (EGRET), a highly accurate deep learning-based method for PPI site prediction, where we have used an edge aggregated graph attention network to effectively leverage the structural information. We, for the first time, have used transfer learning in PPI site prediction. Our proposed edge aggregated network, together with transfer learning, has achieved notable improvement over the best alternate methods. Furthermore, we systematically investigated EGRET's network behavior to provide insights about the causes of its decisions. AVAILABILITY EGRET is freely available as an open source project at https://github.com/Sazan-Mahbub/EGRET. CONTACT shams_bayzid@cse.buet.ac.bd.
Affiliation(s)
- Sazan Mahbub
- Department of Computer Science, University of Maryland, College Park, Maryland 20742, USA
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
|
17
|
Robust and accurate prediction of protein-protein interactions by exploiting evolutionary information. Sci Rep 2021; 11:16910. [PMID: 34413375 PMCID: PMC8376940 DOI: 10.1038/s41598-021-96265-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/13/2020] [Accepted: 04/15/2021] [Indexed: 02/07/2023] Open
Abstract
Various biochemical functions of organisms are performed by protein-protein interactions (PPIs). Therefore, recognition of protein-protein interactions is very important for understanding most life activities, such as DNA replication and transcription, protein synthesis and secretion, signal transduction and metabolism. Although high-throughput technology makes it possible to generate large-scale PPI data, it is costly in both time and labor and carries a risk of a high false positive rate. To find a better solution, the biology community is looking for computational methods to quickly and efficiently discover massive protein interaction data. In this paper, we propose a computational method for predicting PPIs based on a fresh idea of combining orthogonal locality preserving projections (OLPP) and rotation forest (RoF) models, using protein sequence information. Specifically, the protein sequence is first converted into position-specific scoring matrices (PSSMs) containing protein evolutionary information by using the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST). Then we characterize a protein as a fixed-length feature vector by applying OLPP to PSSMs. Finally, we train an RoF classifier to distinguish interacting from non-interacting protein pairs. The proposed method yielded significantly better results than existing methods, with 90.07% and 96.09% prediction accuracy on the Yeast and Human datasets, respectively. Our experiments show that the proposed method can serve as a useful tool to accelerate the process of solving key problems in proteomics.
|
18
|
Schlick T, Portillo-Ledesma S, Myers CG, Beljak L, Chen J, Dakhel S, Darling D, Ghosh S, Hall J, Jan M, Liang E, Saju S, Vohr M, Wu C, Xu Y, Xue E. Biomolecular Modeling and Simulation: A Prospering Multidisciplinary Field. Annu Rev Biophys 2021; 50:267-301. [PMID: 33606945 PMCID: PMC8105287 DOI: 10.1146/annurev-biophys-091720-102019] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 12/13/2022]
Abstract
We reassess progress in the field of biomolecular modeling and simulation, following up on our perspective published in 2011. By reviewing metrics for the field's productivity and providing examples of success, we underscore the productive phase of the field, whose short-term expectations were overestimated and long-term effects underestimated. Such successes include prediction of structures and mechanisms; generation of new insights into biomolecular activity; and thriving collaborations between modeling and experimentation, including experiments driven by modeling. We also discuss the impact of field exercises and web games on the field's progress. Overall, we note tremendous success by the biomolecular modeling community in utilization of computer power; improvement in force fields; and development and application of new algorithms, notably machine learning and artificial intelligence. The combined advances are enhancing the accuracy and scope of modeling and simulation, establishing an exemplary discipline where experiment and theory or simulations are full partners.
Affiliation(s)
- Tamar Schlick
- Department of Chemistry, New York University, New York, New York 10003, USA
- Courant Institute of Mathematical Sciences, New York University, New York, New York 10012, USA
- New York University-East China Normal University Center for Computational Chemistry, New York University Shanghai, Shanghai 200122, China
| | | | - Christopher G Myers
- Department of Chemistry, New York University, New York, New York 10003, USA
| | - Lauren Beljak
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Justin Chen
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Sami Dakhel
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Daniel Darling
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Sayak Ghosh
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Joseph Hall
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Mikaeel Jan
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Emily Liang
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Sera Saju
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Mackenzie Vohr
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Chris Wu
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Yifan Xu
- College of Arts and Science, New York University, New York, New York 10003, USA
| | - Eva Xue
- College of Arts and Science, New York University, New York, New York 10003, USA
|
19
|
Systematic auditing is essential to debiasing machine learning in biology. Commun Biol 2021; 4:183. [PMID: 33568741 PMCID: PMC7876113 DOI: 10.1038/s42003-021-01674-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 09/01/2020] [Accepted: 11/12/2020] [Indexed: 12/20/2022] Open
Abstract
Biases in data used to train machine learning (ML) models can inflate their prediction performance and confound our understanding of how and what they learn. Although biases are common in biological data, systematic auditing of ML models to identify and eliminate these biases is not a common practice when applying ML in the life sciences. Here we devise a systematic, principled, and general approach to audit ML models in the life sciences. We use this auditing framework to examine biases in three ML applications of therapeutic interest and identify unrecognized biases that hinder the ML process and result in substantially reduced model performance on new datasets. Ultimately, we show that ML models tend to learn primarily from data biases when there is insufficient signal in the data to learn from. We provide detailed protocols, guidelines, and examples of code to enable tailoring of the auditing framework to other biomedical applications. Fatma-Elzahraa Eid et al. illustrate a principled approach for identifying biases that can inflate the performance of biological machine learning models. When applied to three biomedical prediction problems, they identify previously unrecognized biases and ultimately show that models are likely to learn primarily from data biases when there is insufficient learnable signal in the data.
|
20
|
Poot Velez AH, Fontove F, Del Rio G. Protein-Protein Interactions Efficiently Modeled by Residue Cluster Classes. Int J Mol Sci 2020; 21:E4787. [PMID: 32640745 PMCID: PMC7370293 DOI: 10.3390/ijms21134787] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/19/2020] [Revised: 06/20/2020] [Accepted: 06/28/2020] [Indexed: 01/22/2023] Open
Abstract
Predicting protein-protein interactions (PPI) represents an important challenge in structural bioinformatics. Current computational methods display different degrees of accuracy when predicting these interactions. Different factors were proposed to help improve these predictions, including choosing the proper descriptors of proteins to represent these interactions, among others. In the current work, we provide a representative protein structure that is amenable to PPI classification using machine learning approaches, referred to as residue cluster classes. Through sampling and optimization, we identified the best algorithm-parameter pair to classify PPI from more than 360 different training sets. We tested these classifiers against PPI datasets that were not included in the training set but shared sequence similarity with proteins in the training set to reproduce the situation of most proteins sharing sequence similarity with others. We identified a model with almost no PPI error (96-99% of correctly classified instances) and showed that residue cluster classes of protein pairs displayed a distinct pattern between positive and negative protein interactions. Our results indicated that residue cluster classes are structural features relevant to model PPI and provide a novel tool to mathematically model the protein structure/function relationship.
Affiliation(s)
- Albros Hermes Poot Velez
- Department of Biochemistry and Structural Biology, Instituto de Fisiología Celular, UNAM, Mexico City 04510, Mexico
| | | | - Gabriel Del Rio
- Department of Biochemistry and Structural Biology, Instituto de Fisiología Celular, UNAM, Mexico City 04510, Mexico
|
21
|
Gemovic B, Sumonja N, Davidovic R, Perovic V, Veljkovic N. Mapping of Protein-Protein Interactions: Web-Based Resources for Revealing Interactomes. Curr Med Chem 2019; 26:3890-3910. [PMID: 29446725 DOI: 10.2174/0929867325666180214113704] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 04/26/2017] [Revised: 09/14/2017] [Accepted: 01/29/2018] [Indexed: 01/04/2023]
Abstract
BACKGROUND The significant number of protein-protein interactions (PPIs) discovered by harnessing concomitant advances in the fields of sequencing, crystallography, spectrometry and two-hybrid screening suggests astonishing prospects for remodelling drug discovery. The PPI space, which includes up to 650 000 entities, is a remarkable reservoir of potential therapeutic targets for every human disease. In order to allow modern drug discovery programs to leverage this, we should be able to discern complete PPI maps associated with a specific disorder and corresponding normal physiology. OBJECTIVE Here, we review community-available computational programs for predicting PPIs and web-based resources for storing experimentally annotated interactions. METHODS We compared the capacities of the prediction tools iLoops, Struct2Net, HOMCOS, COTH, PrePPI, InterPreTS and PRISM to predict recently discovered protein interactions. RESULTS We described sequence-based and structure-based PPI prediction tools and addressed their peculiarities. Additionally, since the usefulness of prediction algorithms critically depends on the quality and quantity of the experimental data they are built on, we extensively discussed community resources for protein interactions. We focused on the active and recently updated primary and secondary PPI databases, repositories specialized to the subject or species, as well as databases that include both experimental and predicted PPIs. CONCLUSION PPI complexes are the basis of important physiological processes and are therefore possible targets for cell-penetrating ligands. Reliable computational PPI predictions can speed up new target discoveries through prioritization of therapeutically relevant protein-protein complexes for experimental studies.
Affiliation(s)
- Branislava Gemovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Neven Sumonja
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Radoslav Davidovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Vladimir Perovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Nevena Veljkovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
|
22
|
Scheibenreif L, Littmann M, Orengo C, Rost B. FunFam protein families improve residue level molecular function prediction. BMC Bioinformatics 2019; 20:400. [PMID: 31319797 PMCID: PMC6639920 DOI: 10.1186/s12859-019-2988-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 04/18/2019] [Accepted: 07/09/2019] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND The CATH database provides a hierarchical classification of protein domain structures including a sub-classification of superfamilies into functional families (FunFams). We analyzed the similarity of binding site annotations in these FunFams and incorporated FunFams into the prediction of protein binding residues. RESULTS FunFam members agreed, on average, in 36.9 ± 0.6% of their binding residue annotations. This constituted a 6.7-fold increase over randomly grouped proteins and a 1.2-fold increase (1.1-fold on the same dataset) over proteins with the same enzymatic function (identical Enzyme Commission, EC, number). Mapping de novo binding residue prediction methods (BindPredict-CCS, BindPredict-CC) onto FunFam resulted in consensus predictions for those residues that were aligned and predicted alike (binding/non-binding) within a FunFam. This simple consensus increased the F1-score (for binding) 1.5-fold over the original prediction method. Variation of the threshold for how many proteins in the consensus prediction had to agree provided a convenient control of accuracy/precision and coverage/recall, e.g. reaching a precision as high as 60.8 ± 0.4% for a stringent threshold. CONCLUSIONS The FunFams outperformed even the carefully curated EC numbers in terms of agreement of binding site residues. Additionally, we assume that our proof-of-principle through the prediction of protein binding residues will be relevant for many other solutions profiting from FunFams to infer functional information at the residue level.
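The consensus step and its agreement threshold can be sketched as a simple vote over aligned residue columns. This is a simplified illustration, not the published method: the function names, the fractional `min_agreement` parameter and the toy predictions are assumptions, and the alignment itself is taken as given.

```python
def consensus_binding(predictions, min_agreement=0.5):
    """Vote per aligned column: 1 (binding) if enough family members predict binding.

    predictions: list of equal-length 0/1 lists, one per family member (hypothetical input format).
    """
    if not predictions:
        return []
    n_members = len(predictions)
    n_columns = len(predictions[0])
    consensus = []
    for col in range(n_columns):
        votes = sum(member[col] for member in predictions)
        consensus.append(1 if votes / n_members >= min_agreement else 0)
    return consensus


def precision_recall(predicted, truth):
    """Precision and recall of the binding class (label 1)."""
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy family of three aligned members and made-up annotations.
family_predictions = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
truth = [1, 1, 0, 0, 0]

lenient = consensus_binding(family_predictions, min_agreement=0.3)
strict = consensus_binding(family_predictions, min_agreement=1.0)
print(lenient, precision_recall(lenient, truth))
print(strict, precision_recall(strict, truth))
```

In this toy case, raising the threshold trades recall for precision, which is the kind of control over accuracy and coverage that the abstract describes.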
Affiliation(s)
- Linus Scheibenreif
- Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany.
| | - Maria Littmann
- Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany.
| | - Christine Orengo
- Department of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Burkhard Rost
- Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany
- TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
- Department of Biochemistry and Molecular Biophysics & New York Consortium on Membrane Protein Structure (NYCOMPS), Columbia University, 701 West, 168th Street, New York, NY 10032, USA
| |
Collapse
|
23
|
Wang X, Yu B, Ma A, Chen C, Liu B, Ma Q. Protein-protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique. Bioinformatics 2019; 35:2395-2402. [PMID: 30520961 PMCID: PMC6612859 DOI: 10.1093/bioinformatics/bty995] [Citation(s) in RCA: 93] [Impact Index Per Article: 15.5] [Received: 06/24/2018] [Revised: 11/19/2018] [Accepted: 12/03/2018] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The prediction of protein-protein interaction (PPI) sites is a key to mutation design, catalytic reaction and the reconstruction of PPI networks. It is a challenging task considering the significant abundant sequences and the imbalance issue in samples. RESULTS A new ensemble learning-based method, Ensemble Learning of synthetic minority oversampling technique (SMOTE) for Unbalancing samples and RF algorithm (EL-SMURF), was proposed for PPI sites prediction in this study. The sequence profile feature and the residue evolution rates were combined for feature extraction of neighboring residues using a sliding window, and the SMOTE was applied to oversample interface residues in the feature space for the imbalance problem. The Multi-dimensional Scaling feature selection method was implemented to reduce feature redundancy and subset selection. Finally, the Random Forest classifiers were applied to build the ensemble learning model, and the optimal feature vectors were inserted into EL-SMURF to predict PPI sites. The performance validation of EL-SMURF on two independent validation datasets showed 77.1% and 77.7% accuracy, which were 6.2-15.7% and 6.1-18.9% higher than the other existing tools, respectively. AVAILABILITY AND IMPLEMENTATION The source codes and data used in this study are publicly available at http://github.com/QUST-AIBBDRC/EL-SMURF/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaoying Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China
  - School of Mathematics, Shandong University, Jinan, China
  - Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, China
- Bin Yu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China
  - Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, China
  - School of Life Sciences, University of Science and Technology of China, Hefei, China
- Anjun Ma
  - Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
  - Department Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
- Cheng Chen
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China
  - Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, China
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, China
- Qin Ma
  - Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
  - Department Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA

24
|
Sumonja N, Gemovic B, Veljkovic N, Perovic V. Automated feature engineering improves prediction of protein-protein interactions. Amino Acids 2019; 51:1187-1200. [PMID: 31278492 DOI: 10.1007/s00726-019-02756-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 01/05/2019] [Accepted: 06/26/2019] [Indexed: 10/26/2022]
Abstract
Over the last decade, various machine learning (ML) and statistical approaches for protein-protein interaction (PPI) predictions have been developed to help annotating functional interactions among proteins, essential for our system-level understanding of life. Efficient ML approaches require informative and non-redundant features. In this paper, we introduce novel types of expert-crafted sequence, evolutionary and graph features and apply automatic feature engineering to further expand feature space to improve predictive modeling. The two-step automatic feature-engineering process encompasses the hybrid method for feature generation and unsupervised feature selection, followed by supervised feature selection through a genetic algorithm (GA). The optimization of both steps allows the feature-engineering procedure to operate on a large transformed feature space with no considerable computational cost and to efficiently provide newly engineered features. Based on GA and correlation filtering, we developed a stacking algorithm GA-STACK for automatic ensembling of different ML algorithms to improve prediction performance. We introduced a unified method, HP-GAS, for the prediction of human PPIs, which incorporates GA-STACK and rests on both expert-crafted and 40% of newly engineered features. The extensive cross validation and comparison with the state-of-the-art methods showed that HP-GAS represents currently the most efficient method for proteome-wide forecasting of protein interactions, with prediction efficacy of 0.93 AUC and 0.85 accuracy. We implemented the HP-GAS method as a free standalone application which is a time-efficient and easy-to-use tool. HP-GAS software with supplementary data can be downloaded from: http://www.vinca.rs/180/tools/HP-GAS.php .
Affiliation(s)
- Neven Sumonja
  - Laboratory for Bioinformatics and Computational Chemistry, Vinca Institute of Nuclear Sciences, University of Belgrade, Mike Petrovica Alasa 12-14, Vinca, Belgrade, 11351, Serbia
- Branislava Gemovic
  - Laboratory for Bioinformatics and Computational Chemistry, Vinca Institute of Nuclear Sciences, University of Belgrade, Mike Petrovica Alasa 12-14, Vinca, Belgrade, 11351, Serbia
- Nevena Veljkovic
  - Laboratory for Bioinformatics and Computational Chemistry, Vinca Institute of Nuclear Sciences, University of Belgrade, Mike Petrovica Alasa 12-14, Vinca, Belgrade, 11351, Serbia
- Vladimir Perovic
  - Laboratory for Bioinformatics and Computational Chemistry, Vinca Institute of Nuclear Sciences, University of Belgrade, Mike Petrovica Alasa 12-14, Vinca, Belgrade, 11351, Serbia.

25
|
Kamal H, Minhas FUAA, Farooq M, Tripathi D, Hamza M, Mustafa R, Khan MZ, Mansoor S, Pappu HR, Amin I. In silico Prediction and Validations of Domains Involved in Gossypium hirsutum SnRK1 Protein Interaction With Cotton Leaf Curl Multan Betasatellite Encoded βC1. Front Plant Sci 2019; 10:656. [PMID: 31191577 PMCID: PMC6546731 DOI: 10.3389/fpls.2019.00656] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 12/27/2018] [Accepted: 05/01/2019] [Indexed: 05/19/2023]
Abstract
Cotton leaf curl disease (CLCuD) caused by viruses of genus Begomovirus is a major constraint to cotton (Gossypium hirsutum) production in many cotton-growing regions of the world. Symptoms of the disease are caused by Cotton leaf curl Multan betasatellite (CLCuMB) that encodes a pathogenicity determinant protein, βC1. Here, we report the identification of interacting regions in βC1 protein by using computational approaches including sequence recognition, and binding site and interface prediction methods. We show the domain-level interactions based on the structural analysis of G. hirsutum SnRK1 protein and its domains with CLCuMB-βC1. To verify and validate the in silico predictions, three different experimental approaches, yeast two hybrid, bimolecular fluorescence complementation and pull down assay were used. Our results showed that ubiquitin-associated domain (UBA) and autoinhibitory sequence (AIS) domains of G. hirsutum-encoded SnRK1 are involved in CLCuMB-βC1 interaction. This is the first comprehensive investigation that combined in silico interaction prediction followed by experimental validation of interaction between CLCuMB-βC1 and a host protein. We demonstrated that data from computational biology could provide binding site information between CLCuD-associated viruses/satellites and new hosts that lack known binding site information for protein-protein interaction studies. Implications of these findings are discussed.
Affiliation(s)
- Hira Kamal
  - National Institute for Biotechnology and Genetic Engineering, Faisalabad, Pakistan
  - Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan
  - Department of Plant Pathology, Washington State University, Pullman, WA, United States
- Muhammad Farooq
  - National Institute for Biotechnology and Genetic Engineering, Faisalabad, Pakistan
- Diwaker Tripathi
  - Department of Biology, University of Washington, Seattle, WA, United States
- Muhammad Hamza
  - National Institute for Biotechnology and Genetic Engineering, Faisalabad, Pakistan
- Roma Mustafa
  - National Institute for Biotechnology and Genetic Engineering, Faisalabad, Pakistan
- Muhammad Zuhaib Khan
  - National Institute for Biotechnology and Genetic Engineering, Faisalabad, Pakistan
- Shahid Mansoor
  - National Institute for Biotechnology and Genetic Engineering, Faisalabad, Pakistan
- Hanu R. Pappu
  - Department of Plant Pathology, Washington State University, Pullman, WA, United States
- Imran Amin
  - National Institute for Biotechnology and Genetic Engineering, Faisalabad, Pakistan

26
|
Pfeiffenberger E, Bates PA. Predicting improved protein conformations with a temporal deep recurrent neural network. PLoS One 2018; 13:e0202652. [PMID: 30180164 PMCID: PMC6122789 DOI: 10.1371/journal.pone.0202652] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 04/10/2018] [Accepted: 08/07/2018] [Indexed: 02/03/2023] Open
Abstract
Accurate protein structure prediction from amino acid sequence is still an unsolved problem. The most reliable methods centre on template based modelling. However, the accuracy of these models entirely depends on the availability of experimentally resolved homologous template structures. In order to generate more accurate models, extensive physics based molecular dynamics (MD) refinement simulations are performed to sample many different conformations to find improved conformational states. In this study, we propose a deep recurrent network model, called DeepTrajectory, that is able to identify these improved conformational states, with high precision, from a variety of different MD based sampling protocols. The proposed model learns the temporal patterns of features computed from MD trajectory data in order to classify whether each recorded simulation snapshot is an improved quality conformational state, decreased quality conformational state or whether there is no perceivable change in state with respect to the starting conformation. The model was trained and tested on 904 trajectories from 42 different protein systems with a cumulative number of more than 1.7 million snapshots. We show that our model outperforms other state of the art machine-learning algorithms that do not consider temporal dependencies. To our knowledge, DeepTrajectory is the first implementation of a time-dependent deep-learning protocol that is re-trainable and able to adapt to any new MD based sampling procedure, thereby demonstrating how a neural network can be used to learn the latter part of the protein folding funnel.
Affiliation(s)
- Erik Pfeiffenberger
  - Biomolecular Modelling Laboratory, The Francis Crick Institute, 1 Midland Road, London NW1 1AT, United Kingdom
- Paul A. Bates
  - Biomolecular Modelling Laboratory, The Francis Crick Institute, 1 Midland Road, London NW1 1AT, United Kingdom

27
|
Tran L, Hamp T, Rost B. ProfPPIdb: Pairs of physical protein-protein interactions predicted for entire proteomes. PLoS One 2018; 13:e0199988. [PMID: 30020956 PMCID: PMC6051629 DOI: 10.1371/journal.pone.0199988] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 01/08/2018] [Accepted: 06/17/2018] [Indexed: 01/05/2023] Open
Abstract
MOTIVATION Protein-protein interactions (PPIs) play a key role in many cellular processes. Most annotations of PPIs mix experimental and computational data. The mix optimizes coverage, but obfuscates the annotation origin. Some resources excel at focusing on reliable experimental data. Here, we focused on new pairs of interacting proteins for several model organisms based solely on sequence-based prediction methods. RESULTS We extracted reliable experimental data about which proteins interact (binary) for eight diverse model organisms from public databases, namely from Escherichia coli, Schizosaccharomyces pombe, Plasmodium falciparum, Drosophila melanogaster, Caenorhabditis elegans, Mus musculus, Rattus norvegicus, Arabidopsis thaliana, and for the previously used Homo sapiens and Saccharomyces cerevisiae. Those data were the base to develop a PPI prediction method for each model organism. The method used evolutionary information through a profile-kernel Support Vector Machine (SVM). With the resulting eight models, we predicted all possible protein pairs in each organism and made the top predictions available through a web application. Almost all of the PPIs made available were predicted between proteins that have not been observed in any interaction, in particular for less well-studied organisms. Thus, our work complements existing resources and is particularly helpful for designing experiments because of its uniqueness. Experimental annotations and computational predictions are strongly influenced by the fact that some proteins have many partners and others few. To optimize machine learning, recent methods explicitly ignored such a network-structure and rely either on domain knowledge or sequence-only methods. Our approach is independent of domain-knowledge and leverages evolutionary information. The database interface representing our results is accessible from https://rostlab.org/services/ppipair/. 
The data can also be downloaded from https://figshare.com/collections/ProfPPI-DB/4141784.
Affiliation(s)
- Linh Tran
  - Imperial College London (ICL), Department of Computing, United Kingdom
  - Technical University of Munich (TUM), Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr, Germany
- Tobias Hamp
  - Technical University of Munich (TUM), Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr, Germany
- Burkhard Rost
  - Technical University of Munich (TUM), Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr, Germany
  - Technical University of Munich (TUM), Institute for Advanced Study (TUM-IAS), Lichtenbergstr, Germany

28
|
Perovic V, Sumonja N, Marsh LA, Radovanovic S, Vukicevic M, Roberts SGE, Veljkovic N. IDPpi: Protein-Protein Interaction Analyses of Human Intrinsically Disordered Proteins. Sci Rep 2018; 8:10563. [PMID: 30002402 PMCID: PMC6043496 DOI: 10.1038/s41598-018-28815-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 03/11/2018] [Accepted: 06/28/2018] [Indexed: 01/04/2023] Open
Abstract
Intrinsically disordered proteins (IDPs) are characterized by the lack of a fixed tertiary structure and are involved in the regulation of key biological processes via binding to multiple protein partners. IDPs are malleable, adapting to structurally different partners, and this flexibility stems from features encoded in the primary structure. The assumption that universal sequence information will facilitate coverage of the sparse zones of the human interactome motivated us to explore the possibility of predicting protein-protein interactions (PPIs) that involve IDPs based on sequence characteristics. We developed a method that relies on features of the interacting and non-interacting protein pairs and utilizes machine learning to classify and predict IDP PPIs. Consideration of both sequence determinants specific for conformational organizations and the multiplicity of IDP interactions in the training phase ensured a reliable approach that is superior to current state-of-the-art methods. By applying a strict evaluation procedure, we confirm that our method predicts interactions of the IDP of interest even on the proteome-scale. This service is provided as a web tool to expedite the discovery of new interactions and IDP functions with enhanced efficiency.
Affiliation(s)
- Vladimir Perovic
  - Centre for Multidisciplinary Research and Engineering, Vinca Institute of Nuclear Sciences, University of Belgrade, Belgrade, Serbia
- Neven Sumonja
  - Centre for Multidisciplinary Research and Engineering, Vinca Institute of Nuclear Sciences, University of Belgrade, Belgrade, Serbia
- Lindsey A Marsh
  - School of Cellular and Molecular Medicine, University of Bristol, Bristol, UK
- Sandro Radovanovic
  - Centre for business decision making, Faculty of organizational Sciences, University of Belgrade, Belgrade, Serbia
- Milan Vukicevic
  - Centre for business decision making, Faculty of organizational Sciences, University of Belgrade, Belgrade, Serbia
- Stefan G E Roberts
  - School of Cellular and Molecular Medicine, University of Bristol, Bristol, UK
- Nevena Veljkovic
  - Centre for Multidisciplinary Research and Engineering, Vinca Institute of Nuclear Sciences, University of Belgrade, Belgrade, Serbia.

29
|
Peeken JC, Bernhofer M, Wiestler B, Goldberg T, Cremers D, Rost B, Wilkens JJ, Combs SE, Nüsslin F. Radiomics in radiooncology - Challenging the medical physicist. Phys Med 2018; 48:27-36. [PMID: 29728226 DOI: 10.1016/j.ejmp.2018.03.012] [Citation(s) in RCA: 56] [Impact Index Per Article: 8.0] [Received: 01/11/2018] [Revised: 03/07/2018] [Accepted: 03/20/2018] [Indexed: 02/06/2023] Open
Abstract
PURPOSE Noticing the fast growing translation of artificial intelligence (AI) technologies to medical image analysis this paper emphasizes the future role of the medical physicist in this evolving field. Specific challenges are addressed when implementing big data concepts with high-throughput image data processing like radiomics and machine learning in a radiooncology environment to support clinical decisions. METHODS Based on the experience of our interdisciplinary radiomics working group, techniques for processing minable data, extracting radiomics features and associating this information with clinical, physical and biological data for the development of prediction models are described. A special emphasis was placed on the potential clinical significance of such an approach. RESULTS Clinical studies demonstrate the role of radiomics analysis as an additional independent source of information with the potential to influence the radiooncology practice, i.e. to predict patient prognosis, treatment response and underlying genetic changes. Extending the radiomics approach to integrate imaging, clinical, genetic and dosimetric data ('panomics') challenges the medical physicist as member of the radiooncology team. CONCLUSIONS The new field of big data processing in radiooncology offers opportunities to support clinical decisions, to improve predicting treatment outcome and to stimulate fundamental research on radiation response both of tumor and normal tissue. The integration of physical data (e.g. treatment planning, dosimetric, image guidance data) demands an involvement of the medical physicist in the radiomics approach of radiooncology. To cope with this challenge national and international organizations for medical physics should organize more training opportunities in artificial intelligence technologies in radiooncology.
Affiliation(s)
- Jan C Peeken
  - Department of Radiation Oncology, Klinikum rechts der Isar, Technical University of Munich (TUM), Ismaninger Straße 22, 81675 Munich, Germany; Deutsches Konsortium für Translationale Krebsforschung (DKTK), Partner Site Munich, Germany
- Michael Bernhofer
  - Department of Informatics, Technical University of Munich (TUM), Boltzmannstraße 3, 85748 Garching, Germany
- Benedikt Wiestler
  - Department of Neuroradiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Ismaninger Straße 22, 81675 Munich, Germany; Deutsches Konsortium für Translationale Krebsforschung (DKTK), Partner Site Munich, Germany
- Daniel Cremers
  - Department of Informatics, Technical University of Munich (TUM), Boltzmannstraße 3, 85748 Garching, Germany
- Burkhard Rost
  - Department of Informatics, Technical University of Munich (TUM), Boltzmannstraße 3, 85748 Garching, Germany
- Jan J Wilkens
  - Department of Radiation Oncology, Klinikum rechts der Isar, Technical University of Munich (TUM), Ismaninger Straße 22, 81675 Munich, Germany
- Stephanie E Combs
  - Department of Radiation Oncology, Klinikum rechts der Isar, Technical University of Munich (TUM), Ismaninger Straße 22, 81675 Munich, Germany; Institute of Innovative Radiotherapy (iRT), Department of Radiation Sciences (DRS), Helmholtz Zentrum München, Ingolstaedter Landstrasse 1, 85764 Neuherberg, Germany; Deutsches Konsortium für Translationale Krebsforschung (DKTK), Partner Site Munich, Germany
- Fridtjof Nüsslin
  - Department of Radiation Oncology, Klinikum rechts der Isar, Technical University of Munich (TUM), Ismaninger Straße 22, 81675 Munich, Germany.

30
|
Vyas R, Bapat S, Goel P, Karthikeyan M, Tambe SS, Kulkarni BD. Application of Genetic Programming (GP) Formalism for Building Disease Predictive Models from Protein-Protein Interactions (PPI) Data. IEEE/ACM Trans Comput Biol Bioinform 2018; 15:27-37. [PMID: 28113781 DOI: 10.1109/tcbb.2016.2621042] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 06/06/2023]
Abstract
Protein-protein interactions (PPIs) play a vital role in the biological processes involved in the cell functions and disease pathways. The experimental methods known to predict PPIs require tremendous efforts and the results are often hindered by the presence of a large number of false positives. Herein, we demonstrate the use of a new Genetic Programming (GP) based Symbolic Regression (SR) approach for predicting PPIs related to a disease. In a case study, a dataset consisting of one hundred and thirty five PPI complexes related to cancer was used to construct a generic PPI predicting model with good PPI prediction accuracy and generalization ability. A high correlation coefficient (CC) of 0.893, low root mean square error (RMSE) and mean absolute percentage error (MAPE) values of 478.221 and 0.239, respectively were achieved for both the training and test set outputs. To validate the discriminatory nature of the model, it was applied on a dataset of diabetes complexes where it yielded significantly low CC values. Thus, the GP model developed here serves a dual purpose: (a) a predictor of the binding energy of cancer related PPI complexes, and (b) a classifier for discriminating PPI complexes related to cancer from those of other diseases.

31
Kotlyar M, Rossos AEM, Jurisica I. Prediction of Protein-Protein Interactions. Curr Protoc Bioinformatics 2017; 60:8.2.1-8.2.14. [PMID: 29220074 DOI: 10.1002/cpbi.38] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Indexed: 12/18/2022]
Abstract
The authors provide an overview of physical protein-protein interaction prediction, covering the main strategies for predicting interactions, approaches for assessing predictions, and online resources for accessing predictions. This unit focuses on the main advancements in each of these areas over the last decade. The methods and resources that are presented here are not an exhaustive set, but characterize the current state of the field, highlighting key challenges and achievements. © 2017 by John Wiley & Sons, Inc.
Affiliation(s)
- Max Kotlyar
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Andrea E M Rossos
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Igor Jurisica
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
  - Departments of Medical Biophysics and Computer Science, University of Toronto, Ontario, Canada
  - Institute of Neuroimmunology, Slovak Academy of Sciences, Bratislava, Slovakia

32
|
Tramontano A. The computational prediction of protein assemblies. Curr Opin Struct Biol 2017; 46:170-175. [PMID: 29102305 DOI: 10.1016/j.sbi.2017.10.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/03/2017] [Revised: 10/04/2017] [Accepted: 10/05/2017] [Indexed: 10/18/2022]
Abstract
The function of proteins in the cell is almost always mediated by their interaction with different partners, including other proteins, nucleic acids or small organic molecules. The ability of identifying all of them is an essential step in our quest for understanding life at the molecular level. The inference of the protein complex composition and of its molecular details can also provide relevant clues for the development and the design of drugs. In this short review, I will discuss the computational aspects of the analysis and prediction of protein-protein assemblies and discuss some of the most recent developments as seen in the last Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment.
Affiliation(s)
- Anna Tramontano
  - Physics Department, Sapienza University of Rome, Piazzale Aldo Moro, 5 I-00185 Roma, Italy; Istituto Pasteur - Fondazione Cenci Bolognetti, Sapienza University of Rome, Piazzale Aldo Moro, 5 I-00185 Roma, Italy

33
|
Choi D, Park B, Chae H, Lee W, Han K. Predicting protein-binding regions in RNA using nucleotide profiles and compositions. BMC Syst Biol 2017; 11:16. [PMID: 28361677 PMCID: PMC5374631 DOI: 10.1186/s12918-017-0386-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 11/10/2022]
Abstract
Background Motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most computational methods are limited to finding RNA-binding sites in proteins instead of protein-binding sites in RNAs. Predicting protein-binding sites in RNA is more challenging than predicting RNA-binding sites in proteins. Recent computational methods for finding protein-binding sites in RNAs have several drawbacks for practical use. Results We developed a new support vector machine (SVM) model for predicting protein-binding regions in mRNA sequences. The model uses sequence profiles constructed from log-odds scores of mono- and di-nucleotides and nucleotide compositions. The model was evaluated by standard 10-fold cross validation, leave-one-protein-out (LOPO) cross validation and independent testing. Since actual mRNA sequences have more non-binding regions than protein-binding regions, we tested the model on several datasets with different ratios of protein-binding regions to non-binding regions. The best performance of the model was obtained in a balanced dataset of positive and negative instances. 10-fold cross validation with a balanced dataset achieved a sensitivity of 91.6%, a specificity of 92.4%, an accuracy of 92.0%, a positive predictive value (PPV) of 91.7%, a negative predictive value (NPV) of 92.3% and a Matthews correlation coefficient (MCC) of 0.840. LOPO cross validation showed a lower performance than the 10-fold cross validation, but the performance remains high (87.6% accuracy and 0.752 MCC). In testing the model on independent datasets, it achieved an accuracy of 82.2% and an MCC of 0.656. Testing of our model and other state-of-the-art methods on a same dataset showed that our model is better than the others. 
Conclusions Sequence profiles of log-odds scores of mono- and di-nucleotides were much more powerful features than nucleotide compositions in finding protein-binding regions in RNA sequences. But, a slight performance gain was obtained when using the sequence profiles along with nucleotide compositions. These are preliminary results of ongoing research, but demonstrate the potential of our approach as a powerful predictor of protein-binding regions in RNA. The program and supporting data are available at http://bclab.inha.ac.kr/RBPbinding.
Affiliation(s)
- Daesik Choi
  - Department of Computer Science and Engineering, Inha University, Incheon, 22212, South Korea
- Byungkyu Park
  - Department of Computer Science and Engineering, Inha University, Incheon, 22212, South Korea
- Hanju Chae
  - Department of Computer Science and Engineering, Inha University, Incheon, 22212, South Korea
- Wook Lee
  - Department of Computer Science and Engineering, Inha University, Incheon, 22212, South Korea
- Kyungsook Han
  - Department of Computer Science and Engineering, Inha University, Incheon, 22212, South Korea.

34
|
Kuo TH, Li KB. Predicting Protein-Protein Interaction Sites Using Sequence Descriptors and Site Propensity of Neighboring Amino Acids. Int J Mol Sci 2016; 17:1788. [PMID: 27792167 PMCID: PMC5133789 DOI: 10.3390/ijms17111788] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 09/08/2016] [Revised: 10/14/2016] [Accepted: 10/18/2016] [Indexed: 12/17/2022] Open
Abstract
Information about the interface sites of Protein–Protein Interactions (PPIs) is useful in many areas of biological research. However, despite the advancement of experimental techniques, the identification of PPI sites remains a challenging task. Using a statistical learning technique, we propose a computational tool for predicting PPI sites. As an alternative to similar approaches requiring structural information, the proposed method takes all of its input from protein sequences. In addition to typical sequence features, our method takes into consideration that interaction sites are not randomly distributed over the protein sequence. We characterized this positional preference using protein complexes with known structures, proposed a numerical index to estimate the propensity and then incorporated the index into a learning system. The resulting predictor, without using structural information, yields an area under the ROC curve (AUC) of 0.675, recall of 0.597, precision of 0.311 and accuracy of 0.583 in a ten-fold cross-validation experiment. This performance is comparable to that of the previous approach in which structural information was used. Upon introducing the B-factor data to our predictor, we demonstrated that the AUC can be further improved to 0.750. The tool is accessible at http://bsaltools.ym.edu.tw/predppis.
Affiliation(s)
- Tzu-Hao Kuo
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei 112, Taiwan.
- Kuo-Bin Li
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei 112, Taiwan.
- Office of Information Management, National Yang-Ming University Hospital, Yilan 260, Taiwan.
|
35
|
Rost B, Radivojac P, Bromberg Y. Protein function in precision medicine: deep understanding with machine learning. FEBS Lett 2016; 590:2327-41. [PMID: 27423136 PMCID: PMC5937700 DOI: 10.1002/1873-3468.12307] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 06/27/2016] [Revised: 07/12/2016] [Accepted: 07/12/2016] [Indexed: 12/21/2022]
Abstract
Precision medicine and personalized health efforts propose leveraging complex molecular, medical and family history, along with other types of personal data, toward a better life. We argue that this ambitious objective will require advanced and specialized machine learning solutions. Simply skimming some low-hanging results off the data wealth might have limited potential. Instead, we need to better understand all parts of the system to define medically relevant causes and effects: how do particular sequence variants affect particular proteins and pathways? How do these effects, in turn, cause the health or disease-related phenotype? Toward this end, deeper understanding will not simply diffuse from deeper machine learning, but from a more explicit focus on understanding protein function, context-specific protein interaction networks, and the impact of variation on both.
Affiliation(s)
- Burkhard Rost
- Department of Informatics and Bioinformatics, Institute for Advanced Studies, Technical University of Munich, Garching, Germany
- Predrag Radivojac
- School of Informatics and Computing, Indiana University, Bloomington, IN, USA
- Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA
|
36
|
Kim B, Alguwaizani S, Zhou X, Huang DS, Park B, Han K. An improved method for predicting interactions between virus and human proteins. J Bioinform Comput Biol 2016; 15:1650024. [PMID: 27397631 DOI: 10.1142/s0219720016500244] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 01/26/2023]
Abstract
The interaction of virus proteins with host proteins plays a key role in viral infection and consequent pathogenesis. Many computational methods have been proposed to predict protein-protein interactions (PPIs), but most of them are intended for PPIs within a species rather than PPIs across different species, such as virus-host PPIs. We developed a method that encodes key features of variable-length virus and human proteins as a feature vector of fixed length. The key features include the relative frequency of amino acid triplets (RFAT), the frequency difference of amino acid triplets (FDAT) between virus and host proteins, and amino acid composition (AC). We constructed several support vector machine (SVM) models to evaluate our method and to compare it with others on PPIs between human and two types of viruses: human papillomaviruses (HPV) and hepatitis C virus (HCV). Comparison of our method with others on the same datasets of HPV-human PPIs and HCV-human PPIs showed that the performance of our method is significantly higher than that of the others in all performance measures. Using the SVM model with gene ontology (GO) annotations of proteins, we predicted new HPV-human PPIs. We believe our approach will be useful in predicting heterogeneous PPIs.
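One way to picture the fixed-length encoding described in this abstract is to count overlapping amino acid triplets and normalize by the number of triplets. The sketch below is an illustration only: the function names, the plain 20-letter alphabet (8,000 triplets), the normalization, and the toy sequences are all assumptions, and the paper's exact definitions of RFAT, FDAT and AC may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20^3 = 8000 ordered triplets, each mapped to a fixed vector position.
TRIPLET_INDEX = {"".join(t): i for i, t in enumerate(product(AMINO_ACIDS, repeat=3))}


def triplet_frequencies(seq):
    """Relative frequency of each overlapping triplet (assumed RFAT-like feature)."""
    counts = [0] * len(TRIPLET_INDEX)
    total = 0
    for i in range(len(seq) - 2):
        idx = TRIPLET_INDEX.get(seq[i:i + 3])
        if idx is not None:  # skip triplets containing non-standard residues
            counts[idx] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * len(counts)


def composition(seq):
    """Fraction of each of the 20 standard amino acids (AC)."""
    valid = [c for c in seq if c in AMINO_ACIDS]
    n = len(valid)
    return [valid.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def pair_features(virus_seq, host_seq):
    """Concatenate triplet frequencies, their difference, and compositions."""
    fv = triplet_frequencies(virus_seq)
    fh = triplet_frequencies(host_seq)
    diff = [a - b for a, b in zip(fv, fh)]  # assumed FDAT-like feature
    return fv + fh + diff + composition(virus_seq) + composition(host_seq)


# Toy sequences of different lengths yield vectors of identical length.
features = pair_features("MKVLAAGIV", "MSTNPKPQRKTKRNTNRRPQDVK")
print(len(features))
```

Because every protein pair maps to the same number of dimensions regardless of sequence length, the resulting vectors can be passed directly to a classifier such as an SVM.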
Affiliation(s)
- Byungmin Kim
- Department of Computer Science and Engineering, Inha University, Incheon 22212, South Korea
- Saud Alguwaizani
- Department of Computer Science and Engineering, Inha University, Incheon 22212, South Korea
- Xiang Zhou
- Department of Computer Science and Engineering, Inha University, Incheon 22212, South Korea
- De-Shuang Huang
- Machine Learning and Systems Biology Lab, College of Electronics and Information Engineering, Tongji University, Shanghai 201804, P. R. China
- Byunkyu Park
- Department of Computer Science and Engineering, Inha University, Incheon 22212, South Korea
- Kyungsook Han
- Department of Computer Science and Engineering, Inha University, Incheon 22212, South Korea
|
37
|
Wu MY, Zhang XF, Dai DQ, Ou-Yang L, Zhu Y, Yan H. Regularized logistic regression with network-based pairwise interaction for biomarker identification in breast cancer. BMC Bioinformatics 2016; 17:108. [PMID: 26921029 PMCID: PMC4769543 DOI: 10.1186/s12859-016-0951-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 08/06/2015] [Accepted: 01/28/2016] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND To facilitate advances in personalized medicine, it is important to detect predictive, stable and interpretable biomarkers related to different clinical characteristics. These clinical characteristics may be heterogeneous with respect to underlying interactions between genes. Traditional methods usually focus on detecting differentially expressed genes without taking the interactions between genes into account. Moreover, due to the typically low reproducibility of the selected biomarkers, it is difficult to give a clear biological interpretation for a specific disease. Therefore, it is necessary to design a robust biomarker identification method that can predict disease-associated interactions with high reproducibility. RESULTS In this article, we propose a regularized logistic regression model. Unlike previous methods, which focus on individual genes or modules, our model takes into account gene pairs that are connected in a protein-protein interaction network. A line graph is constructed to represent the adjacencies between pairwise interactions. Based on this line graph, we incorporate the degree information in the model via an adaptive elastic net, which makes our model less dependent on the expression data. Experimental results on six publicly available breast cancer datasets show that our method not only achieves competitive performance in classification, but also retains great stability in variable selection. Therefore, our model is able to identify the diagnostic and prognostic biomarkers in a more robust way. Moreover, most of the biomarkers discovered by our model have been verified in biochemical or biomedical research. CONCLUSIONS The proposed method shows promise in the diagnosis of disease pathogenesis with different clinical characteristics. These advances lead to more accurate and stable biomarker discovery, which can monitor the functional changes that are perturbed by diseases. Based on these predictions, researchers may be able to provide suggestions for new therapeutic approaches.
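The line-graph construction mentioned in this abstract can be illustrated on a toy network: each interaction (gene pair) becomes a node, and two such nodes are adjacent when the interactions share a gene. The sketch below is a minimal illustration with hypothetical gene names; how the resulting degrees enter the adaptive elastic net penalty is not specified in the abstract and is not modeled here.

```python
from itertools import combinations


def line_graph(edges):
    """Build the line graph of an undirected network given as gene pairs."""
    nodes = [tuple(sorted(e)) for e in edges]
    adjacency = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        if set(a) & set(b):  # the two interactions share at least one gene
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Hypothetical PPI network with four genes and four interactions.
ppi_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")]
lg = line_graph(ppi_edges)

# Degree of each gene pair in the line graph, i.e. how many other
# interactions it touches; this is the kind of quantity a degree-aware
# penalty could use as a weight.
degrees = {pair: len(neighbors) for pair, neighbors in lg.items()}
for pair, degree in degrees.items():
    print(pair, degree)
```

For larger networks, a graph library would be more efficient than the quadratic pairwise comparison used here.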
Affiliation(s)
- Meng-Yun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Guoding Road, Shanghai, 200433, China; Key Laboratory of Mathematical Economics SUFE, Ministry of Education, Guoding Road, Shanghai, 200433, China.
- Xiao-Fei Zhang
- School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Luoyu Road, Wuhan, 430079, China.
- Dao-Qing Dai
- Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Xingang West Road, Guangzhou, 510275, China.
- Le Ou-Yang
- College of Information Engineering, Shenzhen University, Nanhai Avenue, Shenzhen, 518060, China.
- Yuan Zhu
- School of Automation, China University of Geosciences, Lumo Road, Wuhan, 430074, China.
- Hong Yan
- Department of Electronic and Engineering, City University of Hong Kong, Tat Chee Avenue, Hong Kong, 999077, China.
|
38
|
Abbasi WA, Minhas FUAA. Issues in performance evaluation for host-pathogen protein interaction prediction. J Bioinform Comput Biol 2016; 14:1650011. [PMID: 26932275 DOI: 10.1142/s0219720016500116] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 12/18/2022]
Abstract
The study of interactions between host and pathogen proteins is important for understanding the underlying mechanisms of infectious diseases and for developing novel therapeutic solutions. Wet-lab techniques for detecting protein-protein interactions (PPIs) can benefit from computational predictions. Machine learning is one of the computational approaches that can assist biologists by predicting promising PPIs. A number of machine learning based methods for predicting host-pathogen interactions (HPI) have been proposed in the literature. The techniques used for assessing the accuracy of such predictors are of critical importance in this domain. In this paper, we question the effectiveness of K-fold cross-validation for estimating the generalization ability of HPI prediction for proteins with no known interactions. K-fold cross-validation does not model this scenario, and we demonstrate a sizable difference between its performance and the performance of an alternative evaluation scheme called leave one pathogen protein out (LOPO) cross-validation. LOPO is more effective in modeling the real world use of HPI predictors, specifically for cases in which no information about the interacting partners of a pathogen protein is available during training. We also point out that currently used metrics such as areas under the precision-recall or receiver operating characteristic curves are not intuitive to biologists and propose simpler and more directly interpretable metrics for this purpose.
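The difference between K-fold and LOPO evaluation comes down to how examples are grouped before splitting. The sketch below shows a leave-one-pathogen-protein-out split in plain Python; the function name `lopo_splits` and the protein identifiers are hypothetical, and this is an illustration of the grouping idea rather than the authors' implementation.

```python
from collections import defaultdict


def lopo_splits(pairs):
    """Yield (held_out_pathogen, train_indices, test_indices) for each pathogen protein.

    `pairs` is a list of (pathogen_protein, host_protein) examples. All
    examples involving the held-out pathogen protein go to the test fold,
    so the model never sees that protein during training.
    """
    by_pathogen = defaultdict(list)
    for i, (pathogen, _host) in enumerate(pairs):
        by_pathogen[pathogen].append(i)
    for pathogen, test_idx in by_pathogen.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(pairs)) if i not in held_out]
        yield pathogen, train_idx, test_idx


# Hypothetical interaction examples: pathogen proteins P1..P3, host proteins H1..H3.
pairs = [("P1", "H1"), ("P1", "H2"), ("P2", "H1"), ("P3", "H3"), ("P3", "H1")]
for pathogen, train_idx, test_idx in lopo_splits(pairs):
    print(pathogen, "train:", train_idx, "test:", test_idx)
```

A random K-fold split of the same five examples could place ("P1", "H1") in training and ("P1", "H2") in testing, which is the situation the abstract argues does not reflect prediction for a pathogen protein with no known interactions.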
Affiliation(s)
- Wajid Arshad Abbasi
- Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences (PIEAS), Nilore, Islamabad, Pakistan
- Fayyaz Ul Amir Afsar Minhas
- Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences (PIEAS), Nilore, Islamabad, Pakistan
|
39
|
Esmaielbeiki R, Krawczyk K, Knapp B, Nebel JC, Deane CM. Progress and challenges in predicting protein interfaces. Brief Bioinform 2016; 17:117-31. [PMID: 25971595 PMCID: PMC4719070 DOI: 10.1093/bib/bbv027] [Citation(s) in RCA: 85] [Impact Index Per Article: 9.4] [Received: 01/29/2015] [Revised: 03/18/2015] [Indexed: 12/31/2022] Open
Abstract
The majority of biological processes are mediated via protein-protein interactions. Determination of residues participating in such interactions improves our understanding of molecular mechanisms and facilitates the development of therapeutics. Experimental approaches to identifying interacting residues, such as mutagenesis, are costly and time-consuming; thus, computational methods for this purpose could streamline conventional pipelines. Here we review the field of computational protein interface prediction. We make a distinction between methods which address proteins in general and those targeted at antibodies, owing to the radically different binding mechanism of antibodies. We organize the multitude of currently available methods hierarchically based on required input and prediction principles to provide an overview of the field.
|
40
|
Madhukar NS, Elemento O, Pandey G. Prediction of Genetic Interactions Using Machine Learning and Network Properties. Front Bioeng Biotechnol 2015; 3:172. [PMID: 26579514 PMCID: PMC4620407 DOI: 10.3389/fbioe.2015.00172] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 09/01/2015] [Accepted: 10/12/2015] [Indexed: 12/04/2022] Open
Abstract
A genetic interaction (GI) is a type of interaction where the effect of one gene is modified by the effect of one or several other genes. These interactions are important for delineating functional relationships among genes and their corresponding proteins, as well as elucidating complex biological processes and diseases. An important type of GI, synthetic sickness or synthetic lethality, involves two or more genes, where the loss of either gene alone has little impact on cell viability, but the combined loss of all genes leads to a severe decrease in fitness (sickness) or cell death (lethality). The identification of GIs is an important problem because it can help delineate pathways, protein complexes, and regulatory dependencies. Synthetic lethal interactions have important clinical and biological significance, such as providing therapeutically exploitable weaknesses in tumors. While near-systematic high-content screening for GIs is possible in single-celled organisms such as yeast, the systematic discovery of GIs is extremely difficult in mammalian cells. Therefore, there is a great need for computational approaches to reliably predict GIs, including synthetic lethal interactions, in these organisms. Here, we review the state-of-the-art approaches, strategies, and rigorous evaluation methods for learning and predicting GIs, both under general (healthy/standard laboratory) conditions and under specific contexts, such as diseases.
Affiliation(s)
- Neel S Madhukar
- Department of Physiology and Biophysics, Meyer Cancer Center, Institute for Precision Medicine and Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA; Tri-Institutional Training Program in Computational Biology and Medicine, New York, NY, USA
- Olivier Elemento
- Department of Physiology and Biophysics, Meyer Cancer Center, Institute for Precision Medicine and Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA; Tri-Institutional Training Program in Computational Biology and Medicine, New York, NY, USA
- Gaurav Pandey
- Department of Genetics and Genomic Sciences and Graduate School of Biomedical Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
|