1
Luo Y, Zheng X, Qiu M, Gou Y, Yang Z, Qu X, Chen Z, Lin Y. Deep learning and its applications in nuclear magnetic resonance spectroscopy. Prog Nucl Magn Reson Spectrosc 2025; 146-147:101556. [PMID: 40306798 DOI: 10.1016/j.pnmrs.2024.101556] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/19/2024] [Revised: 12/26/2024] [Accepted: 12/30/2024] [Indexed: 05/02/2025]
Abstract
Nuclear Magnetic Resonance (NMR), as an advanced technology, has widespread applications in various fields like chemistry, biology, and medicine. However, issues such as long acquisition times for multidimensional spectra and low sensitivity limit the broader application of NMR. Traditional algorithms aim to address these issues but have limitations in speed and accuracy. Deep Learning (DL), a branch of Artificial Intelligence (AI) technology, has shown remarkable success in many fields including NMR. This paper presents an overview of the basics of DL and current applications of DL in NMR, highlights existing challenges, and suggests potential directions for improvement.
Affiliation(s)
- Yao Luo
  - Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, Department of Electronic Science, State Key Laboratory of Physical Chemistry of Solid Surfaces, Xiamen University, Xiamen 361005, China
- Xiaoxu Zheng
  - Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, Department of Electronic Science, State Key Laboratory of Physical Chemistry of Solid Surfaces, Xiamen University, Xiamen 361005, China
- Mengjie Qiu
  - Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, Department of Electronic Science, State Key Laboratory of Physical Chemistry of Solid Surfaces, Xiamen University, Xiamen 361005, China
- Yaoping Gou
  - Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, Department of Electronic Science, State Key Laboratory of Physical Chemistry of Solid Surfaces, Xiamen University, Xiamen 361005, China
- Zhengxian Yang
  - Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, Department of Electronic Science, State Key Laboratory of Physical Chemistry of Solid Surfaces, Xiamen University, Xiamen 361005, China
- Xiaobo Qu
  - Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, Department of Electronic Science, State Key Laboratory of Physical Chemistry of Solid Surfaces, Xiamen University, Xiamen 361005, China
- Zhong Chen
  - Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, Department of Electronic Science, State Key Laboratory of Physical Chemistry of Solid Surfaces, Xiamen University, Xiamen 361005, China
- Yanqin Lin
  - Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, Department of Electronic Science, State Key Laboratory of Physical Chemistry of Solid Surfaces, Xiamen University, Xiamen 361005, China
2
Weissenow K, Rost B. Are protein language models the new universal key? Curr Opin Struct Biol 2025; 91:102997. [PMID: 39921962 DOI: 10.1016/j.sbi.2025.102997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2024] [Revised: 12/20/2024] [Accepted: 01/16/2025] [Indexed: 02/10/2025]
Abstract
Protein language models (pLMs) capture some aspects of the grammar of the language of life as written in protein sequences. The so-called pLM embeddings implicitly contain this information. Therefore, embeddings can serve as the exclusive input into downstream supervised methods for protein prediction. Over the last 33 years, evolutionary information extracted through simple averaging for specific protein families from multiple sequence alignments (MSAs) has been the most successful universal key to the success of protein prediction. For many applications, MSA-free pLM-based predictions now have become significantly more accurate. The reason for this is often a combination of two aspects. Firstly, embeddings condense the grammar so efficiently that downstream prediction methods succeed with small models, i.e., they need few free parameters in particular in the era of exploding deep neural networks. Secondly, pLM-based methods provide protein-specific solutions. As additional benefit, once the pLM pre-training is complete, pLM-based solutions tend to consume much fewer resources than MSA-based solutions. In fact, we appeal to the community to rather optimize foundation models than to retrain new ones and to evolve incentives for solutions that require fewer resources even at some loss in accuracy. Although pLMs have not, yet, succeeded to entirely replace the body of solutions developed over three decades, they clearly are rapidly advancing as the universal key for protein prediction.
Affiliation(s)
- Konstantin Weissenow
  - TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Burkhard Rost
  - TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany; TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
3
Zhang J, Qian J, Zou Q, Zhou F, Kurgan L. Recent Advances in Computational Prediction of Secondary and Supersecondary Structures from Protein Sequences. Methods Mol Biol 2025; 2870:1-19. [PMID: 39543027 DOI: 10.1007/978-1-0716-4213-9_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2024]
Abstract
The secondary structures (SSs) and supersecondary structures (SSSs) underlie the three-dimensional structure of proteins. Prediction of the SSs and SSSs from protein sequences enjoys high levels of use and finds numerous applications in the development of a broad range of other bioinformatics tools. Numerous sequence-based predictors of SS and SSS were developed and published in recent years. We survey and analyze 45 SS predictors that were released since 2018, focusing on their inputs, predictive models, scope of their prediction, and availability. We also review 32 sequence-based SSS predictors, which primarily focus on predicting coiled coils and beta-hairpins and which include five methods that were published since 2018. Substantial majority of these predictive tools rely on machine learning models, including a variety of deep neural network architectures. They also frequently use evolutionary sequence profiles. We discuss details of several modern SS and SSS predictors that are currently available to the users and which were published in higher impact venues.
Affiliation(s)
- Jian Zhang
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Jingjing Qian
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Feng Zhou
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang, China
- Lukasz Kurgan
  - Department of Computer Science, College of Engineering, Virginia Commonwealth University, Virginia, VA, USA
4
Badaczewska-Dawid AE, Kolinski A. Importance of Secondary Structure Data in Large Scale Protein Modeling Using Low-Resolution SURPASS Method. Methods Mol Biol 2025; 2867:55-78. [PMID: 39576575 DOI: 10.1007/978-1-0716-4196-5_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2024]
Abstract
Secondary structure elements, such as alpha helices and beta strands, play a fundamental role in defining the overall fold of a protein. Leveraging secondary structure information is essential for encoding the structural features in coarse-grained protein models. Such models simplify the representation of amino acid residues, thereby reducing computational complexity. By incorporating accurate (even if only partial) secondary structure data, the models can efficiently search for the native conformation of proteins and preserve the core structural motifs across extended time frames. Here, the pivotal role of (predicted) secondary structure data in the coarse-grained modeling of protein tertiary and quaternary structures, along with their long-time dynamics, is investigated. Computational simulations of large protein systems using a low-resolution SURPASS model were performed. These case studies demonstrate the sufficiency of predicted secondary structure data in an accurate fold assembly. It leads to a realistic depiction of long-time dynamics in the recorded pseudo-trajectories by employing the Monte Carlo dynamics sampling schema, based on a long random sequence of local conformational modifications. This approach may provide a powerful tool for investigating the critical stages of protein folding. Future combination with knowledge-based potentials derived using machine learning techniques offers exciting opportunities to unravel the underlying mechanisms of biological processes in a variety of molecular complexes.
5
Flamholz ZN, Li C, Kelly L. Improving viral annotation with artificial intelligence. mBio 2024; 15:e0320623. [PMID: 39230289 PMCID: PMC11481560 DOI: 10.1128/mbio.03206-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024] [Open Access]
Abstract
Viruses of bacteria, "phages," are fundamental, poorly understood components of microbial community structure and function. Additionally, their dependence on hosts for replication positions phages as unique sensors of ecosystem features and environmental pressures. High-throughput sequencing approaches have begun to give us access to the diversity and range of phage populations in complex microbial community samples, and metagenomics is currently the primary tool with which we study phage populations. The study of phages by metagenomic sequencing, however, is fundamentally limited by viral diversity, which results in the vast majority of viral genomes and metagenome-annotated genomes lacking annotation. To harness bacteriophages for applications in human and environmental health and disease, we need new methods to organize and annotate viral sequence diversity. We recently demonstrated that methods that leverage self-supervised representation learning can supplement statistical sequence representations for remote viral protein homology detection in the ocean virome and propose that consideration of the functional content of viral sequences allows for the identification of similarity in otherwise sequence-diverse viruses and viral-like elements for biological discovery. In this review, we describe the potential and pitfalls of large language models for viral annotation. We describe the need for new approaches to annotate viral sequences in metagenomes, the fundamentals of what protein language models are and how one can use them for sequence annotation, the strengths and weaknesses of these models, and future directions toward developing better models for viral annotation more broadly.
Affiliation(s)
- Zachary N. Flamholz
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York, USA
- Charlotte Li
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York, USA
- Libusha Kelly
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York, USA
  - Department of Microbiology and Immunology, Albert Einstein College of Medicine, Bronx, New York, USA
6
Sanjeevi M, Mohan A, Ramachandran D, Jeyaraman J, Sekar K. CSSP-2.0: A refined consensus method for accurate protein secondary structure prediction. Comput Biol Chem 2024; 112:108158. [PMID: 39053174 DOI: 10.1016/j.compbiolchem.2024.108158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2023] [Revised: 06/19/2024] [Accepted: 07/18/2024] [Indexed: 07/27/2024]
Abstract
Studying the relationship between sequences and their corresponding three-dimensional structure assists structural biologists in solving the protein-folding problem. Despite several experimental and in-silico approaches, still understanding or decoding the three-dimensional structures from the sequence remains a mystery. In such cases, the accuracy of the structure prediction plays an indispensable role. To address this issue, an updated web server (CSSP-2.0) has been created to improve the accuracy of our previous version of CSSP by deploying the existing algorithms. It uses input as probabilities and predicts the consensus for the secondary structure as a highly accurate three-state Q3 (helix, strand, and coil). This prediction is achieved using six recent top-performing methods: MUFOLD-SS, RaptorX, PSSpred v4, PSIPRED, JPred v4, and Porter 5.0. CSSP-2.0 validation includes datasets involving various protein classes from the PDB, CullPDB, and AlphaFold databases. Our results indicate a significant improvement in the accuracy of the consensus Q3 prediction. Using CSSP-2.0, crystallographers can sort out the stable regular secondary structures from the entire complex structure, which would aid in inferring the functional annotation of hypothetical proteins. The web server is freely available at https://bioserver3.physics.iisc.ac.in/cgi-bin/cssp-2/.
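Note: the abstract says the server takes per-method probabilities and returns a consensus three-state (Q3) call, but it does not state the combination rule. The following is a minimal sketch only, assuming the consensus is the arg-max of the unweighted mean probability across methods. That rule, the function name `consensus_q3`, and the input layout are assumptions made for illustration and are not taken from the published CSSP-2.0 method.

```python
from typing import Sequence

# Three-state alphabet: helix (H), strand (E), coil (C).
STATES = "HEC"


def consensus_q3(predictions: Sequence[Sequence[Sequence[float]]]) -> str:
    """Return a consensus Q3 string.

    predictions[m][i] is the (p_H, p_E, p_C) triple that method m assigns
    to residue i. The consensus here is the state with the highest mean
    probability across methods (an assumed rule, for illustration only).
    """
    if not predictions:
        raise ValueError("need at least one predictor")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictors must cover the same residues")

    calls = []
    for i in range(length):
        # Unweighted mean of each state's probability over all methods.
        mean = [
            sum(p[i][s] for p in predictions) / len(predictions)
            for s in range(len(STATES))
        ]
        best = max(range(len(STATES)), key=mean.__getitem__)
        calls.append(STATES[best])
    return "".join(calls)


if __name__ == "__main__":
    # Two hypothetical predictors over a three-residue fragment.
    method_a = [(0.8, 0.1, 0.1), (0.2, 0.5, 0.3), (0.1, 0.2, 0.7)]
    method_b = [(0.6, 0.2, 0.2), (0.4, 0.3, 0.3), (0.2, 0.1, 0.7)]
    print(consensus_q3([method_a, method_b]))
```

Averaging probabilities rather than taking a majority vote over hard labels lets a confident method outweigh an uncertain one, as at the second residue in the usage above; a weighted mean would be a natural variation if per-method reliabilities were known.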
Affiliation(s)
- Madhumathi Sanjeevi
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India; Structural Biology and Bio-Computing Laboratory, Department of Bioinformatics, Alagappa University, Karaikudi 630004, India
- Ajitha Mohan
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India
- Jeyakanthan Jeyaraman
  - Structural Biology and Bio-Computing Laboratory, Department of Bioinformatics, Alagappa University, Karaikudi 630004, India
- Kanagaraj Sekar
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India
7
Heinzinger M, Rost B. Artificial Intelligence Learns Protein Prediction. Cold Spring Harb Perspect Biol 2024; 16:a041458. [PMID: 38858069 PMCID: PMC11368192 DOI: 10.1101/cshperspect.a041458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2024]
Abstract
From AlphaGO over StableDiffusion to ChatGPT, the recent decade of exponential advances in artificial intelligence (AI) has been altering life. In parallel, advances in computational biology are beginning to decode the language of life: AlphaFold2 leaped forward in protein structure prediction, and protein language models (pLMs) replaced expertise and evolutionary information from multiple sequence alignments with information learned from reoccurring patterns in databases of billions of proteins without experimental annotations other than the amino acid sequences. None of those tools could have been developed 10 years ago; all will increase the wealth of experimental data and speed up the cycle from idea to proof. AI is affecting molecular and medical biology at giant steps, and the most important might be the leap toward more powerful protein design.
Affiliation(s)
- Michael Heinzinger
  - Technical University of Munich (TUM) School of Computation, Information and Technology (CIT), Bioinformatics and Computational Biology - i12, 85748 Garching/Munich, Germany
- Burkhard Rost
  - Technical University of Munich (TUM) School of Computation, Information and Technology (CIT), Bioinformatics and Computational Biology - i12, 85748 Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), 85748 Garching/Munich, Germany
  - TUM School of Life Sciences Weihenstephan (WZW), 85354 Freising, Germany
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
8
Shu P, You G, Li W, Chen Y, Chu Z, Qin D, Wang Y, Zhou H, Zhao L. Cefmetazole sodium as an allosteric effector that regulates the oxygen supply efficiency of adult hemoglobin. J Biomol Struct Dyn 2024; 42:7442-7456. [PMID: 37555593 DOI: 10.1080/07391102.2023.2245043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2023] [Accepted: 07/17/2023] [Indexed: 08/10/2023]
Abstract
Allosteric effectors play an important role in regulating the oxygen supply efficiency of hemoglobin for blood storage and disease treatment. However, allosteric effectors that are approved by the US FDA are limited. In this study, cefmetazole sodium (CS) was found to bind adult hemoglobin (HbA) from FDA library (1338 compounds) using surface plasmon resonance imaging high-throughput screening. Using surface plasmon resonance (SPR), the interaction between CS and HbA was verified. The oxygen dissociation curve of HbA after CS interaction showed a significant increase in P50 and theoretical oxygen-release capacity. Acid-base sensitivity (SI) exhibited a decreasing trend, although not significantly different. An oxygen dissociation assay indicated that CS accelerated HbA deoxygenation. Microfluidic modulated spectroscopy showed that CS changed the ratio of the alpha-helix to the beta-sheet of HbA. Molecular docking suggested CS bound to HbA's β-chains via hydrogen bonds, with key amino acids being N282, K225, H545, K625, K675, and V544. The results of molecular dynamics simulations (MD) revealed a stable orientation of the HbA-CS complex. CS did not significantly affect the P50 of bovine hemoglobin, possibly due to the lack of Valβ1 and Hisβ2, indicating that these were the crucial amino acids involved in HbA's oxygen affinity. Competition between the 2,3-Diphosphoglycerate (2,3-DPG) and CS in the HbA interaction was also determined by SPR, molecular docking and MD. In summary, CS could interact with HbA and regulate the oxygen supply efficiency via forming stable hydrogen bonds with the β-chains of HbA, and showed competition with 2,3-DPG. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Peilin Shu
  - Institute of Health Service and Transfusion Medicine, Academy of Military Medical Sciences, Academy of Military Science of the Chinese People's Liberation Army, Beijing, P.R. C
- Guoxing You
  - Institute of Health Service and Transfusion Medicine, Academy of Military Medical Sciences, Academy of Military Science of the Chinese People's Liberation Army, Beijing, P.R. C
- Weidan Li
  - Institute of Health Service and Transfusion Medicine, Academy of Military Medical Sciences, Academy of Military Science of the Chinese People's Liberation Army, Beijing, P.R. C
- Yuzhi Chen
  - Institute of Health Service and Transfusion Medicine, Academy of Military Medical Sciences, Academy of Military Science of the Chinese People's Liberation Army, Beijing, P.R. C
- Zongtang Chu
  - Institute of Health Service and Transfusion Medicine, Academy of Military Medical Sciences, Academy of Military Science of the Chinese People's Liberation Army, Beijing, P.R. C
- Dong Qin
  - Institute of Health Service and Transfusion Medicine, Academy of Military Medical Sciences, Academy of Military Science of the Chinese People's Liberation Army, Beijing, P.R. C
- Ying Wang
  - Institute of Health Service and Transfusion Medicine, Academy of Military Medical Sciences, Academy of Military Science of the Chinese People's Liberation Army, Beijing, P.R. C
- Hong Zhou
  - Institute of Health Service and Transfusion Medicine, Academy of Military Medical Sciences, Academy of Military Science of the Chinese People's Liberation Army, Beijing, P.R. C
- Lian Zhao
  - Institute of Health Service and Transfusion Medicine, Academy of Military Medical Sciences, Academy of Military Science of the Chinese People's Liberation Army, Beijing, P.R. C
9
Mikulka J, Sen MK, Košnarová P, Hamouz P, Hamouzová K, Sur VP, Šuk J, Bhattacharya S, Soukup J. Molecular Mechanisms of Resistance against PSII-Inhibiting Herbicides in Amaranthus retroflexus from the Czech Republic. Genes (Basel) 2024; 15:904. [PMID: 39062683 PMCID: PMC11275581 DOI: 10.3390/genes15070904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2024] [Revised: 06/25/2024] [Accepted: 07/09/2024] [Indexed: 07/28/2024] [Open Access]
Abstract
Amaranthus retroflexus L. (redroot pigweed) is one of the most problematic weeds in maize, sugar beet, vegetables, and soybean crop fields in Europe. Two pigweed amaranth biotypes (R1 and R2) from the Czech Republic resistant to photosystem II (PSII)-inhibiting herbicides were analyzed in this study. This study aimed to identify the genetic mechanisms that underlie the resistance observed in the biotypes. Additionally, we also intended to establish the use of chlorophyll fluorescence measurement as a rapid and reliable method for confirming herbicide resistance in this weed species. Both biotypes analyzed showed high resistance factors in a dose-response study and were thus confirmed to be resistant to PSII-inhibiting herbicides. A sequence analysis of the D1 protein revealed a well-known Ser-Gly substitution at amino acid position 264 in both biotypes. Molecular docking studies, along with the wild-type and mutant D1 protein's secondary structure analyses, revealed that the S264G mutation did not reduce herbicide affinity but instead indirectly affected the interaction between the target protein and the herbicides. The current study identified the S264G mutation as being responsible for conferring herbicide resistance in the pigweed amaranth biotypes. These findings can provide a strong basis for future studies that might use protein structure and mutation-based approaches to gain further insights into the detailed mechanisms of resistance in this weed species. In many individuals from both biotypes, resistance at a very early stage (BBCH10) of plants was demonstrated several hours after the application of the active ingredients by the chlorophyll fluorescence method. The effective PS II quantum yield parameter can be used as a rapid diagnostic tool for distinguishing between sensitive and resistant plants on an individual level. This method can be useful for identifying herbicide-resistant weed biotypes in the field, which can help farmers and weed management practitioners develop more effective weed control tactics.
Affiliation(s)
- Jakub Mikulka
  - Department of Agroecology and Crop Production, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Kamýcká 1176, 165 00 Prague, Czech Republic
- Madhab Kumar Sen
  - Department of Agroecology and Crop Production, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Kamýcká 1176, 165 00 Prague, Czech Republic
- Pavlína Košnarová
  - Department of Agroecology and Crop Production, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Kamýcká 1176, 165 00 Prague, Czech Republic
- Pavel Hamouz
  - Department of Agroecology and Crop Production, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Kamýcká 1176, 165 00 Prague, Czech Republic
- Kateřina Hamouzová
  - Department of Agroecology and Crop Production, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Kamýcká 1176, 165 00 Prague, Czech Republic
- Vishma Pratap Sur
  - Institute of Microbiology, The Czech Academy of Sciences, Centre Algatech, Novohradská 237-Opatovický Mlýn, 379 01 Třebon, Czech Republic
- Jaromír Šuk
  - Department of Agroecology and Crop Production, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Kamýcká 1176, 165 00 Prague, Czech Republic
- Soham Bhattacharya
  - Department of Agroecology and Crop Production, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Kamýcká 1176, 165 00 Prague, Czech Republic
- Josef Soukup
  - Department of Agroecology and Crop Production, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Kamýcká 1176, 165 00 Prague, Czech Republic
10
Spadaro A, Sharma A, Dehzangi I. Predicting lysine methylation sites using a convolutional neural network. Methods 2024; 226:127-132. [PMID: 38604414 DOI: 10.1016/j.ymeth.2024.04.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Revised: 12/15/2023] [Accepted: 04/07/2024] [Indexed: 04/13/2024] [Open Access]
Abstract
Protein lysine methylation is a particular type of post translational modification that plays an important role in both histone and non-histone function regulation in proteins. Deregulation caused by lysine methyltransferases has been identified as the cause of several diseases including cancer as well as both mental and developmental disorders. Identifying lysine methylation sites is a critical step in both early diagnosis and drug design. This study proposes a new Machine Learning method called CNN-Meth for predicting lysine methylation sites using a convolutional neural network (CNN). Our model is trained using evolutionary, structural, and physicochemical-based presentation along with binary encoding. Unlike previous studies, instead of extracting handcrafted features, we use CNN to automatically extract features from different presentations of amino acids to avoid information loss. Automated feature extraction from these representations of amino acids as well as CNN as a classifier have never been used for this problem. Our results demonstrate that CNN-Meth can significantly outperform previous methods for predicting methylation sites. It achieves 96.0%, 85.1%, 96.4%, and 0.65 in terms of Accuracy, Sensitivity, Specificity, and Matthew's Correlation Coefficient (MCC), respectively. CNN-Meth and its source code are publicly available at https://github.com/MLBC-lab/CNN-Meth.
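Note: the four figures quoted in the abstract (Accuracy, Sensitivity, Specificity, MCC) all derive from a binary confusion matrix. The sketch below uses the standard textbook definitions. The function name `binary_metrics` is hypothetical, and the counts in the usage are invented for illustration; they are not CNN-Meth results.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion counts.

    tp/fn: positives (e.g. methylated sites) predicted correctly/incorrectly.
    tn/fp: negatives predicted correctly/incorrectly.
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    accuracy = (tp + tn) / total
    # Sensitivity (recall): fraction of true positives that were found.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives that were rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # Matthews correlation coefficient; defined as 0 when any marginal is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    for name, value in binary_metrics(tp=40, fp=5, tn=45, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Because MCC uses all four cells of the confusion matrix, it can sit well below accuracy and specificity when the classes are imbalanced, which is one common reason for reporting it alongside the other three.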
Affiliation(s)
- Austin Spadaro
  - Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, United States
- Alok Sharma
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Iman Dehzangi
  - Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, United States; Department of Computer Science, Rutgers University, Camden, NJ, United States
11
Broz M, Jukič M, Bren U. Naive Prediction of Protein Backbone Phi and Psi Dihedral Angles Using Deep Learning. Molecules 2023; 28:7046. [PMID: 37894526 PMCID: PMC10609058 DOI: 10.3390/molecules28207046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Revised: 10/06/2023] [Accepted: 10/09/2023] [Indexed: 10/29/2023] [Open Access]
Abstract
Protein structure prediction represents a significant challenge in the field of bioinformatics, with the prediction of protein structures using backbone dihedral angles recently achieving significant progress due to the rise of deep neural network research. However, there is a trend in protein structure prediction research to employ increasingly complex neural networks and contributions from multiple models. This study, on the other hand, explores how a single model transparently behaves using sequence data only and what can be expected from the predicted angles. To this end, the current paper presents data acquisition, deep learning model definition, and training toward the final protein backbone angle prediction. The method applies a simple fully connected neural network (FCNN) model that takes only the primary structure of the protein with a sliding window of size 21 as input to predict protein backbone ϕ and ψ dihedral angles. Despite its simplicity, the model shows surprising accuracy for the ϕ angle prediction and somewhat lower accuracy for the ψ angle prediction. Moreover, this study demonstrates that protein secondary structure prediction is also possible with simple neural networks that take in only the protein amino-acid residue sequence, but more complex models are required for higher accuracies.
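Note: the abstract states that the network sees only the primary structure through a sliding window of size 21. A minimal sketch of how such per-residue inputs might be built follows. The one-hot encoding, the zero padding at the chain ends, and the name `window_features` are assumptions made for illustration; the abstract does not specify the encoding the authors used.

```python
# The 20 standard amino acids, in an arbitrary but fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def window_features(sequence: str, window: int = 21) -> list:
    """Build one flat one-hot vector per residue from a centered window.

    Each residue yields a vector of length window * 20. Positions that fall
    outside the chain, or hold a non-standard residue, stay all-zero
    (an assumed padding convention).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    n_types = len(AMINO_ACIDS)

    features = []
    for center in range(len(sequence)):
        vec = [0.0] * (window * n_types)
        for offset in range(-half, half + 1):
            pos = center + offset
            if 0 <= pos < len(sequence):
                aa_index = INDEX.get(sequence[pos])
                if aa_index is not None:
                    # Slot (offset + half) of the window, channel aa_index.
                    vec[(offset + half) * n_types + aa_index] = 1.0
        features.append(vec)
    return features


if __name__ == "__main__":
    feats = window_features("MKTAYIAKQR")
    print(len(feats), "residues,", len(feats[0]), "inputs per residue")
```

Each vector would then be the input to a fully connected network predicting that residue's ϕ and ψ. A common design choice, not confirmed by the abstract, is to regress the sine and cosine of each angle rather than the raw angle, which avoids the discontinuity at ±180°.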
Affiliation(s)
- Matic Broz
  - Faculty of Chemistry and Chemical Engineering, University of Maribor, Smetanova ulica 17, SI-2000 Maribor, Slovenia
- Marko Jukič
  - Faculty of Chemistry and Chemical Engineering, University of Maribor, Smetanova ulica 17, SI-2000 Maribor, Slovenia
  - Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaška ulica 8, SI-6000 Koper, Slovenia
  - Institute of Environmental Protection and Sensors, Beloruska ulica 7, SI-2000 Maribor, Slovenia
- Urban Bren
  - Faculty of Chemistry and Chemical Engineering, University of Maribor, Smetanova ulica 17, SI-2000 Maribor, Slovenia
  - Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaška ulica 8, SI-6000 Koper, Slovenia
  - Institute of Environmental Protection and Sensors, Beloruska ulica 7, SI-2000 Maribor, Slovenia
12
Jin C, Patel A, Peters J, Hodawadekar S, Kalyanaraman R. Quantum Cascade Laser Based Infrared Spectroscopy: A New Paradigm for Protein Secondary Structure Measurement. Pharm Res 2023; 40:1507-1517. [PMID: 36329374 DOI: 10.1007/s11095-022-03422-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2022] [Accepted: 10/19/2022] [Indexed: 11/06/2022]
Abstract
Mid-infrared spectroscopy is one of the major analytical techniques employed for measurements of protein structure in solution. Traditional Fourier Transform-Infrared (FT-IR) measurement is limited by its blackbody light source, which is inherently spatially incoherent and has low optical power output. This limitation is pronounced when working with proteins in aqueous solutions. Strong absorbance of water in the protein amide I region (1600-1700 cm⁻¹) restricts the light path length to <10 μm and imposes significant experimental challenges in sample and flow cell handling. Emerging laser spectroscopic techniques use a high-power coherent laser as the light source, which overcomes this limitation of FT-IR measurement. In this study, we employed an innovative infrared spectrometer that uses a quantum cascade laser (QCL) as its light source. Continuous infrared radiation from this laser source can be swiftly swept within the amide I region (1600-1700 cm⁻¹) and amide II region (1500-1600 cm⁻¹), which makes this technique ideal for protein secondary structure study. Protein solutions as low as 0.5 mg/mL were measured rapidly without any sample preparation. Infrared spectra of model proteins were thus collected, and a chemometric model based on partial least squares regression was developed to quantify α-helix and β-strand motifs in protein secondary structure. The model was applied to measurement of the native secondary structure of commercial therapeutic proteins and bovine serum albumin (BSA) and in thermal degradation studies.
Affiliation(s)
- Chunguang Jin
- Global Quality Analytical Science & Technology, Bristol Myers Squibb, New Brunswick, New Jersey, 08901, USA.
| | - Amrish Patel
- Global Quality Analytical Science & Technology, Bristol Myers Squibb, New Brunswick, New Jersey, 08901, USA
| | - Jeremy Peters
- Global Quality Analytical Science & Technology, Bristol Myers Squibb, New Brunswick, New Jersey, 08901, USA
| | | | - Ravi Kalyanaraman
- Global Quality Analytical Science & Technology, Bristol Myers Squibb, New Brunswick, New Jersey, 08901, USA.
|
13
|
Shea A, Bartz J, Zhang L, Dong X. Predicting mutational function using machine learning. MUTATION RESEARCH. REVIEWS IN MUTATION RESEARCH 2023; 791:108457. [PMID: 36965820 PMCID: PMC10239318 DOI: 10.1016/j.mrrev.2023.108457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2022] [Revised: 03/11/2023] [Accepted: 03/20/2023] [Indexed: 03/27/2023]
Abstract
Genetic variations are one of the major causes of phenotypic variations between human individuals. Although beneficial as being the substrate of evolution, germline mutations may cause diseases, including Mendelian diseases and complex diseases such as diabetes and heart diseases. Mutations occurring in somatic cells are a main cause of cancer and likely cause age-related phenotypes and other age-related diseases. Because of the high abundance of genetic variations in the human genome, i.e., millions of germline variations per human subject and thousands of additional somatic mutations per cell, it is technically challenging to experimentally verify the function of every possible mutation and their interactions. Significant progress has been made to solve this problem using computational approaches, especially machine learning (ML). Here, we review the progress and achievements made in recent years in this field of research. We classify the computational models in two ways: one according to their prediction goals including protein structural alterations, gene expression changes, and disease risks, and the other according to their methodologies, including non-machine learning methods, classical machine learning methods, and deep neural network methods. For models in each category, we discuss their architecture, prediction accuracy, and potential limitations. This review provides new insights into the applications and future directions of computational approaches in understanding the role of mutations in aging and disease.
Affiliation(s)
- Anthony Shea
- Institute on the Biology of Aging and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA; Department of Genetics, Cell Biology and Development, University of Minnesota, Minneapolis, MN 55455, USA
| | - Josh Bartz
- Institute on the Biology of Aging and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA; Department of Genetics, Cell Biology and Development, University of Minnesota, Minneapolis, MN 55455, USA; Bioinformatics and Computational Biology Program, University of Minnesota, Minneapolis, MN 55455, USA
| | - Lei Zhang
- Institute on the Biology of Aging and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA; Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN 55455, USA
| | - Xiao Dong
- Institute on the Biology of Aging and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA; Department of Genetics, Cell Biology and Development, University of Minnesota, Minneapolis, MN 55455, USA.
|
14
|
Mufassirin MMM, Newton MAH, Sattar A. Artificial intelligence for template-free protein structure prediction: a comprehensive review. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10350-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
|
15
|
Ismi DP, Pulungan R, Afiahayati. Deep learning for protein secondary structure prediction: Pre and post-AlphaFold. Comput Struct Biotechnol J 2022; 20:6271-6286. [PMID: 36420164 PMCID: PMC9678802 DOI: 10.1016/j.csbj.2022.11.012] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/06/2022] [Revised: 11/05/2022] [Accepted: 11/05/2022] [Indexed: 11/13/2022] Open
Abstract
This paper aims to provide a comprehensive review of the trends and challenges of deep neural networks for protein secondary structure prediction (PSSP). In recent years, deep neural networks have become the primary method for protein secondary structure prediction. Previous studies showed that deep neural networks have raised the accuracy of three-state secondary structure prediction to more than 80%. Favored deep learning methods, such as convolutional neural networks, recurrent neural networks, inception networks, and graph neural networks, have been implemented in protein secondary structure prediction. Methods adapted from natural language processing (NLP) and computer vision are also employed, including attention mechanisms, ResNet, and U-shape networks. In the post-AlphaFold era, PSSP studies focus on different objectives, such as enhancing the quality of evolutionary information and exploiting protein language models as the PSSP input. The recent trend of using pre-trained language models as input features for secondary structure prediction provides a new direction for PSSP studies. Moreover, the state-of-the-art accuracy achieved by previous PSSP models is still below its theoretical limit. There is still room for improvement in the field.
Affiliation(s)
- Dewi Pramudi Ismi
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
- Department of Informatics, Faculty of Industrial Technology, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
| | - Reza Pulungan
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
| | - Afiahayati
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
|
16
|
Nacar C. Propensities of Some Amino Acid Pairings in α-Helices Vary with Length. Protein J 2022; 41:551-562. [PMID: 36169766 DOI: 10.1007/s10930-022-10076-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/15/2022] [Indexed: 11/29/2022]
Abstract
The results of secondary structure prediction methods are widely used in applications in biotechnology and bioinformatics. However, the accuracy limit of these methods could be improved up to 92%. One approach to achieve this goal is to harvest information from the primary structure of the peptide. This study aims to contribute to this goal by investigating how the propensity of amino acid pairings for α-helices in globular proteins varies with helix length. (n):(n + 4) residue pairings were determined using a comprehensive peptide data set according to the backbone hydrogen bond criterion, which states that the backbone hydrogen bond is the dominant driving force of protein folding. Helix length is limited to 13 to 26 residues. Findings of this study show that the propensities of ALA:GLY and GLY:GLU pairings for α-helices in globular proteins increase with increasing helix length, but those of ALA:ALA and ALA:VAL decrease. While the frequencies of ILE:ALA, LEU:ALA, LEU:GLN, LEU:GLU, LEU:LEU, MET:ILE and VAL:LEU pairings remain roughly constant with length, 25 residue pairings have varying propensities within narrow helix-length ranges. The remaining pairings have no prominent propensity for α-helices.
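The (n):(n + 4) pairing count over helices of 13 to 26 residues can be sketched as follows. This is only a raw counting step under stated assumptions: helices are read from a per-residue string where `H` marks a helical residue, the function names are hypothetical, and neither the backbone hydrogen bond criterion nor the propensity normalisation used in the study is reproduced.

```python
from collections import Counter


def helix_segments(ss, min_len=13, max_len=26):
    """Yield (start, end) index pairs of 'H' runs whose length is within the limits."""
    start = None
    for i, state in enumerate(ss + "-"):  # sentinel closes a trailing helix
        if state == "H" and start is None:
            start = i
        elif state != "H" and start is not None:
            if min_len <= i - start <= max_len:
                yield start, i
            start = None


def count_i_i4_pairs(sequence, ss, min_len=13, max_len=26):
    """Count (residue n, residue n + 4) pairs that lie inside the same helix."""
    counts = Counter()
    for start, end in helix_segments(ss, min_len, max_len):
        for n in range(start, end - 4):
            counts[(sequence[n], sequence[n + 4])] += 1
    return counts


pairs = count_i_i4_pairs("AAAAGAAAAGAAAAG", "H" * 15)
print(pairs.most_common(3))
```

Grouping these counts by helix length and dividing by the background frequency of each pairing would be one way to approach a length-dependent propensity, though the exact definition used in the paper is not given in the abstract.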
Affiliation(s)
- Cevdet Nacar
- Department of Biophysics, School of Medicine, Marmara University, Istanbul, Turkey.
|
17
|
A multifaceted strategy to improve recombinant expression and structural characterisation of a Trypanosoma invariant surface protein. Sci Rep 2022; 12:12706. [PMID: 35882923 PMCID: PMC9325691 DOI: 10.1038/s41598-022-16958-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2022] [Accepted: 07/19/2022] [Indexed: 11/16/2022] Open
Abstract
Identification of a minimal protein fragment amenable to crystallisation can be time- and labour-intensive, especially if large amounts are required and the protein has a complex fold and functionally important post-translational modifications. In addition, a lack of homologues and structural information can further complicate the design of a minimal expression construct. Recombinant expression in E. coli promises high yields, low costs and fast turnover times, but falls short for many extracellular, eukaryotic proteins. Eukaryotic expression systems provide an alternative but are costly, slow and require special handling and equipment. Using a member of a structurally uncharacterized, eukaryotic receptor family as an example, we employ hydrogen–deuterium exchange mass spectrometry (HDX-MS) guided construct design in conjunction with truncation scanning and targeted expression host switching to identify a minimal expression construct that can be produced with high yields and moderate costs.
|
18
|
Biró B, Zhao B, Kurgan L. Complementarity of the residue-level protein function and structure predictions in human proteins. Comput Struct Biotechnol J 2022; 20:2223-2234. [PMID: 35615015 PMCID: PMC9118482 DOI: 10.1016/j.csbj.2022.05.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2022] [Revised: 05/02/2022] [Accepted: 05/02/2022] [Indexed: 11/24/2022] Open
Abstract
Sequence-based predictors of the residue-level protein function and structure cover a broad spectrum of characteristics including intrinsic disorder, secondary structure, solvent accessibility and binding to nucleic acids. They were catalogued and evaluated in numerous surveys and assessments. However, methods focusing on a given characteristic are studied separately from predictors of other characteristics, while they are typically used on the same proteins. We fill this void by studying complementarity of a representative collection of methods that target different predictions using a large, taxonomically consistent, and low similarity dataset of human proteins. First, we bridge the gap between the communities that develop structure-trained vs. disorder-trained predictors of binding residues. Motivated by a recent study of the protein-binding residue predictions, we empirically find that combining the structure-trained and disorder-trained predictors of the DNA-binding and RNA-binding residues leads to substantial improvements in predictive quality. Second, we investigate whether diverse predictors generate results that accurately reproduce relations between secondary structure, solvent accessibility, interaction sites, and intrinsic disorder that are present in the experimental data. Our empirical analysis concludes that predictions accurately reflect all combinations of these relations. Altogether, this study provides unique insights that support combining results produced by diverse residue-level predictors of protein function and structure.
Affiliation(s)
- Bálint Biró
- Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Gödöllő, Hungary
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| | - Bi Zhao
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
|
19
|
Pritam M, Singh G, Kumar R, Singh SP. Screening of potential antigens from whole proteome and development of multi-epitope vaccine against Rhizopus delemar using immunoinformatics approaches. J Biomol Struct Dyn 2022; 41:2118-2145. [PMID: 35067195 DOI: 10.1080/07391102.2022.2028676] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/14/2023]
Abstract
Mucormycosis is a deadly fungal disease mainly caused by Rhizopus oryzae (strain 99-880), also known as Rhizopus delemar. Previously, mucormycosis occurred in immunocompromised patients, such as those with diabetes mellitus, cancer, or organ transplants, but there has been a drastic increase in mucormycosis cases during the ongoing COVID-19 pandemic. Despite several available therapies and antifungal treatments, the mortality rate of mucormycosis is more than 50%. Currently, there is no vaccine on the market for mucormycosis, so a potential vaccine with high efficacy urgently needs to be developed. In the present study, we have screened 4 genome-derived predicted antigens (GDPA) through sequential filtration of the whole proteome of R. delemar using different benchmarked bioinformatics tools. These 4 GDPA, along with 4 randomly selected experimentally reported antigens (ERA), were used for the prediction of B- and T-cell epitopes and in the design of two potential multi-epitope vaccine candidates which can induce both innate and adaptive immunity against R. delemar. In addition, comparative immune simulation studies and in silico cloning were performed using L. lactis as an expression system for their possible use as oral vaccines. This is the first multi-epitope vaccine designed against R. delemar through systematic pipelined reverse vaccinology and immunoinformatic approaches. Although wet-lab experimental validation of the designed vaccines is required before testing in a preclinical model, the current study will significantly help in reducing the cost of experimentation as well as improving the efficacy of vaccine therapy against mucormycosis and other pathogenic diseases. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Manisha Pritam
- Amity Institute of Biotechnology, Amity University Uttar Pradesh, Lucknow, India
| | - Garima Singh
- Amity Institute of Biotechnology, Amity University Uttar Pradesh, Lucknow, India
| | - Rajnish Kumar
- Amity Institute of Biotechnology, Amity University Uttar Pradesh, Lucknow, India
|
20
|
Newton MAH, Mataeimoghadam F, Zaman R, Sattar A. Secondary structure specific simpler prediction models for protein backbone angles. BMC Bioinformatics 2022; 23:6. [PMID: 34983370 PMCID: PMC8728911 DOI: 10.1186/s12859-021-04525-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/24/2021] [Accepted: 12/07/2021] [Indexed: 11/10/2022] Open
Abstract
Motivation: Protein backbone angle prediction has achieved significant accuracy improvement with the development of deep learning methods. Usually the same deep learning model is used in making predictions for all residues regardless of the categories of secondary structures they belong to. In this paper, we propose to train separate deep learning models for each category of secondary structures. Machine learning methods strive to achieve generality over the training examples and consequently lose accuracy. In this work, we explicitly exploit classification knowledge to restrict generalisation within the specific class of training examples. This is to compensate for the loss of generalisation by exploiting specialisation knowledge in an informed way. Results: The new method named SAP4SS obtains mean absolute error (MAE) values of 15.59, 18.87, 6.03, and 21.71 respectively for four types of backbone angles ϕ, ψ, θ, and τ. Consequently, SAP4SS significantly outperforms existing state-of-the-art methods SAP, OPUS-TASS, and SPOT-1D: the differences in MAE for all four types of angles are from 1.5 to 4.1% compared to the best known results. Availability: SAP4SS along with its data is available from https://gitlab.com/mahnewton/sap4ss.
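Because dihedral angles are periodic, an MAE over angles is usually computed on the shorter arc between prediction and observation. The sketch below shows that convention in degrees; it is an illustrative assumption, since the abstract does not state exactly how SAP4SS computes its MAE, and the function name `angular_mae` is hypothetical.

```python
def angular_mae(predicted, observed):
    """Mean absolute error in degrees, measured along the shorter arc of the circle."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, o in zip(predicted, observed):
        diff = abs(p - o) % 360.0
        total += min(diff, 360.0 - diff)  # wrap-around: 179 vs -179 differ by 2, not 358
    return total / len(predicted)


print(angular_mae([179.0, 10.0], [-179.0, 20.0]))
```

Without the wrap-around step, a prediction of 179° against an observation of -179° would be counted as an error of 358° rather than 2°.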
Affiliation(s)
- M A Hakim Newton
- School of Information and Communication Technology, Griffith University, Brisbane, Australia; Institute of Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
| | | | - Rianon Zaman
- School of Information and Communication Technology, Griffith University, Brisbane, Australia
| | - Abdul Sattar
- School of Information and Communication Technology, Griffith University, Brisbane, Australia; Institute of Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
|
21
|
Miao Z, Wang Q, Xiao X, Kamal GM, Song L, Zhang X, Li C, Zhou X, Jiang B, Liu M. CSI-LSTM: a web server to predict protein secondary structure using bidirectional long short term memory and NMR chemical shifts. JOURNAL OF BIOMOLECULAR NMR 2021; 75:393-400. [PMID: 34510297 DOI: 10.1007/s10858-021-00383-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/07/2021] [Accepted: 09/06/2021] [Indexed: 06/13/2023]
Abstract
Protein secondary structure provides rich structural information, hence the description and understanding of protein structure relies heavily on it. Identification or prediction of secondary structures therefore plays an important role in protein research. In protein NMR studies, it is more convenient to predict secondary structures from chemical shifts than to use the traditional determination methods based on inter-nuclear distances provided by NOESY experiments. In recent years, deep neural networks have improved significantly and have been applied in many research fields. Here we propose a deep neural network based on bidirectional long short term memory (biLSTM) to predict protein 3-state secondary structure using NMR chemical shifts of backbone nuclei. Compared with existing methods, the proposed method shows better prediction accuracy. Based on the proposed method, a web server has been built to provide a protein secondary structure prediction service.
Affiliation(s)
- Zhiwei Miao
- Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, 430071, Wuhan, China
| | - Qianqian Wang
- Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, 430071, Wuhan, China
| | - Xiongjie Xiao
- Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, 430071, Wuhan, China
| | - Ghulam Mustafa Kamal
- Department of Chemistry, Khwaja Fareed University of Engineering & Information Technology, Rahim Yar Khan, Punjab, 64200, Pakistan
| | - Linhong Song
- Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, 430071, Wuhan, China
- University of Chinese Academy of Sciences, Beijing, 10049, China
| | - Xu Zhang
- Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, 430071, Wuhan, China
- University of Chinese Academy of Sciences, Beijing, 10049, China
| | - Conggang Li
- Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, 430071, Wuhan, China
- University of Chinese Academy of Sciences, Beijing, 10049, China
| | - Xin Zhou
- Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, 430071, Wuhan, China
- University of Chinese Academy of Sciences, Beijing, 10049, China
| | - Bin Jiang
- Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, 430071, Wuhan, China.
- University of Chinese Academy of Sciences, Beijing, 10049, China.
| | - Maili Liu
- Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, 430071, Wuhan, China.
- University of Chinese Academy of Sciences, Beijing, 10049, China.
|
22
|
Narayanan A, Dhinojwala A, Joy A. Design principles for creating synthetic underwater adhesives. Chem Soc Rev 2021; 50:13321-13345. [PMID: 34751690 DOI: 10.1039/d1cs00316j] [Citation(s) in RCA: 49] [Impact Index Per Article: 12.3] [Indexed: 01/03/2023]
Abstract
Water and adhesives have a conflicting relationship as demonstrated by the failure of most man-made adhesives in underwater environments. However, living creatures routinely adhere to substrates underwater. For example, sandcastle worms create protective reefs underwater by secreting a cocktail of protein glue that binds mineral particles together, and mussels attach themselves to rocks near tide-swept sea shores using byssal threads formed from their extracellular secretions. Over the past few decades, the physicochemical examination of biological underwater adhesives has begun to decipher the mysteries behind underwater adhesion. These naturally occurring adhesives have inspired the creation of several synthetic materials that can stick underwater - a task that was once thought to be "impossible". This review provides a comprehensive overview of the progress in the science of underwater adhesion over the past few decades. First, we introduce the basic thermodynamic processes and kinetic parameters involved in adhesion. Second, we describe the challenges brought by water when adhering underwater. Third, we explore the adhesive mechanisms showcased by mussels and sandcastle worms to overcome the challenges brought by water. We then present a detailed review of synthetic underwater adhesives that have been reported to date. Finally, we discuss some potential applications of underwater adhesives and the current challenges in the field by using a tandem analysis of the reported chemical structures and their adhesive strength. This review aims to inspire and facilitate the design of novel synthetic underwater adhesives that will, in turn, expand our understanding of the physical and chemical parameters that influence underwater adhesion.
Affiliation(s)
- Amal Narayanan
- School of Polymer Science and Polymer Engineering, The University of Akron, Akron, OH 44325, USA.
| | - Ali Dhinojwala
- School of Polymer Science and Polymer Engineering, The University of Akron, Akron, OH 44325, USA.
| | - Abraham Joy
- School of Polymer Science and Polymer Engineering, The University of Akron, Akron, OH 44325, USA.
|
23
|
Moffat L, Jones DT. Increasing the accuracy of single sequence prediction methods using a deep semi-supervised learning framework. Bioinformatics 2021; 37:3744-3751. [PMID: 34213528 PMCID: PMC8570780 DOI: 10.1093/bioinformatics/btab491] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 03/30/2021] [Revised: 06/08/2021] [Accepted: 06/30/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Over the past 50 years, our ability to model protein sequences with evolutionary information has progressed in leaps and bounds. However, even with the latest deep learning methods, the modelling of a critically important class of proteins, single orphan sequences, remains unsolved. RESULTS By taking a bioinformatics approach to semi-supervised machine learning, we develop Profile Augmentation of Single Sequences (PASS), a simple but powerful framework for building accurate single-sequence methods. To demonstrate the effectiveness of PASS we apply it to the mature field of secondary structure prediction. In doing so we develop S4PRED, the successor to the open-source PSIPRED-Single method, which achieves an unprecedented Q3 score of 75.3% on the standard CB513 test. PASS provides a blueprint for the development of a new generation of predictive methods, advancing our ability to model individual protein sequences. AVAILABILITY AND IMPLEMENTATION The S4PRED model is available as open source software on the PSIPRED GitHub repository (https://github.com/psipred/s4pred), along with documentation. It will also be provided as a part of the PSIPRED web service (http://bioinf.cs.ucl.ac.uk/psipred/). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
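The Q3 score quoted in this abstract is the percentage of residues whose three-state label (helix, strand, coil) is predicted correctly. A minimal sketch, assuming both predictions and reference labels are given as equal-length strings over `H`, `E` and `C`; the function name `q3_score` is hypothetical and is not part of the S4PRED code base.

```python
def q3_score(predicted, observed):
    """Percentage of positions where the three-state labels agree."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


print(round(q3_score("HHHECC", "HHHCCC"), 1))
```

Benchmark figures such as the CB513 result are typically aggregated over all residues or all proteins in the test set; which aggregation a given paper uses should be checked in its methods section.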
Affiliation(s)
- Lewis Moffat
- Department of Computer Science, University College London, London WC1E 6BT, UK
- Biomedical Data Science Laboratory, The Francis Crick Institute, London NW1 1AT, UK
| | - David T Jones
- Department of Computer Science, University College London, London WC1E 6BT, UK
- Biomedical Data Science Laboratory, The Francis Crick Institute, London NW1 1AT, UK
|
24
|
Ho CT, Huang YW, Chen TR, Lo CH, Lo WC. Discovering the Ultimate Limits of Protein Secondary Structure Prediction. Biomolecules 2021; 11:1627. [PMID: 34827624 PMCID: PMC8615938 DOI: 10.3390/biom11111627] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 09/21/2021] [Revised: 10/25/2021] [Accepted: 10/28/2021] [Indexed: 12/29/2022] Open
Abstract
Secondary structure prediction (SSP) of proteins is an important structural biology technique with many applications. There have been ~300 algorithms published in the past seven decades with fierce competition in accuracy. In the first 60 years, the accuracy of three-state SSP rose from ~56% to 81%; after that, it has long stayed at 81-86%. In the 1990s, the theoretical limit of three-state SSP accuracy had been estimated to be 88%. Thus, SSP is now generally considered not challenging or too challenging to improve. However, we found that the limit of three-state SSP might be underestimated. Besides, there is still much room for improving segment-based and eight-state SSPs, but the limits of these emerging topics have not been determined. This work performs large-scale sequence and structural analyses to estimate SSP accuracy limits and assess state-of-the-art SSP methods. The limit of three-state SSP is re-estimated to be ~92%, 4-5% higher than previously expected, indicating that SSP is still challenging. The estimated limit of eight-state SSP is 84-87%. Several proposals for improving future SSP algorithms are made based on our results. We hope that these findings will help move forward the development of SSP and all its applications.
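The relationship between eight-state and three-state SSP can be made concrete with a reduction table. The mapping below (H, G, I to helix; E, B to strand; everything else to coil) is one widely used convention for DSSP codes and is shown here as an assumption; other reduction schemes exist, and the abstract does not say which one this study applied.

```python
# One common DSSP eight-state to three-state reduction (an assumed convention).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",  # alpha, 3-10 and pi helices
    "E": "E", "B": "E",            # extended strand and isolated bridge
    "T": "C", "S": "C", "-": "C",  # turn, bend and unassigned
}


def reduce_to_three_state(dssp):
    """Map a string of eight-state codes to H/E/C, treating unknown codes as coil."""
    return "".join(EIGHT_TO_THREE.get(state, "C") for state in dssp)


print(reduce_to_three_state("HGIEBTS-"))
```

Because the choice of reduction changes which residues count as helix or strand, three-state accuracies computed under different schemes are not directly comparable.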
Affiliation(s)
- Chia-Tzu Ho
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (C.-T.H.); (Y.-W.H.); (T.-R.C.); (C.-H.L.)
| | - Yu-Wei Huang
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (C.-T.H.); (Y.-W.H.); (T.-R.C.); (C.-H.L.)
| | - Teng-Ruei Chen
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (C.-T.H.); (Y.-W.H.); (T.-R.C.); (C.-H.L.)
| | - Chia-Hua Lo
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (C.-T.H.); (Y.-W.H.); (T.-R.C.); (C.-H.L.)
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
| | - Wei-Cheng Lo
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (C.-T.H.); (Y.-W.H.); (T.-R.C.); (C.-H.L.)
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
- The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
| |
|
25
|
Chen TR, Juan SH, Huang YW, Lin YC, Lo WC. A secondary structure-based position-specific scoring matrix applied to the improvement in protein secondary structure prediction. PLoS One 2021; 16:e0255076. [PMID: 34320027 PMCID: PMC8318245 DOI: 10.1371/journal.pone.0255076] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/21/2020] [Accepted: 07/11/2021] [Indexed: 11/18/2022]
Abstract
Protein secondary structure prediction (SSP) has a variety of applications; however, there has been relatively limited improvement in accuracy for years. With a vision of moving forward all related fields, we aimed to make a fundamental advance in SSP. Many admirable efforts have been made to improve the machine learning algorithms for SSP. This work instead took a step back and manipulated the input features. A secondary structure element-based position-specific scoring matrix (SSE-PSSM) is proposed, based on which a new set of machine learning features can be established. The feasibility of this new PSSM was evaluated by rigorous independent tests with training and testing datasets sharing <25% sequence identity. In all experiments, the proposed PSSM outperformed the traditional amino acid PSSM. This new PSSM can be easily combined with the amino acid PSSM, and the improvement in accuracy was remarkable. Preliminary tests made by combining the SSE-PSSM and well-known SSP methods showed 2.0% and 5.2% average improvements in three- and eight-state SSP accuracies, respectively. If this PSSM can be integrated into state-of-the-art SSP methods, the overall accuracy of SSP may break the current restriction and eventually benefit all research and applications where secondary structure prediction plays a vital role. To facilitate the application and integration of the SSE-PSSM with modern SSP methods, we have established a web server and standalone programs for generating SSE-PSSM, available at http://10.life.nctu.edu.tw/SSE-PSSM.
Affiliation(s)
- Teng-Ruei Chen
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| | - Sheng-Hung Juan
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| | - Yu-Wei Huang
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| | - Yen-Cheng Lin
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| | - Wei-Cheng Lo
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| |
|
26
|
Goodswen SJ, Kennedy PJ, Ellis JT. Predicting Protein Therapeutic Candidates for Bovine Babesiosis Using Secondary Structure Properties and Machine Learning. Front Genet 2021; 12:716132. [PMID: 34367264 PMCID: PMC8343536 DOI: 10.3389/fgene.2021.716132] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/28/2021] [Accepted: 06/28/2021] [Indexed: 12/02/2022]
Abstract
Bovine babesiosis causes significant annual global economic loss in the beef and dairy cattle industry. It is a disease caused by infection of red blood cells by haemoprotozoan parasites of the genus Babesia in the phylum Apicomplexa. Principal species are Babesia bovis, Babesia bigemina, and Babesia divergens. There is no subunit vaccine. Potential therapeutic targets against babesiosis include members of the exportome. This study investigates the novel use of protein secondary structure characteristics and machine learning algorithms to predict exportome membership probabilities. The premise of the approach is to detect characteristic differences that can help distinguish one protein type from another. Structural properties such as a protein’s local conformational classification states, backbone torsion angles ϕ (phi) and ψ (psi), solvent-accessible surface area, contact number, and half-sphere exposure are explored here as potential distinguishing protein characteristics. The presented methods that exploit these structural properties via machine learning are shown to have the capacity to distinguish exportome from non-exportome Babesia bovis proteins with an 86–92% accuracy (based on 10-fold cross validation and independent testing). These methods are encapsulated in freely available Linux pipelines set up for automated, high-throughput processing. Furthermore, proposed therapeutic candidates for laboratory investigation are provided for B. bovis, B. bigemina, and two other haemoprotozoan species, Babesia canis and Plasmodium falciparum.
Affiliation(s)
- Stephen J Goodswen
- School of Life Sciences, University of Technology Sydney, Ultimo, NSW, Australia
| | - Paul J Kennedy
- School of Computer Science, Faculty of Engineering and Information Technology and the Australian Artificial Intelligence Institute, University of Technology Sydney, Ultimo, NSW, Australia
| | - John T Ellis
- School of Life Sciences, University of Technology Sydney, Ultimo, NSW, Australia
| |
|
27
|
Bernhofer M, Dallago C, Karl T, Satagopam V, Heinzinger M, Littmann M, Olenyi T, Qiu J, Schütze K, Yachdav G, Ashkenazy H, Ben-Tal N, Bromberg Y, Goldberg T, Kajan L, O’Donoghue S, Sander C, Schafferhans A, Schlessinger A, Vriend G, Mirdita M, Gawron P, Gu W, Jarosz Y, Trefois C, Steinegger M, Schneider R, Rost B. PredictProtein - Predicting Protein Structure and Function for 29 Years. Nucleic Acids Res 2021; 49:W535-W540. [PMID: 33999203 PMCID: PMC8265159 DOI: 10.1093/nar/gkab354] [Citation(s) in RCA: 166] [Impact Index Per Article: 41.5] [Received: 02/23/2021] [Revised: 04/06/2021] [Accepted: 05/10/2021] [Indexed: 12/12/2022]
Abstract
Since 1992, PredictProtein (https://predictprotein.org) has been a one-stop online resource for protein sequence analysis, with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA binding). PredictProtein's infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering performance of prediction methods); user interface elements improved usability; and new prediction methods were added. PredictProtein recently included predictions from deep learning embeddings (GO and secondary structure) and a method for the prediction of proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.
Affiliation(s)
- Michael Bernhofer
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
- TUM Graduate School CeDoSIA, Boltzmannstr 11, 85748 Garching, Germany
| | - Christian Dallago
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
- TUM Graduate School CeDoSIA, Boltzmannstr 11, 85748 Garching, Germany
| | - Tim Karl
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
| | - Venkata Satagopam
- Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
- ELIXIR Luxembourg (ELIXIR-LU) Node, University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
| | - Michael Heinzinger
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
- TUM Graduate School CeDoSIA, Boltzmannstr 11, 85748 Garching, Germany
| | - Maria Littmann
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
- TUM Graduate School CeDoSIA, Boltzmannstr 11, 85748 Garching, Germany
| | - Tobias Olenyi
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
| | - Jiajun Qiu
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
- Department of Otolaryngology Head & Neck Surgery, The Ninth People's Hospital & Ear Institute, School of Medicine & Shanghai Key Laboratory of Translational Medicine on Ear and Nose Diseases, Shanghai Jiao Tong University, Shanghai, China
| | - Konstantin Schütze
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
| | - Guy Yachdav
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
| | - Haim Ashkenazy
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, 69978 Tel Aviv, Israel
| | - Nir Ben-Tal
- Department of Biochemistry & Molecular Biology, George S. Wise Faculty of Life Sciences, Tel Aviv University, 69978 Tel Aviv, Israel
| | - Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ 08901, USA
| | - Tatyana Goldberg
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
| | - Laszlo Kajan
- Roche Polska Sp. z o.o., Domaniewska 39B, 02–672 Warsaw, Poland
| | | | - Chris Sander
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Department of Cell Biology, Harvard Medical School, Boston, MA 02215, USA
- Broad Institute of MIT and Harvard, Boston, MA 02142, USA
| | - Andrea Schafferhans
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
- HSWT (Hochschule Weihenstephan Triesdorf | University of Applied Sciences), Department of Bioengineering Sciences, Am Hofgarten 10, 85354 Freising, Germany
| | - Avner Schlessinger
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | | | - Milot Mirdita
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
| | - Piotr Gawron
- Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
| | - Wei Gu
- Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
- ELIXIR Luxembourg (ELIXIR-LU) Node, University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
| | - Yohan Jarosz
- Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
- ELIXIR Luxembourg (ELIXIR-LU) Node, University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
| | - Christophe Trefois
- Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
- ELIXIR Luxembourg (ELIXIR-LU) Node, University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Seoul, South Korea
- Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
| | - Reinhard Schneider
- Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
- ELIXIR Luxembourg (ELIXIR-LU) Node, University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
| | - Burkhard Rost
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
- Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany
- TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
| |
|
28
|
Dallago C, Schütze K, Heinzinger M, Olenyi T, Littmann M, Lu AX, Yang KK, Min S, Yoon S, Morton JT, Rost B. Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets. Curr Protoc 2021; 1:e113. [PMID: 33961736 DOI: 10.1002/cpz1.113] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Indexed: 02/03/2023]
Abstract
Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time with respect to previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be leveraged as input features through machine learning libraries to develop methods predicting particular aspects of protein function and structure. Beyond the workflows included here, embeddings have been leveraged as proxies to traditional homology-based inference and even to align similar protein sequences. A wealth of possibilities remain for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC. 
The following protocols are included in this manuscript: Basic Protocol 1: Generic use of the bio_embeddings pipeline to plot protein sequences and annotations Basic Protocol 2: Generate embeddings from protein sequences using the bio_embeddings pipeline Basic Protocol 3: Overlay sequence annotations onto a protein space visualization Basic Protocol 4: Train a machine learning classifier on protein embeddings Alternate Protocol 1: Generate 3D instead of 2D visualizations Alternate Protocol 2: Visualize protein solubility instead of protein subcellular localization Support Protocol: Join embedding generation and sequence space visualization in a pipeline.
Affiliation(s)
- Christian Dallago
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching/Munich, Germany
| | - Konstantin Schütze
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany
| | - Michael Heinzinger
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching/Munich, Germany
| | - Tobias Olenyi
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany
| | - Maria Littmann
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching/Munich, Germany
| | - Amy X Lu
- Department of Computer Science, University of Toronto, Toronto, Canada & Vector Institute
| | - Kevin K Yang
- Microsoft Research New England, Cambridge, Massachusetts
| | - Seonwoo Min
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea
| | - Sungroh Yoon
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
| | - James T Morton
- Center for Computational Biology, Flatiron Institute, New York, New York
| | - Burkhard Rost
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany; Institute for Advanced Study (TUM-IAS), Garching/Munich, Germany; TUM School of Life Sciences Weihenstephan (WZW), Freising, Germany; Columbia University, Department of Biochemistry and Molecular Biophysics, New York, New York; New York Consortium on Membrane Protein Structure (NYCOMPS), New York, New York
| |
|
29
|
Slater O, Miller B, Kontoyianni M. Decoding Protein-protein Interactions: An Overview. Curr Top Med Chem 2021; 20:855-882. [PMID: 32101126 DOI: 10.2174/1568026620666200226105312] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 04/13/2019] [Revised: 11/27/2019] [Accepted: 11/27/2019] [Indexed: 12/24/2022]
Abstract
Drug discovery has focused on the paradigm "one drug, one target" for a long time. However, small molecules can act at multiple macromolecular targets, which serves as the basis for drug repurposing. In an effort to expand the target space, and given advances in X-ray crystallography, protein-protein interactions have become an emerging focus area of drug discovery enterprises. Proteins interact with other biomolecules and it is this intricate network of interactions that determines the behavior of the system and its biological processes. In this review, we briefly discuss networks in disease, followed by computational methods for protein-protein complex prediction. Computational methodologies and techniques employed towards objectives such as protein-protein docking, protein-protein interactions, and interface predictions are described extensively. Docking aims at producing a complex between proteins, while interface predictions identify a subset of residues on one protein that could interact with a partner, and protein-protein interaction sites address whether two proteins interact. In addition, approaches to predict hot spots and binding sites are presented along with a representative example of our internal project on the chemokine CXC receptor 3 B-isoform and predictive modeling with IP10 and PF4.
Affiliation(s)
- Olivia Slater
- Department of Pharmaceutical Sciences, Southern Illinois University, Edwardsville, IL 62026, United States
| | - Bethany Miller
- Department of Pharmaceutical Sciences, Southern Illinois University, Edwardsville, IL 62026, United States
| | - Maria Kontoyianni
- Department of Pharmaceutical Sciences, Southern Illinois University, Edwardsville, IL 62026, United States
| |
|
30
|
Zhao B, Katuwawala A, Oldfield CJ, Dunker AK, Faraggi E, Gsponer J, Kloczkowski A, Malhis N, Mirdita M, Obradovic Z, Söding J, Steinegger M, Zhou Y, Kurgan L. DescribePROT: database of amino acid-level protein structure and function predictions. Nucleic Acids Res 2021; 49:D298-D308. [PMID: 33119734 PMCID: PMC7778963 DOI: 10.1093/nar/gkaa931] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Received: 08/14/2020] [Revised: 09/11/2020] [Accepted: 10/05/2020] [Indexed: 12/30/2022]
Abstract
We present DescribePROT, the database of predicted amino acid-level descriptors of structure and function of proteins. DescribePROT delivers a comprehensive collection of 13 complementary descriptors predicted using 10 popular and accurate algorithms for 83 complete proteomes that cover key model organisms. The current version includes 7.8 billion predictions for close to 600 million amino acids in 1.4 million proteins. The descriptors encompass sequence conservation, position specific scoring matrix, secondary structure, solvent accessibility, intrinsic disorder, disordered linkers, signal peptides, MoRFs and interactions with proteins, DNA and RNAs. Users can search DescribePROT by the amino acid sequence and the UniProt accession number and entry name. The pre-computed results are made available instantaneously. The predictions can be accessed via an interactive graphical interface that allows simultaneous analysis of multiple descriptors and can also be downloaded in structured formats at the protein, proteome and whole database scale. The putative annotations included in DescribePROT are useful for a broad range of studies, including investigations of protein function, applied projects focusing on therapeutics and diseases, and the development of predictors for other protein sequence descriptors. Future releases will expand the coverage of DescribePROT. DescribePROT can be accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.
Affiliation(s)
- Bi Zhao
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Akila Katuwawala
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | | | - A Keith Dunker
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, USA
| | - Eshel Faraggi
- Battelle Center for Mathematical Medicine at the Nationwide Children's Hospital, and Department of Pediatrics, The Ohio State University, Columbus, OH, USA
| | - Jörg Gsponer
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
| | - Andrzej Kloczkowski
- Battelle Center for Mathematical Medicine at the Nationwide Children's Hospital, and Department of Pediatrics, The Ohio State University, Columbus, OH, USA
| | - Nawar Malhis
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
| | - Milot Mirdita
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
| | - Zoran Obradovic
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Johannes Söding
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
| | - Martin Steinegger
- School of Biological Sciences and Institute of Molecular Biology & Genetics, Seoul National University, Seoul, Republic of Korea
| | - Yaoqi Zhou
- Institute for Glycomics, Griffith University, Gold Coast, Queensland, Australia
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| |
|
31
|
Miller-Vedam LE, Bräuning B, Popova KD, Schirle Oakdale NT, Bonnar JL, Prabu JR, Boydston EA, Sevillano N, Shurtleff MJ, Stroud RM, Craik CS, Schulman BA, Frost A, Weissman JS. Structural and mechanistic basis of the EMC-dependent biogenesis of distinct transmembrane clients. eLife 2020; 9:e62611. [PMID: 33236988 PMCID: PMC7785296 DOI: 10.7554/elife.62611] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Received: 08/30/2020] [Accepted: 11/17/2020] [Indexed: 12/11/2022]
Abstract
Membrane protein biogenesis in the endoplasmic reticulum (ER) is complex and failure-prone. The ER membrane protein complex (EMC), comprising eight conserved subunits, has emerged as a central player in this process. Yet, we have limited understanding of how EMC enables insertion and integrity of diverse clients, from tail-anchored to polytopic transmembrane proteins. Here, yeast and human EMC cryo-EM structures reveal conserved intricate assemblies and human-specific features associated with pathologies. Structure-based functional studies distinguish between two separable EMC activities, as an insertase regulating tail-anchored protein levels and a broader role in polytopic membrane protein biogenesis. These depend on mechanistically coupled yet spatially distinct regions including two lipid-accessible membrane cavities which confer client-specific regulation, and a non-insertase EMC function mediated by the EMC lumenal domain. Our studies illuminate the structural and mechanistic basis of EMC's multifunctionality and point to its role in differentially regulating the biogenesis of distinct client protein classes.
Affiliation(s)
- Lakshmi E Miller-Vedam
- Molecular, Cellular, and Computational Biophysics Graduate Program, University of California, San Francisco, San Francisco, United States
- Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, United States
- Department of Biology, Whitehead Institute, MIT, Cambridge, United States
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, United States
| | - Bastian Bräuning
- Department of Molecular Machines and Signaling, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Katerina D Popova
- Department of Biology, Whitehead Institute, MIT, Cambridge, United States
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, United States
- Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, United States
| | - Nicole T Schirle Oakdale
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, United States
| | - Jessica L Bonnar
- Department of Biology, Whitehead Institute, MIT, Cambridge, United States
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, United States
| | - Jesuraj R Prabu
- Department of Molecular Machines and Signaling, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Elizabeth A Boydston
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, United States
| | - Natalia Sevillano
- Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States
| | - Matthew J Shurtleff
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, United States
| | - Robert M Stroud
- Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, United States
| | - Charles S Craik
- Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States
| | - Brenda A Schulman
- Department of Molecular Machines and Signaling, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Adam Frost
- Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, United States
| | - Jonathan S Weissman
- Department of Biology, Whitehead Institute, MIT, Cambridge, United States
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, United States
- Howard Hughes Medical Institute, Chevy Chase, United States
| |
|
32
|
Enhancing protein backbone angle prediction by using simpler models of deep neural networks. Sci Rep 2020; 10:19430. [PMID: 33173130 PMCID: PMC7655839 DOI: 10.1038/s41598-020-76317-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/06/2020] [Accepted: 10/23/2020] [Indexed: 11/09/2022]
Abstract
Protein structure prediction is a grand challenge. Prediction of protein structures via representations using backbone dihedral angles has recently achieved significant progress along with the ongoing surge of deep neural network (DNN) research in general. However, we observe that in protein backbone angle prediction research, there is an overall trend to employ more and more complex neural networks and then to throw more and more features at them. While more features might add predictive power to a neural network, we argue that redundant features could instead clutter the input, so that the more complex networks merely counterbalance the resulting noise. From artificial intelligence and machine learning perspectives, problem representations and solution approaches interact and thus affect performance. We also argue that comparatively simpler predictors can be reconstructed more easily than more complex ones. With these arguments in mind, we present a deep learning method named Simpler Angle Predictor (SAP) to train simpler DNN models that enhance protein backbone angle prediction. We then empirically show that SAP significantly outperforms existing state-of-the-art methods on well-known benchmark datasets: for some types of angles, the differences are above 3 in mean absolute error (MAE). The SAP program along with its data is available from the website https://gitlab.com/mahnewton/sap.
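As background for the MAE figure quoted above: backbone dihedral angles are periodic, so an error measure for them is usually computed on the shorter arc between predicted and observed values (179° and -179° differ by 2°, not 358°). The following is a minimal sketch under that assumption, with angles in degrees. The function name and the example values are hypothetical, and this listing does not state how SAP itself computes its error.

```python
def angular_mae(predicted, observed):
    """Mean absolute error between two lists of angles (degrees),
    using the shorter arc so that wrap-around at +/-180 is handled."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("angle lists must not be empty")
    total = 0.0
    for p, o in zip(predicted, observed):
        diff = abs(p - o) % 360.0            # raw difference folded into [0, 360)
        total += min(diff, 360.0 - diff)     # shorter way around the circle
    return total / len(observed)


# Hypothetical phi angles for three residues.
phi_predicted = [-60.0, 179.0, -120.0]
phi_observed = [-65.0, -179.0, -110.0]

mae = angular_mae(phi_predicted, phi_observed)
print(f"angular MAE: {mae:.2f} degrees")
```

A naive absolute difference would score the second residue as a 358° error and dominate the average, which is why the wrap-around step matters when comparing angle predictors.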
|
33
|
Skolnick J, Gao M. The role of local versus nonlocal physicochemical restraints in determining protein native structure. Curr Opin Struct Biol 2020; 68:1-8. [PMID: 33129066 DOI: 10.1016/j.sbi.2020.10.008] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 09/17/2020] [Revised: 10/03/2020] [Accepted: 10/05/2020] [Indexed: 12/15/2022]
Abstract
The tertiary structure of a native protein is dictated by the interplay of local secondary structure propensities, hydrogen bonding, and tertiary interactions. It is argued that the space of known protein topologies covers all single domain folds and results from the compactness of the native structure and excluded volume. Protein compactness combined with the chirality of the protein's side chains also yields native-like Ramachandran plots. It is the many-body, tertiary interactions among residues that collectively select for the global structure that a particular protein sequence adopts. This explains why the recent advances in deep-learning approaches that predict protein side-chain contacts, the distance matrix between residues, and sequence alignments are successful. They succeed because they implicitly learned the many-body interactions among protein residues.
Affiliation(s)
- Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, 950 Atlantic Drive, NW, Atlanta, GA 30332, United States.
| | - Mu Gao
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, 950 Atlantic Drive, NW, Atlanta, GA 30332, United States.
| |
|
34
|
Qiu J, Nechaev D, Rost B. Protein-protein and protein-nucleic acid binding residues important for common and rare sequence variants in human. BMC Bioinformatics 2020; 21:452. [PMID: 33050876 PMCID: PMC7557062 DOI: 10.1186/s12859-020-03759-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/01/2020] [Accepted: 09/16/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Any two unrelated people differ by about 20,000 missense mutations (also referred to as SAVs: Single Amino acid Variants or missense SNV). Many SAVs have been predicted to strongly affect molecular protein function. Common SAVs (> 5% of population) were predicted to have, on average, more effect on molecular protein function than rare SAVs (< 1% of population). We hypothesized that the prevalence of effect in common over rare SAVs might partially be caused by common SAVs more often occurring at interfaces of proteins with other proteins, DNA, or RNA, thereby creating subgroup-specific phenotypes. We analyzed SAVs from 60,706 people through the lens of two prediction methods, one (SNAP2) predicting the effects of SAVs on molecular protein function, the other (ProNA2020) predicting residues in DNA-, RNA- and protein-binding interfaces. RESULTS Three results stood out. Firstly, SAVs predicted to occur at binding interfaces were predicted to more likely affect molecular function than those predicted as not binding (p value < 2.2 × 10⁻¹⁶). Secondly, for SAVs predicted to occur at binding interfaces, common SAVs were predicted more strongly with effect on protein function than rare SAVs (p value < 2.2 × 10⁻¹⁶). Restriction to SAVs with experimental annotations confirmed all results, although the resulting subsets were too small to establish statistical significance for any result. Thirdly, the fraction of SAVs predicted at binding interfaces differed significantly between tissues, e.g. urinary bladder tissue was found abundant in SAVs predicted at protein-binding interfaces, and reproductive tissues (ovary, testis, vagina, seminal vesicle and endometrium) in SAVs predicted at DNA-binding interfaces. CONCLUSIONS Overall, the results suggested that residues at protein-, DNA-, and RNA-binding interfaces contributed toward predicting that common SAVs more likely affect molecular function than rare SAVs.
Affiliation(s)
- Jiajun Qiu
- Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), 85748, Garching, Germany; Biobank of Ninth People's Hospital, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200125, China.
| | - Dmitrii Nechaev
- Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), 85748, Garching, Germany.
| | - Burkhard Rost
- Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany; Institute of Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching, Munich, Germany; Institute for Food and Plant Sciences (WZW) Weihenstephan, Alte Akademie 8, 85354, Freising, Germany.
| |
|
35
|
de Brevern AG. Impact of protein dynamics on secondary structure prediction. Biochimie 2020; 179:14-22. [PMID: 32946990 DOI: 10.1016/j.biochi.2020.09.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/21/2020] [Revised: 09/04/2020] [Accepted: 09/10/2020] [Indexed: 02/08/2023]
Abstract
Protein 3D structures support their biological functions. As the number of protein structures is negligible compared with the number of available protein sequences, prediction methodologies relying only on protein sequences are essential tools. In this field, protein secondary structure prediction (PSSP) is a mature area, and is considered to have reached a plateau. Nonetheless, proteins are highly dynamical macromolecules, a property that could impact the PSSP methods. Indeed, in a previous study, the stability of local protein conformations was evaluated, demonstrating that some regions easily changed to another type of secondary structure. The protein sequences of this dataset were used by PSSPs and their results compared to molecular dynamics to investigate their potential impact on the quality of the secondary structure prediction. Interestingly, a direct link is observed between the quality of the prediction and the stability of the assignment to the secondary structure state. The more stable a local protein conformation is, the better the prediction will be. Taking the secondary structure assignment not from the crystallized structures but from the conformations observed during the dynamics slightly increases the quality of the secondary structure prediction. These results show that evaluation of PSSPs can be done differently, but also that the notion of dynamics can be included in the development of PSSPs and other approaches such as de novo approaches.
Affiliation(s)
- Alexandre G de Brevern
- Biologie Intégrée Du Globule Rouge UMR_S1134, Inserm, Université de Paris, Univ. de la Réunion, Univ. des Antilles, F-75739, Paris, France; Laboratoire D'Excellence GR-Ex, F-75739, Paris, France; Institut National de la Transfusion Sanguine (INTS), F-75739, Paris, France; IBL, F-75015, Paris, France.
| |
|
36
|
Guo Z, Hou J, Cheng J. DNSS2: Improved ab initio protein secondary structure prediction using advanced deep learning architectures. Proteins 2020; 89:207-217. [PMID: 32893403 DOI: 10.1002/prot.26007] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 03/05/2020] [Revised: 07/07/2020] [Accepted: 09/02/2020] [Indexed: 12/27/2022]
Abstract
Accurate prediction of protein secondary structure (alpha-helix, beta-strand and coil) is a crucial step for protein inter-residue contact prediction and ab initio tertiary structure prediction. In a previous study, we developed a deep belief network-based protein secondary structure method (DNSS1) and successfully advanced the prediction accuracy beyond 80%. In this work, we developed multiple advanced deep learning architectures (DNSS2) to further improve secondary structure prediction. The major improvements over the DNSS1 method include (a) designing and integrating six advanced one-dimensional deep convolutional/recurrent/residual/memory/fractal/inception networks to predict 3-state and 8-state secondary structure, and (b) using more sensitive profile features inferred from Hidden Markov model (HMM) and multiple sequence alignment (MSA). Most of the deep learning architectures are novel for protein secondary structure prediction. DNSS2 was systematically benchmarked on independent test data sets with eight state-of-art tools and consistently ranked as one of the best methods. Particularly, DNSS2 was tested on the protein targets of 2018 CASP13 experiment and achieved the Q3 score of 81.62%, SOV score of 72.19%, and Q8 score of 73.28%. DNSS2 is freely available at: https://github.com/multicom-toolbox/DNSS2.
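The Q3 and Q8 scores quoted in this abstract are per-residue accuracies over 3-state and 8-state secondary structure labels. The following is a minimal sketch of how such scores can be computed, not the DNSS2 evaluation code. The function names and the example strings are hypothetical, and the 8-to-3 state reduction (H, G, I to helix; E, B to strand; everything else to coil) is one commonly used convention assumed here, since the abstract does not state which mapping was applied.

```python
# Assumed 8-state to 3-state reduction; other conventions exist.
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",            # helical states -> helix
    "E": "E", "B": "E",                      # strand/bridge -> strand
    "T": "C", "S": "C", "C": "C", "-": "C",  # remaining states -> coil
}


def reduce_to_3_state(ss8: str) -> str:
    """Map an 8-state label string onto the 3-state alphabet (H, E, C)."""
    return "".join(DSSP8_TO_3[state] for state in ss8)


def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted label equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and equally long")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical example labels, for illustration only.
observed_ss8 = "CHHHHGGTEEEEC"
predicted_ss8 = "CHHHHHHTEEEBC"

q8 = per_residue_accuracy(predicted_ss8, observed_ss8)
q3 = per_residue_accuracy(
    reduce_to_3_state(predicted_ss8), reduce_to_3_state(observed_ss8)
)
print(f"Q8 = {q8:.2f}%, Q3 = {q3:.2f}%")
```

Because several 8-state labels collapse onto the same 3-state label, Q3 is never lower than Q8 for the same prediction under this mapping, which is consistent with the gap between the two scores reported above.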
Affiliation(s)
- Zhiye Guo
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Jie Hou
- Department of Computer Science, Saint Louis University, St. Louis, Missouri, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| |
|
37
|
Vermeyen T, Merten C. Solvation and the secondary structure of a proline-containing dipeptide: insights from VCD spectroscopy. Phys Chem Chem Phys 2020; 22:15640-15648. [PMID: 32617548 DOI: 10.1039/d0cp02283g] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Indexed: 12/12/2022]
Abstract
In this study we investigate the IR and VCD spectra of the diastereomeric dipeptide Boc-Pro-Phe-(n-propyl) 1 in chloroform-d1 (CDCl3) and the strongly hydrogen bonding solvent dimethylsulfoxide-d6 (DMSO-d6). From comparison of the experimental spectra, the amide II spectral region is identified as marker signature for the stereochemistry of the dipeptide: the homochiral LL-1 features a (+/-)-pattern in the amide II region of the VCD spectrum, while the amide II signature of the diastereomer LD-1 is inverted. Computational analysis of the IR and VCD spectra of LL-1 reveals that the experimentally observed amide II signature is characteristic for a βI-turn structure of the peptide. Likewise, the inverted pattern found for LD-1 arises from a βII-turn structure of the dipeptide. Following a micro-solvation approach, the experimental spectra recorded in DMSO-d6 are computationally well reproduced by considering only a single solvent molecule in a hydrogen bond with N-H groups. Considering a second solvent molecule, which would lead to a cleavage of intramolecular hydrogen bonds in 1, is found to give a significantly worse match with the experiment. Hence, the detailed computational analysis of the spectra of LL- and LD-1 recorded in DMSO-d6 confirms that the intramolecular hydrogen bonding pattern, that stabilizes the β-turns and other conformations of LL- and LD-1 in apolar solvents, remains intact. Our findings also show that it is essential to consider solvation explicitly in the analysis of the IR and VCD spectra of dipeptides in strongly hydrogen bonding solvents. As the solute-solvent interactions affect both conformational preferences and spectral signatures, it is also demonstrated that this inclusion of solvent molecules cannot be circumvented by applying fitting procedures to non-solvated structures.
Affiliation(s)
- Tom Vermeyen
- Ruhr-Universität Bochum, Fakultät für Chemie und Biochemie, Organische Chemie II, Universitätsstraße 150, 44801 Bochum, Germany, and University of Antwerp, Department of Chemistry, MolSpec Group, Groenenborgerlaan 171, 2020 Antwerp, Belgium
| | - Christian Merten
- Ruhr-Universität Bochum, Fakultät für Chemie und Biochemie, Organische Chemie II, Universitätsstraße 150, 44801 Bochum, Germany.
| |
|
38
|
Xu G, Wang Q, Ma J. OPUS-TASS: a protein backbone torsion angles and secondary structure predictor based on ensemble neural networks. Bioinformatics 2020; 36:5021-5026. [DOI: 10.1093/bioinformatics/btaa629] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 05/01/2020] [Revised: 06/25/2020] [Accepted: 07/10/2020] [Indexed: 11/13/2022] Open
Abstract
Motivation
Predictions of protein backbone torsion angles (ϕ and ψ) and secondary structure from sequence are crucial subproblems in protein structure prediction. With the development of deep learning approaches, their accuracies have been significantly improved. To capture long-range interactions, most studies integrate bidirectional recurrent neural networks into their models. In this study, we introduce and modify a recently proposed architecture named Transformer to capture the interactions between any two residues, in principle at arbitrary distance. Moreover, we take advantage of multitask learning to improve the generalization of the neural network by introducing related tasks into the training process. Similar to many previous studies, OPUS-TASS uses an ensemble of models and achieves better results.
Results
OPUS-TASS uses the same training and validation sets as SPOT-1D. We compare the performance of OPUS-TASS and SPOT-1D on TEST2016 (1213 proteins) and TEST2018 (250 proteins) proposed in the SPOT-1D paper, CASP12 (55 proteins), CASP13 (32 proteins) and CASP-FM (56 proteins) proposed in the SAINT paper, and a recently released PDB structure collection from CAMEO (93 proteins) named as CAMEO93. On these six test sets, OPUS-TASS achieves consistent improvements in both backbone torsion angles prediction and secondary structure prediction. On CAMEO93, SPOT-1D achieves the mean absolute errors of 16.89 and 23.02 for ϕ and ψ predictions, respectively, and the accuracies for 3- and 8-state secondary structure predictions are 87.72 and 77.15%, respectively. In comparison, OPUS-TASS achieves 16.56 and 22.56 for ϕ and ψ predictions, and 89.06 and 78.87% for 3- and 8-state secondary structure predictions, respectively. In particular, after using our torsion angles refinement method OPUS-Refine as the post-processing procedure for OPUS-TASS, the mean absolute errors for final ϕ and ψ predictions are further decreased to 16.28 and 21.98, respectively.
Availability and implementation
The training and the inference codes of OPUS-TASS and its data are available at https://github.com/thuxugang/opus_tass.
Supplementary information
Supplementary data are available at Bioinformatics online.
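The mean absolute errors reported for ϕ and ψ are errors on periodic quantities, so a naive absolute difference would overstate the error whenever a prediction and an observation fall on opposite sides of the ±180° boundary. The sketch below shows one common way to handle this, taking the smaller of the two arc lengths between the angles. It is an illustration under that assumption, not the OPUS-TASS evaluation code, and the function name and example values are hypothetical.

```python
def angular_mae(predicted_deg, observed_deg):
    """Mean absolute error between two lists of angles given in degrees.

    Each per-residue error is the shorter arc between the two angles,
    so it always lies in the range [0, 180].
    """
    if len(predicted_deg) != len(observed_deg) or not observed_deg:
        raise ValueError("angle lists must be non-empty and equally long")
    total = 0.0
    for predicted, observed in zip(predicted_deg, observed_deg):
        difference = abs(predicted - observed) % 360.0
        total += min(difference, 360.0 - difference)
    return total / len(observed_deg)


# Hypothetical example values; the second pair straddles the +/-180 boundary.
phi_predicted = [-60.0, 175.0, -120.0]
phi_observed = [-65.0, -178.0, -100.0]
print(f"phi MAE = {angular_mae(phi_predicted, phi_observed):.2f} degrees")
```

In the example, the second residue contributes an error of 7° rather than 353°, which is the behaviour the wrap-around handling is meant to guarantee.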
Affiliation(s)
- Gang Xu
- Multiscale Research Institute of Complex Systems, Fudan University, Shanghai 200433, China
| | - Qinghua Wang
- Verna and Marrs Mclean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | - Jianpeng Ma
- Multiscale Research Institute of Complex Systems, Fudan University, Shanghai 200433, China
- Verna and Marrs Mclean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
- Department of Bioengineering, Rice University, Houston, TX 77030, USA
| |
|
39
|
Shapovalov M, Dunbrack RL, Vucetic S. Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction. PLoS One 2020; 15:e0232528. [PMID: 32374785 PMCID: PMC7202669 DOI: 10.1371/journal.pone.0232528] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/24/2019] [Accepted: 04/16/2020] [Indexed: 11/30/2022] Open
Abstract
Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.
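An input window of ±14 amino acids means that each residue is presented to the network together with its 14 neighbours on either side, giving 29 positions per example. The sketch below illustrates the windowing step only. How SecNet treats residues near the chain termini is not stated in the abstract, so the padding symbol `X` and the function name are assumptions made for illustration.

```python
def residue_windows(sequence: str, half_width: int = 14, pad: str = "X"):
    """Return one window of length 2 * half_width + 1 centred on each residue.

    Positions beyond either terminus are filled with the pad symbol
    (an assumed convention, not taken from the SecNet paper).
    """
    padded = pad * half_width + sequence + pad * half_width
    window_length = 2 * half_width + 1
    return [padded[i:i + window_length] for i in range(len(sequence))]


# Small half-width so the windows are easy to read.
for window in residue_windows("MKVLA", half_width=2):
    print(window)
```

With the default `half_width=14`, every window has 29 positions and the central character is the residue being classified.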
Affiliation(s)
- Maxim Shapovalov
- Fox Chase Cancer Center, Philadelphia, PA, United States of America
- Temple University, Philadelphia, PA, United States of America
| | | | | |
|
40
|
Pritam M, Singh G, Swaroop S, Singh AK, Pandey B, Singh SP. A cutting-edge immunoinformatics approach for design of multi-epitope oral vaccine against dreadful human malaria. Int J Biol Macromol 2020; 158:159-179. [PMID: 32360460 PMCID: PMC7189201 DOI: 10.1016/j.ijbiomac.2020.04.191] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 01/02/2020] [Revised: 03/28/2020] [Accepted: 04/22/2020] [Indexed: 12/18/2022]
Abstract
Human malaria is a pathogenic disease mainly caused by Plasmodium falciparum, which was responsible for about 405,000 deaths globally in the year 2018. To date, several vaccine candidates have been evaluated for prevention, but they failed to produce optimal output at various preclinical/clinical stages. This study is based on the design of polypeptide vaccines (PVs) against human malaria that cover almost all stages of the life cycle of Plasmodium, for which 5 genome-derived predicted antigenic proteins (GDPAP) were used. For the development of a multi-immune inducer, 15 PVs were initially designed using a T-cell epitope ensemble, which covered >99% of the human population, as well as linear B-cell epitopes with or without adjuvants. The immune simulation of PVs showed higher levels of T-cell and B-cell activities compared to positive and negative vaccine controls. Furthermore, in silico cloning of PVs and codon optimization followed by enhanced expression within the Lactococcus lactis host system was also explored. Although the study has sound theoretical and in silico findings, in vitro/in vivo evaluation seems imperative to warrant the immunogenicity and safety of PVs towards management of P. falciparum infection in the future.
Affiliation(s)
- Manisha Pritam
- Amity Institute of Biotechnology, Amity University Uttar Pradesh, Lucknow Campus, Lucknow 226028, India
| | - Garima Singh
- Amity Institute of Biotechnology, Amity University Uttar Pradesh, Lucknow Campus, Lucknow 226028, India
| | - Suchit Swaroop
- Experimental & Public Health Lab, Department of Zoology, University of Lucknow, Lucknow 226007, India
| | - Akhilesh Kumar Singh
- Department of Biotechnology, Mahatma Gandhi Central University, Bihar 845401, India
| | - Brijesh Pandey
- Department of Biotechnology, Mahatma Gandhi Central University, Bihar 845401, India
| | | |
|
41
|
Veevers R, Cawley G, Hayward S. Investigation of sequence features of hinge-bending regions in proteins with domain movements using kernel logistic regression. BMC Bioinformatics 2020; 21:137. [PMID: 32272894 PMCID: PMC7147021 DOI: 10.1186/s12859-020-3464-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2019] [Accepted: 03/20/2020] [Indexed: 11/12/2022] Open
Abstract
Background Hinge-bending movements in proteins comprising two or more domains form a large class of functional movements. Hinge-bending regions demarcate protein domains and collectively control the domain movement. Consequently, the ability to recognise sequence features of hinge-bending regions and to be able to predict them from sequence alone would benefit various areas of protein research. For example, an understanding of how the sequence features of these regions relate to dynamic properties in multi-domain proteins would aid in the rational design of linkers in therapeutic fusion proteins. Results The DynDom database of protein domain movements comprises sequences annotated to indicate whether the amino acid residue is located within a hinge-bending region or within an intradomain region. Using statistical methods and Kernel Logistic Regression (KLR) models, this data was used to determine sequence features that favour or disfavour hinge-bending regions. This is a difficult classification problem as the number of negative cases (intradomain residues) is much larger than the number of positive cases (hinge residues). The statistical methods and the KLR models both show that cysteine has the lowest propensity for hinge-bending regions and proline has the highest, even though it is the most rigid amino acid. As hinge-bending regions have been previously shown to occur frequently at the terminal regions of the secondary structures, the propensity for proline at these regions is likely due to its tendency to break secondary structures. The KLR models also indicate that isoleucine may act as a domain-capping residue. We have found that a quadratic KLR model outperforms a linear KLR model and that improvement in performance occurs up to very long window lengths (eighty residues) indicating long-range correlations. 
Conclusion In contrast to the only other approach that focused solely on interdomain hinge-bending regions, the method provides a modest and statistically significant improvement over a random classifier. An explanation of the KLR results is that in the prediction of hinge-bending regions a long-range correlation is at play between a small number of amino acids that either favour or disfavour hinge-bending regions. The resulting sequence-based prediction tool, HingeSeek, is available to run through a webserver at hingeseek.cmp.uea.ac.uk.
Affiliation(s)
- Ruth Veevers
- Computational Biology Laboratory, School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK
| | - Gavin Cawley
- Computational Biology Laboratory, School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK.
| | - Steven Hayward
- Computational Biology Laboratory, School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK.
| |
|
42
|
Smolarczyk T, Roterman-Konieczna I, Stapor K. Protein Secondary Structure Prediction: A Review of Progress and Directions. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191017104639] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Indexed: 11/22/2022]
Abstract
Background:
Over the last few decades, a search for the theory of protein folding has grown into a full-fledged research field at the intersection of biology, chemistry and informatics. Despite enormous effort, there are still open questions and challenges, like understanding the rules by which amino acid sequence determines protein secondary structure.
Objective:
In this review, we depict the progress of the prediction methods over the years and identify sources of improvement.
Methods:
The protein secondary structure prediction problem is described followed by the discussion on theoretical limitations, description of the commonly used data sets, features and a review of three generations of methods with the focus on the most recent advances. Additionally, methods with available online servers are assessed on the independent data set.
Results:
The state-of-the-art methods are currently reaching almost 88% for 3-class prediction and 76.5% for an 8-class prediction.
Conclusion:
This review summarizes recent advances and outlines further research directions.
Affiliation(s)
- Tomasz Smolarczyk
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| | - Irena Roterman-Konieczna
- Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Krakow, Poland
| | - Katarzyna Stapor
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| |
|
43
|
The Order-Disorder Continuum: Linking Predictions of Protein Structure and Disorder through Molecular Simulation. Sci Rep 2020; 10:2068. [PMID: 32034199 PMCID: PMC7005769 DOI: 10.1038/s41598-020-58868-w] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 07/24/2019] [Accepted: 10/16/2019] [Indexed: 12/11/2022] Open
Abstract
Intrinsically disordered proteins (IDPs) and intrinsically disordered regions within proteins (IDRs) serve an increasingly expansive list of biological functions, including regulation of transcription and translation, protein phosphorylation, cellular signal transduction, as well as mechanical roles. The strong link between protein function and disorder motivates a deeper fundamental characterization of IDPs and IDRs for discovering new functions and relevant mechanisms. We review recent advances in experimental techniques that have improved identification of disordered regions in proteins. Yet, experimentally curated disorder information still does not currently scale to the level of experimentally determined structural information in folded protein databases, and disorder predictors rely on several different binary definitions of disorder. To link secondary structure prediction algorithms developed for folded proteins and protein disorder predictors, we conduct molecular dynamics simulations on representative proteins from the Protein Data Bank, comparing secondary structure and disorder predictions with simulation results. We find that structure predictor performance from neural networks can be leveraged for the identification of highly dynamic regions within molecules, linked to disorder. Low accuracy structure predictions suggest a lack of static structure for regions that disorder predictors fail to identify. While disorder databases continue to expand, secondary structure predictors and molecular simulations can improve disorder predictor performance, which aids discovery of novel functions of IDPs and IDRs. These observations provide a platform for the development of new, integrated structural databases and fusion of prediction tools toward protein disorder characterization in health and disease.
|
44
|
Torrisi M, Pollastri G, Le Q. Deep learning methods in protein structure prediction. Comput Struct Biotechnol J 2020; 18:1301-1310. [PMID: 32612753 PMCID: PMC7305407 DOI: 10.1016/j.csbj.2019.12.011] [Citation(s) in RCA: 132] [Impact Index Per Article: 26.4] [Received: 10/15/2019] [Revised: 12/19/2019] [Accepted: 12/20/2019] [Indexed: 01/01/2023] Open
Abstract
Protein Structure Prediction is a central topic in Structural Bioinformatics. Since the '60s statistical methods, followed by increasingly complex Machine Learning and recently Deep Learning methods, have been employed to predict protein structural information at various levels of detail. In this review, we briefly introduce the problem of protein structure prediction and essential elements of Deep Learning (such as Convolutional Neural Networks, Recurrent Neural Networks and basic feed-forward Neural Networks they are founded on), after which we discuss the evolution of predictive methods for one-dimensional and two-dimensional Protein Structure Annotations, from the simple statistical methods of the early days, to the computationally intensive highly-sophisticated Deep Learning algorithms of the last decade. In the process, we review the growth of the databases these algorithms are based on, and how this has impacted our ability to leverage knowledge about evolution and co-evolution to achieve improved predictions. We conclude this review outlining the current role of Deep Learning techniques within the wider pipelines to predict protein structures and trying to anticipate what challenges and opportunities may arise next.
Affiliation(s)
- Mirko Torrisi
- School of Computer Science, University College Dublin, Ireland
| | | | - Quan Le
- Centre for Applied Data Analytics Research, University College Dublin, Ireland
| |
|
45
|
Abstract
In bottom-up proteomics, proteins are typically identified by enzymatic digestion into peptides, tandem mass spectrometry and comparison of the tandem mass spectra with those predicted from a sequence database for peptides within measurement uncertainty from the experimentally obtained mass. Although now decreasingly common, isolated proteins or simple protein mixtures can also be identified by measuring only the masses of the peptides resulting from the enzymatic digest, without any further fragmentation. Separation methods such as liquid chromatography and electrophoresis are often used to fractionate complex protein or peptide mixtures prior to analysis by mass spectrometry. Although the primary reason for this is to avoid ion suppression and improve data quality, these separations are based on physical and chemical properties of the peptides or proteins and therefore also provide information about them. Depending on the separation method, this could be protein molecular weight (SDS-PAGE), isoelectric point (IEF), charge at a known pH (ion exchange chromatography), or hydrophobicity (reversed phase chromatography). These separations produce approximate measurements on properties that to some extent can be predicted from amino acid sequences. In the case of molecular weight of proteins without posttranslational modifications this is straightforward: simply add the molecular weights of the amino acid residues in the protein. For IEF, charge and hydrophobicity, the order of the amino acids, and folding state of the peptide or protein also matter, but it is nevertheless possible to predict the behavior of peptides and proteins in these separation methods to a degree which renders such predictions useful. 
This chapter reviews the topic of using data from separation methods for identification and validation in proteomics, with special emphasis on predicting retention times of tryptic peptides in reversed-phase chromatography under acidic conditions, as this is one of the most commonly used separation methods in bottom-up proteomics.
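The molecular weight calculation described in the abstract above (adding the masses of the amino acid residues) can be sketched as follows. This is a minimal illustration, not code from the chapter: the function name `protein_molecular_weight` is hypothetical, the table holds commonly tabulated average residue masses in daltons, and one water is added for the free N- and C-termini. Posttranslational modifications are ignored, as in the unmodified case the abstract refers to.

```python
# Average residue masses in Da (free amino acid minus one water).
# Commonly tabulated values; treat the last digits as approximate.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # accounts for the terminal H and OH


def protein_molecular_weight(sequence: str) -> float:
    """Return the average molecular weight (Da) of an unmodified chain."""
    sequence = sequence.strip().upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    unknown = set(sequence) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    # Sum the residue masses, then add one water for the chain termini.
    return sum(AVERAGE_RESIDUE_MASS[r] for r in sequence) + WATER_MASS


if __name__ == "__main__":
    # Hypothetical tryptic peptide, used only as an example input.
    print(round(protein_molecular_weight("SAMPLER"), 2))
```

The other properties named in the abstract (isoelectric point, charge, hydrophobicity) depend on residue order and folding state, so they need a fitted or empirical model rather than a plain sum.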
|
46
|
Long S, Tian P. Protein secondary structure prediction with context convolutional neural network. RSC Adv 2019; 9:38391-38396. [PMID: 35540205 PMCID: PMC9075825 DOI: 10.1039/c9ra05218f] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 07/09/2019] [Accepted: 11/18/2019] [Indexed: 11/21/2022] Open
Abstract
Protein secondary structure (SS) prediction is important for studying protein structure and function. Both traditional machine learning methods and deep learning neural networks have been utilized and great progress has been achieved in approaching the theoretical limit. Convolutional and recurrent neural networks are two major types of deep learning architectures with comparable prediction accuracy but different training procedures to achieve optimal performance. We are interested in seeking a novel architectural style with competitive performance and in understanding the performance of different architectures with similar training procedures. We constructed a context convolutional neural network (Contextnet) and compared its performance with popular models (e.g. convolutional neural network, recurrent neural network, conditional neural fields…) under similar training procedures on a Jpred dataset. The Contextnet was proven to be highly competitive. Additionally, we retrained the network with the Cullpdb dataset and compared it with the Jpred, ReportX and Spider3 servers and the MUFold-SS method; the Contextnet was found to have higher Q3 accuracy on a CASP13 dataset. Training procedures were found to have a significant impact on the accuracy of the Contextnet. Protein secondary structure prediction using context convolutional neural network.
Affiliation(s)
- Pu Tian
- School of Life Science, School of Artificial Intelligence, Jilin University, 2699 Qian-jin Street, Changchun 130012, China
|
47
|
Zamora-Carreras H, Maestro B, Sanz JM, Jiménez MA. Turncoat Polypeptides: We Adapt to Our Environment. Chembiochem 2019; 21:432-441. [PMID: 31456307 DOI: 10.1002/cbic.201900446] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 07/17/2019] [Indexed: 01/25/2023]
Abstract
A common interpretation of Anfinsen's hypothesis states that one amino acid sequence should fold into a single, native, ordered state, or a highly similar set thereof, coinciding with the global minimum in the folding-energy landscape, which, in turn, is responsible for the function of the protein. However, this classical view is challenged by many proteins and peptide sequences, which can adopt exchangeable, significantly dissimilar conformations that even fulfill different biological roles. The similarities and differences of concepts related to these proteins, mainly chameleon sequences, metamorphic proteins, and switch peptides, which are all denoted herein "turncoat" polypeptides, are reviewed. As well as adding a twist to the conventional view of protein folding, the lack of structural definition adds clear versatility to the activity of proteins and can be used as a tool for protein design and further application in biotechnology and biomedicine.
Affiliation(s)
- Héctor Zamora-Carreras
- Instituto de Química-Física Rocasolano (IQFR), Consejo Superior de Investigaciones Científicas (CSIC), Serrano 119, 28006, Madrid, Spain
- Beatriz Maestro
- Centro de Investigaciones Biológicas (CIB), Consejo Superior de Investigaciones Científicas (CSIC), Ramiro de Maeztu 9, 28040, Madrid, Spain
- Jesús M Sanz
- Centro de Investigaciones Biológicas (CIB), Consejo Superior de Investigaciones Científicas (CSIC), Ramiro de Maeztu 9, 28040, Madrid, Spain; Centro de Investigación Biomédica en Red de Enfermedades Respiratorias (CIBERES), Av. Monforte de Lemos, 3-5. Pabellón, 28029, Madrid, Spain
- M Angeles Jiménez
- Instituto de Química-Física Rocasolano (IQFR), Consejo Superior de Investigaciones Científicas (CSIC), Serrano 119, 28006, Madrid, Spain
|
48
|
Sample Reduction Strategies for Protein Secondary Structure Prediction. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9204429] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
Predicting the secondary structure from protein sequence plays a crucial role in estimating the 3D structure, which has applications in drug design and in understanding the function of proteins. As new genes and proteins are discovered, the large size of the protein databases and datasets that can be used for training prediction models grows considerably. A two-stage hybrid classifier, which employs dynamic Bayesian networks and a support vector machine (SVM) has been shown to provide state-of-the-art prediction accuracy for protein secondary structure prediction. However, SVM is not efficient for large datasets due to the quadratic optimization involved in model training. In this paper, two techniques are implemented on CB513 benchmark for reducing the number of samples in the train set of the SVM. The first method randomly selects a fraction of data samples from the train set using a stratified selection strategy. This approach can remove approximately 50% of the data samples from the train set and reduce the model training time by 73.38% on average without decreasing the prediction accuracy significantly. The second method clusters the data samples by a hierarchical clustering algorithm and replaces the train set samples with nearest neighbors of the cluster centers in order to improve the training time. To cluster the feature vectors, the hierarchical clustering method is implemented, for which the number of clusters and the number of nearest neighbors are optimized as hyper-parameters by computing the prediction accuracy on validation sets. It is found that clustering can reduce the size of the train set by 26% without reducing the prediction accuracy. Among the clustering techniques Ward’s method provided the best accuracy on test data.
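The first sample reduction method in the abstract above, a stratified random selection from the train set, can be sketched roughly as below. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name `stratified_subsample`, the `fraction` and `seed` parameters, and the rounding rule are all hypothetical choices. Each class keeps about the same share of its samples, so the class balance of the reduced train set matches the original.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices of a class-balanced random subset.

    labels   : sequence of class labels, one per training sample
    fraction : share of each class to keep, in (0, 1]
    seed     : seed for the random generator (assumed, for repeatability)
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    kept = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep at least one sample so no class disappears entirely.
        n_keep = max(1, round(len(indices) * fraction))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


if __name__ == "__main__":
    # Toy labels: helix (H), strand (E), coil (C).
    toy_labels = list("H" * 10 + "E" * 6 + "C" * 4)
    subset = stratified_subsample(toy_labels, fraction=0.5)
    print([toy_labels[i] for i in subset])
```

The reduced index list would then be used to select the rows passed to the SVM trainer; the clustering-based method in the abstract replaces the random draw with nearest neighbors of cluster centers and is not shown here.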
|
49
|
Smolarczyk T, Stapor K, Roterman-Konieczna I. Backbone dihedral angles prediction servers for protein early-stage structure prediction. BIO-ALGORITHMS AND MED-SYSTEMS 2019. [DOI: 10.1515/bams-2019-0034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Three-dimensional protein structure prediction is an important task in science at the intersection of biology, chemistry, and informatics, and it is crucial for determining the protein function. In the two-stage protein folding model, based on early- and late-stage intermediates, we propose to use state-of-the-art secondary structure prediction servers for backbone dihedral angles prediction and devise an early-stage structure. Early-stage structures are used as a starting point for protein folding simulations, and any errors in this stage affect the final predictions. We have shown that modern secondary structure prediction servers could increase the accuracy of early-stage predictions compared to previously reported models.
Affiliation(s)
- Tomasz Smolarczyk
- Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, Poland
- Katarzyna Stapor
- Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, Poland
- Irena Roterman-Konieczna
- Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Kraków, Poland
|
50
|
A Bi-LSTM Based Ensemble Algorithm for Prediction of Protein Secondary Structure. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9173538] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/17/2022]
Abstract
The prediction of protein secondary structure continues to be an active area of research in bioinformatics. In this paper, a Bi-LSTM based ensemble model is developed for the prediction of protein secondary structure. The ensemble model with dual loss function consists of five sub-models, which are finally joined by a Bi-LSTM layer. In contrast to existing ensemble methods, which generally train each sub-model and then join them as a whole, this ensemble model and sub-models can be trained simultaneously and the performance of each model can be observed and compared during the training process. Three independent test sets (data1199, the 513-protein Cuff & Barton set (CB513), and 203 proteins from the Critical Assessment of protein Structure Prediction (CASP203)) are employed to test the method. On average, the ensemble model achieved 84.3% in Q3 accuracy and 81.9% in segment overlap measure (SOV) score by using 10-fold cross validation. There is an improvement of up to 1% over some state-of-the-art prediction methods of protein secondary structure.
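For readers unfamiliar with the Q3 metric reported above, it is the fraction of residues whose predicted three-state label (helix, strand, coil) matches the reference label. The sketch below is a minimal illustration; the function name `q3_accuracy` and the H/E/C letter encoding are assumptions made here, and the paper's own evaluation code is not shown in the abstract. The SOV score is segment-based and more involved, so it is not reproduced.

```python
def q3_accuracy(reference: str, predicted: str) -> float:
    """Fraction of positions where the three-state labels agree.

    Both strings use one letter per residue: H (helix), E (strand), C (coil).
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    allowed = set("HEC")
    if not (set(reference) <= allowed and set(predicted) <= allowed):
        raise ValueError("labels must be H, E, or C")
    # Count per-residue agreements and normalize by chain length.
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


if __name__ == "__main__":
    # Toy example: one of eight residues is mispredicted.
    print(q3_accuracy("HHHEECCC", "HHHEEECC"))  # 0.875
```

When scoring a whole test set, Q3 is usually computed over all residues pooled together or averaged per protein; the two conventions give slightly different numbers, so it is worth checking which one a paper uses.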
|