1
|
Zheng W. Predicting hotspots for disease-causing single nucleotide variants using sequences-based coevolution, network analysis, and machine learning. PLoS One 2024; 19:e0302504. [PMID: 38743747 PMCID: PMC11093321 DOI: 10.1371/journal.pone.0302504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Accepted: 04/05/2024] [Indexed: 05/16/2024] Open
Abstract
To enable personalized medicine, it is important yet highly challenging to accurately predict disease-causing mutations in target proteins at high throughput. Previous computational methods have been developed using evolutionary information in combination with various biochemical and structural features of protein residues to discriminate neutral vs. deleterious mutations. However, the power of these methods is often limited because they either assume known protein structures or treat residues independently without fully considering their interactions. To address the above limitations, we build upon recent progress in machine learning, network analysis, and protein language models, and develop a sequences-based variant site prediction workflow based on the protein residue contact networks: 1. We employ and integrate various methods of building protein residue networks using state-of-the-art coevolution analysis tools (RaptorX, DeepMetaPSICOV, and SPOT-Contact) powered by deep learning. 2. We use machine learning algorithms (Random Forest, Gradient Boosting, and Extreme Gradient Boosting) to optimally combine 20 network centrality scores to jointly predict key residues as hot spots for disease mutations. 3. Using a dataset of 107 proteins rich in disease mutations, we rigorously evaluate the network scores individually and collectively (via machine learning). This work supports a promising strategy of combining an ensemble of network scores based on different coevolution analysis methods (and optionally predictive scores from other methods) via machine learning to predict hotspot sites of disease mutations, which will inform downstream applications of disease diagnosis and targeted drug design.
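The network-scoring step described in this abstract can be illustrated with a minimal sketch. It assumes predicted contact probabilities are available as a dictionary keyed by residue index pairs, and it computes only degree centrality, one standard centrality score; the abstract does not list which 20 scores are used. The function names, the 0.5 probability threshold, and the minimum sequence separation of 3 are illustrative assumptions, not values taken from the paper.

```python
def contact_graph(probs, threshold=0.5, min_separation=3):
    """Build an undirected residue graph from predicted contact probabilities.

    probs maps (i, j) residue index pairs to a probability in [0, 1].
    threshold and min_separation are illustrative choices (assumptions).
    """
    n = 1 + max(max(i, j) for i, j in probs)
    neighbors = {r: set() for r in range(n)}
    for (i, j), p in probs.items():
        # Keep confident contacts between residues that are not sequence neighbors.
        if abs(i - j) >= min_separation and p >= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def degree_centrality(neighbors):
    """Fraction of the other residues that each residue contacts."""
    n = len(neighbors)
    return {r: len(nb) / (n - 1) for r, nb in neighbors.items()}


# Toy input: hypothetical probabilities for a six-residue chain.
probs = {(0, 4): 0.9, (0, 5): 0.8, (1, 5): 0.3, (2, 5): 0.7, (0, 1): 0.99}
scores = degree_centrality(contact_graph(probs))
ranked = sorted(scores, key=scores.get, reverse=True)
```

In a workflow like the one summarized above, per-residue scores of this kind would form the feature vector passed to a classifier; that step is omitted here.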
Affiliation(s)
- Wenjun Zheng
- Department of Physics, State University of New York at Buffalo, Buffalo, NY, United States of America
|
2
|
Wang X, Li A, Li X, Cui H. Empowering Protein Engineering through Recombination of Beneficial Substitutions. Chemistry 2024; 30:e202303889. [PMID: 38288640 DOI: 10.1002/chem.202303889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Indexed: 02/24/2024]
Abstract
Directed evolution stands as a seminal technology for generating novel protein functionalities, a cornerstone in biocatalysis, metabolic engineering, and synthetic biology. Today, with the development of various mutagenesis methods and advanced analytical machines, the challenge of diversity generation and high-throughput screening platforms is largely solved, and one of the remaining challenges is: how to empower the potential of single beneficial substitutions with recombination to achieve the epistatic effect. This review overviews experimental and computer-assisted recombination methods in protein engineering campaigns. In addition, integrated and machine learning-guided strategies were highlighted to discuss how these recombination approaches contribute to generating the screening library with better diversity, coverage, and size. A decision tree was finally summarized to guide the further selection of proper recombination strategies in practice, which was beneficial for accelerating protein engineering.
Affiliation(s)
- Xinyue Wang
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
| | - Anni Li
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
| | - Xiujuan Li
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
| | - Haiyang Cui
- School of Life Sciences, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
|
3
|
Xu M, Abdullah NA, Md Sabri AQ. A method to improve the prediction performance of cancer-gene association by screening negative training samples through gene network data. Comput Biol Chem 2024; 108:107997. [PMID: 38154318 DOI: 10.1016/j.compbiolchem.2023.107997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2023] [Revised: 11/03/2023] [Accepted: 12/03/2023] [Indexed: 12/30/2023]
Abstract
This work focuses on data sampling in cancer-gene association prediction. Currently, researchers are using machine learning methods to predict genes that are more likely to produce cancer-causing mutations. To improve the performance of machine learning models, methods have been proposed, one of which is to improve the quality of the training data. Existing methods focus mainly on positive data, i.e. cancer driver genes, for screening selection. This paper proposes a low-cancer-related gene screening method based on gene network and graph theory algorithms to improve the selection of negative samples. Genetic data with low cancer correlation is used as negative training samples. After experimental verification, using the negative samples screened by this method to train the cancer gene classification model can improve prediction performance. The biggest advantage of this method is that it can be easily combined with other methods that focus on enhancing the quality of positive training samples. It has been demonstrated that significant improvement is achieved by combining this method with three state-of-the-art cancer gene prediction methods.
Affiliation(s)
- Mingzhe Xu
- Faculty of Computer Science & Information Technology, Universiti Malaya, Kuala Lumpur, 50603 Malaysia; School of Energy and Intelligence Engineering, Henan University of Animal Husbandry and Economy, #6 North Longzihu Rd, Zhengzhou 450000, China.
| | - Nor Aniza Abdullah
- Faculty of Computer Science & Information Technology, Universiti Malaya, Kuala Lumpur, 50603 Malaysia.
| | - Aznul Qalid Md Sabri
- Faculty of Computer Science & Information Technology, Universiti Malaya, Kuala Lumpur, 50603 Malaysia.
|
4
|
Zhao C, Wang S. AttCON: With better MSAs and attention mechanism for accurate protein contact map prediction. Comput Biol Med 2024; 169:107822. [PMID: 38091726 DOI: 10.1016/j.compbiomed.2023.107822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 11/19/2023] [Accepted: 12/04/2023] [Indexed: 02/08/2024]
Abstract
Protein contact map prediction is a critical and vital step in protein structure prediction, and its accuracy is highly contingent upon the feature representations of protein sequence information and the efficacy of deep learning models. In this paper, we propose an algorithm, DeepMSA+, to generate protein multiple sequence alignments (MSAs) and to construct feature representations based on co-evolutionary information and sequence information derived from MSAs. We also propose an improved deep learning model, AttCON, for training input features to predict protein contact map. The model incorporates an attention module, and by comparing different attention modules, we find a parameter-free attention module suitable for contact map prediction. Additionally, we use the Focal Loss function to better address the data imbalance issue in protein contact map. We also developed a weighted evaluation index (W score) for model evaluation, which takes into account a wide range of metrics. W score is comprehensive in its scope, with a particular focus on the precision of predictions for medium-range and long-range contacts. Experimental results show that AttCON achieves good precision results on datasets from CASP11 to CASP15. Compared to some state-of-the-art methods, it achieves an average improvement of over 5% in both medium-range and long-range predictions, and W score is improved by an average of 2 points.
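The Focal Loss mentioned in this abstract addresses class imbalance (most residue pairs are not in contact) by down-weighting examples the model already classifies well. A minimal sketch of the standard binary form follows. The defaults gamma=2.0 and alpha=0.25 are the commonly used values from the original focal loss formulation and are assumptions here; the abstract does not state which settings AttCON uses.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive class (a contact).
    y: true label, 1 for contact and 0 for non-contact.
    gamma, alpha: focusing and class-balance parameters (assumed defaults).
    """
    p_t = p if y == 1 else 1.0 - p               # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-dependent weight
    p_t = min(max(p_t, 1e-12), 1.0)              # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction contributes far less than a confident mistake.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

With gamma set to 0 and alpha set to 1 for the positive class, the expression reduces to ordinary cross-entropy, which makes the effect of the focusing term easy to isolate.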
Affiliation(s)
- Che Zhao
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China
| | - Shunfang Wang
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China; Yunnan Key Laboratory of Intelligent Systems and Computing, Yunnan University, Kunming, 650504, Yunnan, China.
|
5
|
Montezano D, Bernstein R, Copeland MM, Slusky JSG. General features of transmembrane beta barrels from a large database. Proc Natl Acad Sci U S A 2023; 120:e2220762120. [PMID: 37432995 PMCID: PMC10629564 DOI: 10.1073/pnas.2220762120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2022] [Accepted: 06/03/2023] [Indexed: 07/13/2023] Open
Abstract
Large datasets contribute new insights to subjects formerly investigated by exemplars. We used coevolution data to create a large, high-quality database of transmembrane β-barrels (TMBB). By applying simple feature detection on generated evolutionary contact maps, our method (IsItABarrel) achieves 95.88% balanced accuracy when discriminating among protein classes. Moreover, comparison with IsItABarrel revealed a high rate of false positives in previous TMBB algorithms. In addition to being more accurate than previous datasets, our database (available online) contains 1,938,936 bacterial TMBB proteins from 38 phyla, respectively, 17 and 2.2 times larger than the previous sets TMBB-DB and OMPdb. We anticipate that due to its quality and size, the database will serve as a useful resource where high-quality TMBB sequence data are required. We found that TMBBs can be divided into 11 types, three of which have not been previously reported. We find tremendous variance in proteome percentage among TMBB-containing organisms with some using 6.79% of their proteome for TMBBs and others using as little as 0.27% of their proteome. The distribution of the lengths of the TMBBs is suggestive of previously hypothesized duplication events. In addition, we find that the C-terminal β-signal varies among different classes of bacteria though its consensus sequence is LGLGYRF. However, this β-signal is only characteristic of prototypical TMBBs. The ten non-prototypical barrel types have other C-terminal motifs, and it remains to be determined if these alternative motifs facilitate TMBB insertion or perform any other signaling function.
Affiliation(s)
- Daniel Montezano
- Computational Biology Program, University of Kansas, Lawrence, KS 66045
| | - Rebecca Bernstein
- Computational Biology Program, University of Kansas, Lawrence, KS 66045
| | | | - Joanna S. G. Slusky
- Computational Biology Program, University of Kansas, Lawrence, KS 66045
- Department of Molecular Biosciences, University of Kansas, Lawrence, KS 66045
|
6
|
Meyer L, Crocoll C, Halkier BA, Mirza OA, Xu D. Identification of key amino acid residues in AtUMAMIT29 for transport of glucosinolates. Frontiers in Plant Science 2023; 14:1219783. [PMID: 37528977 PMCID: PMC10388549 DOI: 10.3389/fpls.2023.1219783] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/09/2023] [Accepted: 06/08/2023] [Indexed: 08/03/2023]
Abstract
Glucosinolates are key defense compounds of plants in Brassicales order, and their accumulation in seeds is essential for the protection of the next generation. Recently, members of the Usually Multiple Amino acids Move In and Out Transporter (UMAMIT) family were shown to be essential for facilitating transport of seed-bound glucosinolates from site of synthesis within the reproductive organ to seeds. Here, we set out to identify amino acid residues responsible for glucosinolate transport activity of the main seed glucosinolate exporter UMAMIT29 in Arabidopsis thaliana. Based on a predicted model of UMAMIT29, we propose that the substrate transporting cavity consists of 51 residues, of which four are highly conserved residues across all the analyzed homologs of UMAMIT29. A comparison of the putative substrate binding site of homologs within the brassicaceous-specific, glucosinolate-transporting clade with the non-brassicaceous-specific, non-glucosinolate-transporting UMAMIT32 clade identified 11 differentially conserved sites. When each of the 11 residues of UMAMIT29 was individually mutated into the corresponding residue in UMAMIT32, five mutant variants (UMAMIT29#V27F, UMAMIT29#M86V, UMAMIT29#L109V, UMAMIT29#Q263S, and UMAMIT29#T267Y) reduced glucosinolate transport activity over 75% compared to wild-type UMAMIT29. This suggests that these residues are key for UMAMIT29-mediated glucosinolate transport activity and thus potential targets for blocking the transport of glucosinolates to the seeds.
Affiliation(s)
- Lasse Meyer
- Department of Plant and Environmental Sciences, Faculty of Science, University of Copenhagen, Frederiksberg, Denmark
| | - Christoph Crocoll
- Department of Plant and Environmental Sciences, Faculty of Science, University of Copenhagen, Frederiksberg, Denmark
| | - Barbara Ann Halkier
- Department of Plant and Environmental Sciences, Faculty of Science, University of Copenhagen, Frederiksberg, Denmark
| | - Osman Asghar Mirza
- Department of Drug Design and Pharmacology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Deyang Xu
- Department of Plant and Environmental Sciences, Faculty of Science, University of Copenhagen, Frederiksberg, Denmark
|
7
|
The SspB adaptor drives structural changes in the AAA+ ClpXP protease during ssrA-tagged substrate delivery. Proc Natl Acad Sci U S A 2023; 120:e2219044120. [PMID: 36730206 PMCID: PMC9963277 DOI: 10.1073/pnas.2219044120] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 02/03/2023] Open
Abstract
Energy-dependent protein degradation by the AAA+ ClpXP protease helps maintain protein homeostasis in bacteria and eukaryotic organelles of bacterial origin. In Escherichia coli and many other proteobacteria, the SspB adaptor assists ClpXP in degrading ssrA-tagged polypeptides produced as a consequence of tmRNA-mediated ribosome rescue. By tethering these incomplete ssrA-tagged proteins to ClpXP, SspB facilitates their efficient degradation at low substrate concentrations. How this process occurs structurally is unknown. Here, we present a cryo-EM structure of the SspB adaptor bound to a GFP-ssrA substrate and to ClpXP. This structure provides evidence for simultaneous contacts of SspB and ClpX with the ssrA tag within the tethering complex, allowing direct substrate handoff concomitant with the initiation of substrate translocation. Furthermore, our structure reveals that binding of the substrate·adaptor complex induces unexpected conformational changes within the spiral structure of the AAA+ ClpX hexamer and its interaction with the ClpP tetradecamer.
|
8
|
Newman KE, Tindall SN, Mader SL, Khalid S, Thomas GH, Van Der Woude MW. A novel fold for acyltransferase-3 (AT3) proteins provides a framework for transmembrane acyl-group transfer. eLife 2023; 12:e81547. [PMID: 36630168 PMCID: PMC9833829 DOI: 10.7554/elife.81547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Accepted: 12/04/2022] [Indexed: 01/12/2023] Open
Abstract
Acylation of diverse carbohydrates occurs across all domains of life and can be catalysed by proteins with a membrane-bound acyltransferase-3 (AT3) domain (PF01757). In bacteria, these proteins are essential in processes including symbiosis, resistance to viruses and antimicrobials, and biosynthesis of antibiotics, yet their structure and mechanism are largely unknown. In this study, evolutionary co-variance analysis was used to build a computational model of the structure of a bacterial O-antigen modifying acetyltransferase, OafB. The resulting structure exhibited a novel fold for the AT3 domain, which molecular dynamics simulations demonstrated is stable in the membrane. The AT3 domain contains 10 transmembrane helices arranged to form a large cytoplasmic cavity lined by residues known to be essential for function. Further molecular dynamics simulations support a model where the acyl-coA donor spans the membrane through accessing a pore created by movement of an important loop capping the inner cavity, enabling OafB to present the acetyl group close to the likely catalytic residues on the extracytoplasmic surface. Limited but important interactions with the fused SGNH domain in OafB are identified, and modelling suggests this domain is mobile and can both accept acyl-groups from the AT3 and then reach beyond the membrane to reach acceptor substrates. Together this new general model of AT3 function provides a framework for the development of inhibitors that could abrogate critical functions of bacterial pathogens.
Affiliation(s)
- Kahlan E Newman
- School of Chemistry, University of Southampton, Southampton, United Kingdom
| | - Sarah N Tindall
- Department of Biology and the York Biomedical Research Institute, University of York, York, United Kingdom
| | - Sophie L Mader
- Department of Biochemistry, University of Oxford, Oxford, United Kingdom
| | - Syma Khalid
- Department of Biochemistry, University of Oxford, Oxford, United Kingdom
| | - Gavin H Thomas
- Department of Biology and the York Biomedical Research Institute, University of York, York, United Kingdom
| | - Marjan W Van Der Woude
- Hull York Medical School and the York Biomedical Research Institute, University of York, York, United Kingdom
|
9
|
Mufassirin MMM, Newton MAH, Sattar A. Artificial intelligence for template-free protein structure prediction: a comprehensive review. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10350-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
|
10
|
Weyer R, Hellmann MJ, Hamer-Timmermann SN, Singh R, Moerschbacher BM. Customized chitooligosaccharide production-controlling their length via engineering of rhizobial chitin synthases and the choice of expression system. Front Bioeng Biotechnol 2022; 10:1073447. [PMID: 36588959 PMCID: PMC9795070 DOI: 10.3389/fbioe.2022.1073447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2022] [Accepted: 11/28/2022] [Indexed: 12/15/2022] Open
Abstract
Chitooligosaccharides (COS) have attracted attention from industry and academia in various fields due to their diverse bioactivities. However, their conventional chemical production is environmentally unfriendly and, in addition, defined and pure molecules are both scarce and expensive. A promising alternative is the in vivo synthesis of desired COS in microbial platforms with specific chitin synthases enabling a more sustainable production. Hence, we examined the whole cell factory approach with two well-established microorganisms (Escherichia coli and Corynebacterium glutamicum) to produce defined COS with the chitin synthase NodC from Rhizobium sp. GRH2. Moreover, based on an in silico model of the synthase, two amino acids potentially relevant for COS length were identified and mutated to direct the production. Experimental validation showed the influence of the expression system, the mutations, and their combination on COS length, steering the production from originally pentamers towards tetramers or hexamers, the latter virtually pure. Possible explanations are given by molecular dynamics simulations. These findings pave the way for a better understanding of chitin synthases, thus allowing a more targeted production of defined COS. This will, in turn, at first allow better research of COS' bioactivities, and subsequently enable sustainable large-scale production of oligomers.
|
11
|
Newton MH, Zaman R, Mataeimoghadam F, Rahman J, Sattar A. Constraint Guided Beta-Sheet Refinement for Protein Structure Prediction. Comput Biol Chem 2022; 101:107773. [DOI: 10.1016/j.compbiolchem.2022.107773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Revised: 09/15/2022] [Accepted: 09/16/2022] [Indexed: 11/16/2022]
|
12
|
An J, Weng X. Collectively encoding protein properties enriches protein language models. BMC Bioinformatics 2022; 23:467. [DOI: 10.1186/s12859-022-05031-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Accepted: 10/31/2022] [Indexed: 11/10/2022] Open
Abstract
Pre-trained natural language processing models on a large natural language corpus can naturally transfer learned knowledge to protein domains by fine-tuning specific in-domain tasks. However, few studies focused on enriching such protein language models by jointly learning protein properties from strongly-correlated protein tasks. Here we elaborately designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks for protein family, superfamily and fold. Considering the co-existing contextual relevance between human words and protein language, we employed BERT, pre-trained on a large natural language corpus, as our backbone to handle protein sequences. More importantly, the encoded knowledge obtained in the MTL stage can be well transferred to more fine-grained downstream tasks of TAPE. Experiments on structure- or evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.
|
13
|
Gill ML. The rise of the machines in chemistry. Magnetic Resonance in Chemistry: MRC 2022; 60:1044-1051. [PMID: 35976263 DOI: 10.1002/mrc.5304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2021] [Revised: 08/07/2022] [Accepted: 08/09/2022] [Indexed: 06/15/2023]
Abstract
The use of artificial intelligence and, more specifically, deep learning methods in chemistry is becoming increasingly common. Applications in informatics fields, such as cheminformatics and proteomics, structural biology, and spectroscopy, including NMR, are on the rise. Recent developments in model architectures, such as graph convolutional neural networks and transformers, have been enabled by advancements in computational hardware and software. However, model architectures with more predictive power often require larger amounts of training data, which can be challenging to acquire, but this requirement can be mitigated through techniques like pretraining and fine-tuning. In spite of these successes, challenges remain, such as normalization and scaling of data, availability of experimentally acquired data, and model explainability.
|
14
|
Omoboyede V, Ibrahim O, Umar HI, Bello T, Adedeji AA, Khalid A, Fayojegbe ES, Ayomide AB, Chukwuemeka PO. Designing a vaccine-based therapy against Epstein-Barr virus-associated tumors using immunoinformatics approach. Comput Biol Med 2022; 150:106128. [PMID: 36179514 DOI: 10.1016/j.compbiomed.2022.106128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 08/05/2022] [Accepted: 09/18/2022] [Indexed: 11/26/2022]
Abstract
Epstein-Barr virus (EBV) is widely known due to its role in the etiology of infectious mononucleosis. However, it is the first oncovirus that was identified and has been implicated in the etiology of several types of cancers. Globally, EBV infection is associated with more than 200,000 new cancer cases and 150,000 deaths yearly. A prophylactic or therapeutic vaccine targeting tumors associated with EBV infection is currently lacking. Therefore, this study aimed to develop a multiepitope-based polyvalent vaccine against EBV-associated tumors using an immunoinformatics approach. The latency-associated proteins (LAP) of three strains of the virus were used in this study. Potential epitopes predicted from the proteins were analyzed and selected based on several predicted properties. Thirty viable B-cell and T-cell epitopes were selected and conjugated using various linkers alongside beta-defensin 3 as an adjuvant and pan HLA DR-binding epitope (PADRE) sequence to improve the immunogenicity of the vaccine construct. Molecular docking studies of the vaccine construct against toll-like receptors (TLRs) showed it is capable of inducing immune response via recognition by TLRs while immune simulation studies showed it could induce both cellular and humoral immune responses. Furthermore, molecular dynamics study of the complex formed by the vaccine candidate and TLR-4 showed that the complex was stable. Ultimately, the designed vaccine showed desirable properties based on in silico evaluation; however, experimental studies are needed to validate the efficacy of the vaccine against EBV-associated tumors.
Affiliation(s)
- Victor Omoboyede
- Department of Biochemistry, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria; Computer Aided Therapeutics Laboratory (CATL) Group, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria; Computer Aided Therapeutics and Drug Design (CATDD) Group, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria.
| | - Ochapa Ibrahim
- Computer Aided Therapeutics and Drug Design (CATDD) Group, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria; Faculty of Pharmaceutical Sciences, Ahmadu Bello University, Zaria, Kaduna State, Nigeria.
| | - Haruna Isiyaku Umar
- Department of Biochemistry, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria; Computer Aided Therapeutics and Drug Design (CATDD) Group, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria.
| | - Taye Bello
- Department of Medical Rehabilitation, College of Health Sciences, Obafemi Awolowo University, Nigeria.
| | - Ayodeji Adeola Adedeji
- Department of Biochemistry, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria.
| | - Aqsa Khalid
- Research Center for Modelling and Simulation (RCMS), National University of Science and Technology (NUST), Islamabad, Pakistan.
| | | | - Adunola Blessing Ayomide
- Computer Aided Therapeutics Laboratory (CATL) Group, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria; Department of Biotechnology, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria.
| | - Prosper Obed Chukwuemeka
- Computer Aided Therapeutics Laboratory (CATL) Group, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria; Computer Aided Therapeutics and Drug Design (CATDD) Group, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria; Department of Biotechnology, School of Sciences (SOS), Federal University of Technology Akure, P.M.B 704, Akure, Nigeria.
|
15
|
Yue R, Dutta A. Computational systems biology in disease modeling and control, review and perspectives. NPJ Syst Biol Appl 2022; 8:37. [PMID: 36192551 PMCID: PMC9528884 DOI: 10.1038/s41540-022-00247-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 03/30/2022] [Accepted: 09/05/2022] [Indexed: 02/02/2023] Open
Abstract
Omics-based approaches have become increasingly influential in identifying disease mechanisms and drug responses. Considering that diseases and drug responses are co-expressed and regulated in the relevant omics data interactions, the traditional way of grabbing omics data from single isolated layers cannot always obtain valuable inference. Also, drugs have adverse effects that may impair patients, and launching new medicines for diseases is costly. To resolve the above difficulties, systems biology is applied to predict potential molecular interactions by integrating omics data from genomic, proteomic, transcriptional, and metabolic layers. Combined with known drug reactions, the resulting models improve medicines' therapeutical performance by re-purposing the existing drugs and combining drug molecules without off-target effects. Based on the identified computational models, drug administration control laws are designed to balance toxicity and efficacy. This review introduces biomedical applications and analyses of interactions among gene, protein and drug molecules for modeling disease mechanisms and drug responses. The therapeutical performance can be improved by combining the predictive and computational models with drug administration designed by control laws. The challenges are also discussed for its clinical uses in this work.
Affiliation(s)
- Rongting Yue
- Department of Electrical and Computer Engineering, University of Connecticut, 371 Fairfield Way, Storrs, CT, 06269, USA.
| | - Abhishek Dutta
- Department of Electrical and Computer Engineering, University of Connecticut, 371 Fairfield Way, Storrs, CT, 06269, USA
|
16
|
Behkamal B, Naghibzadeh M, Pagnani A, Saberi MR, Al Nasr K. LPTD: a novel linear programming-based topology determination method for cryo-EM maps. Bioinformatics 2022; 38:2734-2741. [PMID: 35561171 PMCID: PMC9306757 DOI: 10.1093/bioinformatics/btac170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2021] [Revised: 03/01/2022] [Accepted: 03/18/2022] [Indexed: 02/03/2023] Open
Abstract
SUMMARY: Topology determination is one of the most important intermediate steps toward building the atomic structure of proteins from their medium-resolution cryo-electron microscopy (cryo-EM) map. The main goal in the topology determination is to identify correct matches (i.e. assignment and direction) between secondary structure elements (SSEs) (α-helices and β-sheets) detected in a protein sequence and cryo-EM density map. Despite many recent advances in molecular biology technologies, the problem remains a challenging issue. To overcome the problem, this article proposes a linear programming-based topology determination (LPTD) method to solve the secondary structure topology problem in three-dimensional geometrical space. Through modeling of the protein's sequence with the aid of extracting highly reliable features and a distance-based scoring function, the secondary structure matching problem is transformed into a complete weighted bipartite graph matching problem. Subsequently, an algorithm based on linear programming is developed as a decision-making strategy to extract the true topology (native topology) between all possible topologies. The proposed automatic framework is verified using 12 experimental and 15 simulated α-β proteins. Results demonstrate that LPTD is highly efficient and extremely fast in such a way that for 77% of cases in the dataset, the native topology has been detected in the first rank topology in <2 s. Besides, this method is able to successfully handle large complex proteins with as many as 65 SSEs. Such a large number of SSEs have never been solved with current tools/methods. AVAILABILITY AND IMPLEMENTATION: The LPTD package (source code and data) is publicly available at https://github.com/B-Behkamal/LPTD. Moreover, two test samples as well as the instruction of utilizing the graphical user interface have been provided in the shared readme file. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bahareh Behkamal
- Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad 9177948974, Iran
| | - Mahmoud Naghibzadeh
- Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad 9177948974, Iran
| | - Andrea Pagnani
- Department of Applied Science and Technology (DISAT), Politecnico di Torino, Torino I-10129, Italy
- Italian Institute for Genomic Medicine (IIGM), IRCCS-Candiolo, Candiolo (TO) I-10060, Italy
- INFN Sezione di Torino, Torino I-10125, Italy
| | - Mohammad Reza Saberi
- Medicinal Chemistry Department, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad 9177899191, Iran
- Bioinformatics Research Group, Mashhad University of Medical Sciences, Mashhad 9177899191, Iran
| | - Kamal Al Nasr
- Department of Computer Science, Tennessee State University, Nashville, TN 37209, USA
| |
|
17
|
Newton MAH, Rahman J, Zaman R, Sattar A. Enhancing Protein Contact Map Prediction Accuracy via Ensembles of Inter-Residue Distance Predictors. Comput Biol Chem 2022; 99:107700. [DOI: 10.1016/j.compbiolchem.2022.107700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2022] [Revised: 05/19/2022] [Accepted: 05/19/2022] [Indexed: 11/03/2022]
|
18
|
Santra S, Jana M. Predicting the evolution of number of native contacts of a small protein by using deep learning approach. Comput Biol Chem 2022; 97:107625. [DOI: 10.1016/j.compbiolchem.2022.107625] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/14/2021] [Revised: 01/07/2022] [Accepted: 01/09/2022] [Indexed: 11/28/2022]
|
19
|
Gaur A, Jindal Y, Singh V, Tiwari R, Kumar D, Kaushik D, Singh J, Narwal S, Jaiswal S, Iquebal MA, Angadi UB, Singh G, Rai A, Singh GP, Sheoran S. GWAS to Identify Novel QTNs for WSCs Accumulation in Wheat Peduncle Under Different Water Regimes. Front Plant Sci 2022; 13:825687. [PMID: 35310635 PMCID: PMC8928439 DOI: 10.3389/fpls.2022.825687] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/30/2021] [Accepted: 01/27/2022] [Indexed: 05/27/2023]
Abstract
Water-soluble carbohydrates (WSCs) play a vital role in water stress avoidance and buffering wheat grain yield. However, the genetic architecture of stem WSC accumulation is only partially understood, and few candidate genes are known. This study utilizes the compressed mixed linear model-based genome wide association study (GWAS) and heuristic post GWAS analyses to identify causative quantitative trait nucleotides (QTNs) and candidate genes for stem WSC content at 15 days after anthesis under different water regimes (irrigated, rainfed, and drought). Glucose, fructose, sucrose, fructans, total non-structural carbohydrates (the sum of individual sugars), and total WSCs (anthrone based), quantified in the peduncle of 301 bread wheat genotypes under multiple environments (E01-E08) pertaining to different water regimes, and 14,571 SNPs from the "35K Axiom Wheat Breeders" Array were used for analysis. As a result, 570 significant nucleotide trait associations were identified on all chromosomes except for 4D, of which 163 were considered stable. A total of 112 quantitative trait nucleotide regions (QNRs) were identified, of which 47 were presumably novel. QNRs qWSC-3B.2 and qWSC-7A.2 were identified as the hotspots. Post GWAS integration of multiple data resources prioritized 208 putative candidate genes delimited into 64 QNRs, which can be critical in understanding the genetic architecture of stem WSC accumulation in wheat under optimum and water-stressed environments. At least 19 stable QTNs were found associated with 24 prioritized candidate genes. Clusters of fructan metabolic genes were reported in the QNRs qWSC-4A.2 and qWSC-7A.2. These genes can be utilized to bring together an optimum combination of various fructan metabolic genes to improve the accumulation and remobilization of stem WSCs and water stress tolerance. These results will further strengthen wheat breeding programs targeting sustainable wheat production under limited water conditions.
Affiliation(s)
- Arpit Gaur
- Department of Genetics and Plant Breeding, CCS Haryana Agricultural University, Hisar, India
- ICAR-Indian Institute of Wheat and Barley Research, Karnal, India
| | - Yogesh Jindal
- Department of Genetics and Plant Breeding, CCS Haryana Agricultural University, Hisar, India
| | - Vikram Singh
- Department of Genetics and Plant Breeding, CCS Haryana Agricultural University, Hisar, India
| | - Ratan Tiwari
- ICAR-Indian Institute of Wheat and Barley Research, Karnal, India
| | - Dinesh Kumar
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Deepak Kaushik
- Department of Genetics and Plant Breeding, CCS Haryana Agricultural University, Hisar, India
| | - Jogendra Singh
- ICAR-Central Soil Salinity Research Institute, Karnal, India
| | - Sneh Narwal
- ICAR-Indian Agricultural Research Institute, New Delhi, India
| | - Sarika Jaiswal
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Mir Asif Iquebal
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Ulavapp B. Angadi
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Gyanendra Singh
- ICAR-Indian Institute of Wheat and Barley Research, Karnal, India
| | - Anil Rai
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | | | - Sonia Sheoran
- ICAR-Indian Institute of Wheat and Barley Research, Karnal, India
| |
|
20
|
Tran NH, Xu J, Li M. A tale of solving two computational challenges in protein science: neoantigen prediction and protein structure prediction. Brief Bioinform 2022; 23:bbab493. [PMID: 34891158 PMCID: PMC8769896 DOI: 10.1093/bib/bbab493] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 09/08/2021] [Revised: 10/11/2021] [Accepted: 10/26/2021] [Indexed: 12/30/2022] Open
Abstract
In this article, we review two challenging computational questions in protein science: neoantigen prediction and protein structure prediction. Both topics have seen significant leaps forward by deep learning within the past five years, which immediately unlocked new developments of drugs and immunotherapies. We show that deep learning models offer unique advantages, such as representation learning and multi-layer architecture, which make them an ideal choice to leverage a huge amount of protein sequence and structure data to address those two problems. We also discuss the impact and future possibilities enabled by those two applications, especially how the data-driven approach by deep learning shall accelerate the progress towards personalized biomedicine.
Affiliation(s)
| | - Jinbo Xu
- Toyota Technological Institute at Chicago, USA
| | - Ming Li
- University of Waterloo, Canada
| |
|
21
|
Peng CX, Zhou XG, Zhang GJ. De novo Protein Structure Prediction by Coupling Contact With Distance Profile. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:395-406. [PMID: 32750861 DOI: 10.1109/tcbb.2020.3000758] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/11/2023]
Abstract
De novo protein structure prediction is a challenging problem that requires both an accurate energy function and an efficient conformation sampling method. In this study, a de novo structure prediction method, named CoDiFold, is proposed. In CoDiFold, contacts and distance profiles are organically combined into the Rosetta low-resolution energy function to improve the accuracy of energy function. As a result, the correlation between energy and root mean square deviation (RMSD) is improved. In addition, a population-based multi-mutation strategy is designed to balance the exploration and exploitation of conformation space sampling. The average RMSD of the models generated by the proposed protocol is decreased by 49.24 and 45.21 percent in the test set with 43 proteins compared with those of Rosetta and QUARK de novo protocols, respectively. The results also demonstrate that the structures predicted by proposed CoDiFold are comparable to the state-of-the-art methods for the 10 FM targets of CASP13. The source code and executable versions are freely available at http://github.com/iobio-zjut/CoDiFold.
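For readers unfamiliar with the RMSD metric quoted in this abstract, the following is a minimal sketch of the coordinate root mean square deviation between two equally sized sets of atoms. It assumes the two structures have already been optimally superimposed; structure comparison tools normally perform that superposition first, and this step is omitted here. The function name and the toy coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two pre-superimposed structures.

    coords_a and coords_b are equal-length sequences of (x, y, z) tuples,
    with atom i in one structure corresponding to atom i in the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(coords_a, coords_b)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(coords_a))


# Toy example: every atom is displaced by exactly 1 unit along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
native = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model, native))
```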
|
22
|
Rahbar MR, Jahangiri A, Khalili S, Zarei M, Mehrabani-Zeinabad K, Khalesi B, Pourzardosht N, Hessami A, Nezafat N, Sadraei S, Negahdaripour M. Hotspots for mutations in the SARS-CoV-2 spike glycoprotein: a correspondence analysis. Sci Rep 2021; 11:23622. [PMID: 34880279 PMCID: PMC8654821 DOI: 10.1038/s41598-021-01655-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/24/2021] [Accepted: 11/01/2021] [Indexed: 12/19/2022] Open
Abstract
Spike glycoprotein (Sgp) is responsible for binding of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) to the host receptors. Since Sgp is the main target for vaccine and drug design, elucidating its mutation pattern could help in this regard. This study is aimed at investigating the correspondence of specific residues to the functionality of the SARS-CoV-2 Sgp by explorative interpretation of sequence alignments. Centrality analysis of the Sgp dissects the importance of these residues in the interaction network of the RBD (receptor-binding domain)-ACE2 complex and the furin cleavage site. Correspondence of RBD to threonine500 and asparagine501 and of the furin cleavage site to glutamine675, glutamine677, threonine678, and alanine684 was observed; all residues are located exactly at the interaction interfaces. The harmonious location of residues dictates the RBD binding property and the flexibility, hydrophobicity, and accessibility of the furin cleavage site. These species-specific residues can be assumed to be real targets of evolution, while other substitutions tend to support them. Moreover, all these residues are parts of experimentally identified epitopes. Therefore, their substitution may affect vaccine efficacy. A higher rate of maintenance was predicted for the RBD than for the furin cleavage site. The accumulation of substitutions reinforces the probability of the multi-host circulation of the virus and emphasizes the enduring evolutionary events.
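The abstract does not state which centrality measure was used. As a hedged illustration of what a centrality analysis of a residue interaction network involves, the sketch below computes normalized degree centrality, chosen here only because it is the simplest measure. The node labels and edges are hypothetical placeholders, not residues or contacts from the study.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected residue network.

    edges is an iterable of (node_u, node_v) pairs; self-loops are ignored.
    Each score is the node's neighbor count divided by (n - 1), where n is
    the number of nodes that appear in at least one edge.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


# Hypothetical toy network with four residues labelled A-D.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
scores = degree_centrality(edges)
for node in sorted(scores):
    print(node, round(scores[node], 3))
```

Residues with high scores contact many others and are natural candidates for closer inspection, which is the general intuition behind using centrality to flag important sites.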
Affiliation(s)
- Mohammad Reza Rahbar
- Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Abolfazl Jahangiri
- Applied Microbiology Research Center, Systems Biology and Poisonings Institute, Baqiyatallah University of Medical Sciences, Tehran, Iran
| | - Saeed Khalili
- Department of Biology Sciences, Shahid Rajaee Teacher Training University, Tehran, Iran
| | - Mahboubeh Zarei
- Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Kamran Mehrabani-Zeinabad
- Department of Biostatistics, Faculty of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Bahman Khalesi
- Department of Research and Production of Poultry Viral Vaccine, Razi Vaccine and Serum Research Institute, Agricultural Research Education and Extension Organization (AREEO), Karaj, Iran
| | - Navid Pourzardosht
- Cellular and Molecular Research Center, Faculty of Medicine, Guilan University of Medical Sciences, Rasht, Iran
- Biochemistry Department, Guilan University of Medical Sciences, Rasht, Iran
| | - Anahita Hessami
- School of Pharmacy, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Navid Nezafat
- Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Saman Sadraei
- Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Manica Negahdaripour
- Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran.
- Department of Pharmaceutical Biotechnology, School of Pharmacy, Shiraz University of Medical Sciences, P.O. Box 71345-1583, Shiraz, Iran.
| |
|
23
|
Li Y, Zhang C, Zheng W, Zhou X, Bell EW, Yu DJ, Zhang Y. Protein inter-residue contact and distance prediction by coupling complementary coevolution features with deep residual networks in CASP14. Proteins 2021; 89:1911-1921. [PMID: 34382712 PMCID: PMC8616805 DOI: 10.1002/prot.26211] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 04/21/2021] [Revised: 07/24/2021] [Accepted: 08/05/2021] [Indexed: 01/12/2023]
Abstract
This article reports and analyzes the results of protein contact and distance prediction by our methods in the 14th Critical Assessment of techniques for protein Structure Prediction (CASP14). A new deep learning-based contact/distance predictor was employed based on the ensemble of two complementary coevolution features coupling with deep residual networks. We also improved our multiple sequence alignment (MSA) generation protocol with wholesale meta-genome sequence databases. On 22 CASP14 free modeling (FM) targets, the proposed model achieved a top-L/5 long-range precision of 63.8% and a mean distance bin error of 1.494. Based on the predicted distance potentials, 11 out of 22 FM targets and all of the 14 FM/template-based modeling (TBM) targets have correctly predicted folds (TM-score >0.5), suggesting that our approach can provide reliable distance potentials for ab initio protein folding.
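The "top-L/5 long-range precision" reported here can be sketched as follows. The sketch assumes the common CASP convention that long-range pairs are separated by at least 24 residues in sequence, and it divides by the number of predictions actually selected; evaluation scripts differ on that last detail. The function name and the toy data are hypothetical, and the toy call uses k=10 only to keep the example small.

```python
def top_lk_precision(pred, true_contacts, seq_len, k=5, min_sep=24):
    """Precision of the top L/k long-range contact predictions.

    pred: list of (i, j, probability) with 0-based residue indices.
    true_contacts: set of (i, j) pairs with i < j that are real contacts.
    seq_len: protein length L; the top max(1, L // k) predictions are scored.
    """
    long_range = [
        (p, min(i, j), max(i, j))
        for i, j, p in pred
        if abs(i - j) >= min_sep
    ]
    long_range.sort(reverse=True)  # highest predicted probability first
    top = long_range[: max(1, seq_len // k)]
    if not top:
        return 0.0
    hits = sum((i, j) in true_contacts for _, i, j in top)
    return hits / len(top)


# Hypothetical predictions for a 30-residue protein.
pred = [(0, 25, 0.9), (1, 27, 0.8), (2, 29, 0.7), (3, 28, 0.6), (4, 6, 0.99)]
true_contacts = {(0, 25), (2, 29), (4, 6)}
precision = top_lk_precision(pred, true_contacts, seq_len=30, k=10)
print(round(precision, 3))
```

The short-range pair (4, 6) is filtered out despite its high probability, so two of the three selected long-range predictions are correct.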
Affiliation(s)
- Yang Li
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Xiaogen Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Eric W. Bell
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| |
|
24
|
Hou M, Peng C, Zhou X, Zhang B, Zhang G. Multi contact-based folding method for de novo protein structure prediction. Brief Bioinform 2021; 23:6445108. [PMID: 34849573 DOI: 10.1093/bib/bbab463] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2021] [Revised: 09/21/2021] [Accepted: 10/10/2021] [Indexed: 11/12/2022] Open
Abstract
Meta contact, which combines different contact maps into one to improve contact prediction accuracy and effectively reduce the noise from a single contact map, is a widely used method. However, protein structure prediction using meta contact cannot fully exploit the information carried by the original contact maps. In this work, a multi contact-based folding method under the evolutionary algorithm framework, MultiCFold, is proposed. In MultiCFold, the full information of the different contact maps is used directly by populations to guide protein structure folding. In addition, noncontact is considered an effective supplement to contact information and can further assist protein folding. MultiCFold is tested on a set of 120 nonredundant proteins, and the average TM-score and average RMSD reach 0.617 and 5.815 Å, respectively. Compared with the meta contact-based method, MetaCFold, the average TM-score and average RMSD improve by 6.62% and 8.82%, respectively. In particular, the introduction of noncontact information increases the average TM-score by 6.30%. Furthermore, MultiCFold is compared with four state-of-the-art methods of CASP13 on the 24 FM targets, and results show that MultiCFold is significantly better than other methods after the full-atom relax procedure.
Affiliation(s)
- Minghua Hou
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Chunxiang Peng
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Xiaogen Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
| | - Biao Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Guijun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| |
|
25
|
Behkamal B, Naghibzadeh M, Saberi MR, Tehranizadeh ZA, Pagnani A, Al Nasr K. Three-Dimensional Graph Matching to Identify Secondary Structure Correspondence of Medium-Resolution Cryo-EM Density Maps. Biomolecules 2021; 11:1773. [PMID: 34944417 PMCID: PMC8698881 DOI: 10.3390/biom11121773] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/30/2021] [Revised: 11/18/2021] [Accepted: 11/20/2021] [Indexed: 01/15/2023] Open
Abstract
Cryo-electron microscopy (cryo-EM) is a structural technique that has played a significant role in protein structure determination in recent years. Compared to the traditional methods of X-ray crystallography and NMR spectroscopy, cryo-EM is capable of producing images of much larger protein complexes. However, cryo-EM reconstructions are in some cases limited to medium resolution (~4-10 Å). In this resolution range, a cryo-EM density map can hardly be used to directly determine the structure of proteins at atomic-level resolution, or even at the level of their amino acid residue backbones. At such a resolution, only the position and orientation of secondary structure elements (SSEs) such as α-helices and β-sheets are observable. Consequently, finding the mapping of the secondary structures of the modeled structure (SSEs-A) to the cryo-EM map (SSEs-C) is one of the primary concerns in cryo-EM modeling. To address this issue, this study proposes a novel automatic computational method to identify SSE correspondence in three-dimensional (3D) space. Initially, through modeling of the target sequence with the aid of extracting highly reliable features from a generated 3D model and map, the SSE matching problem is formulated as a 3D vector matching problem. Afterward, the 3D vector matching problem is transformed into a 3D graph matching problem. Finally, a similarity-based voting algorithm combined with the principle of least conflict (PLC) concept is developed to obtain the SSE correspondence. To evaluate the accuracy of the method, a testing set of 25 experimental and simulated maps with a maximum of 65 SSEs is selected. Comparative studies are also conducted to demonstrate the superiority of the proposed method over some state-of-the-art techniques. The results demonstrate that the method is efficient, robust, and works well in the presence of errors in the predicted secondary structures of the cryo-EM images.
Affiliation(s)
- Bahareh Behkamal
- Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad 9177948974, Iran
| | - Mahmoud Naghibzadeh
- Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad 9177948974, Iran
| | - Mohammad Reza Saberi
- Medicinal Chemistry Department, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad 9177899191, Iran
- Bioinformatics Research Group, Mashhad University of Medical Sciences, Mashhad 9177899191, Iran
| | - Zeinab Amiri Tehranizadeh
- Medicinal Chemistry Department, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad 9177899191, Iran
| | - Andrea Pagnani
- Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
- Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo, Italy
- INFN, Sezione di Torino, I-10125 Torino, Italy
| | - Kamal Al Nasr
- Department of Computer Science, Tennessee State University, Nashville, TN 37209, USA
| |
|
26
|
Wei H, Zhao Z, Luo R. Machine-Learned Molecular Surface and Its Application to Implicit Solvent Simulations. J Chem Theory Comput 2021; 17:6214-6224. [PMID: 34516109 DOI: 10.1021/acs.jctc.1c00492] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
Implicit solvent models, such as Poisson-Boltzmann models, play important roles in computational studies of biomolecules. A vital step in almost all implicit solvent models is to determine the solvent-solute interface, and the solvent excluded surface (SES) is the most widely used interface definition in these models. However, classical algorithms used for computing SES are geometry-based, so that they are neither suitable for parallel implementations nor convenient for obtaining surface derivatives. To address the limitations, we explored a machine learning strategy to obtain a level set formulation for the SES. The training process was conducted in three steps, eventually leading to a model with over 95% agreement with the classical SES. Visualization of tested molecular surfaces shows that the machine-learned SES overlaps with the classical SES in almost all situations. Further analyses show that the machine-learned SES is incredibly stable in terms of rotational variation of tested molecules. Our timing analysis shows that the machine-learned SES is roughly 2.5 times as efficient as the classical SES routine implemented in Amber/PBSA on a tested central processing unit (CPU) platform. We expect further performance gain on massively parallel platforms such as graphics processing units (GPUs) given the ease in converting the machine-learned SES to a parallel procedure. We also implemented the machine-learned SES into the Amber/PBSA program to study its performance on reaction field energy calculation. The analysis shows that the two sets of reaction field energies are highly consistent with a 1% deviation on average. Given its level set formulation, we expect the machine-learned SES to be applied in molecular simulations that require either surface derivatives or high efficiency on parallel computing platforms.
Affiliation(s)
- Haixin Wei
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
| | - Zekai Zhao
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
| | - Ray Luo
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
| |
|
27
|
Laine E, Eismann S, Elofsson A, Grudinin S. Protein sequence-to-structure learning: Is this the end(-to-end revolution)? Proteins 2021; 89:1770-1786. [PMID: 34519095 DOI: 10.1002/prot.26235] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 05/05/2021] [Revised: 08/16/2021] [Accepted: 09/03/2021] [Indexed: 01/08/2023]
Abstract
The potential of deep learning has been recognized in the protein structure prediction community for some time, and became indisputable after CASP13. In CASP14, deep learning has boosted the field to unanticipated levels reaching near-experimental accuracy. This success comes from advances transferred from other machine learning areas, as well as methods specifically designed to deal with protein sequences and structures, and their abstractions. Novel emerging approaches include (i) geometric learning, that is, learning on representations such as graphs, three-dimensional (3D) Voronoi tessellations, and point clouds; (ii) pretrained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of large meta-genome databases; (v) combinations of protein representations; and (vi) finally truly end-to-end architectures, that is, differentiable models starting from a sequence and returning a 3D structure. Here, we provide an overview and our opinion of the novel deep learning approaches developed in the last 2 years and widely used in CASP14.
Affiliation(s)
- Elodie Laine
- Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France
| | - Stephan Eismann
- Department of Computer Science and Applied Physics, Stanford University, Stanford, California, USA
| | - Arne Elofsson
- Department of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, Solna, Sweden
| | - Sergei Grudinin
- Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, Grenoble, France
| |
|
28
|
Hong Z, Liu J, Chen Y. An interpretable machine learning method for homo-trimeric protein interface residue-residue interaction prediction. Biophys Chem 2021; 278:106666. [PMID: 34418678 DOI: 10.1016/j.bpc.2021.106666] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/25/2021] [Revised: 08/09/2021] [Accepted: 08/09/2021] [Indexed: 12/29/2022]
Abstract
Protein-protein interactions play an important role in life activities. A more fine-grained analysis, at the level of residues and atoms, helps us better understand the mechanism of inter-protein interaction and supports drug design. Developing efficient computational methods that reduce trial and error and assist experimental researchers in determining complex structures is an ongoing effort in the field. Trimeric protein interfaces, especially those of homotrimers, have rarely been studied. In this paper, we propose an interpretable machine learning method for predicting homo-trimeric protein interface residue pairs. Structure, sequence, and physicochemical information are integrated as input features for model training. A graph model is used to represent intra-protein spatial information. Matrix factorization captures the interactions between different features. A kernel function is designed to automatically acquire information adjacent to the target residue pairs. The accuracy reaches 54.5% on an independent test set. Sequence and structure alignments exhibit the model's ability of self-study. Our model indicates the biological significance between sequence and structure, and could help reduce trial and error in protein complex determination, protein-protein docking, etc.
SIGNIFICANCE: Protein complex structures are significant for understanding protein function and promising for functional protein design. With increasing data, some computational tools have been developed for protein complex residue contact prediction, which is one of the most significant steps in complex structure prediction. For homo-trimeric proteins, however, sequence-based deep learning predictors are infeasible for homologous sequences, and the black-box nature of the algorithms prevents us from understanding each step of the operation. We therefore propose an interpretable machine learning method for homo-trimeric protein interface residue-residue interaction prediction, and the predictor shows good performance. Our work provides a computational aid for determining the interface residue pairs of homo-trimeric proteins, which will be further verified by wet experiments, and supports downstream work such as protein-protein docking, protein complex structure prediction, and drug design.
Affiliation(s)
- Zhonghua Hong
- Jiaxing Hospital of Traditional Chinese Medicine, Jiaxing University, Jiaxing 314001, PR China.
| | - Jiale Liu
- Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, PR China
| | - Yinggao Chen
- Shantou Central Hospital, Shantou 515041, PR China.
| |
|
29
|
Pearce R, Zhang Y. Toward the solution of the protein structure prediction problem. J Biol Chem 2021; 297:100870. [PMID: 34119522 PMCID: PMC8254035 DOI: 10.1016/j.jbc.2021.100870] [Citation(s) in RCA: 63] [Impact Index Per Article: 15.8] [Received: 04/20/2021] [Revised: 06/07/2021] [Accepted: 06/09/2021] [Indexed: 11/20/2022] Open
Abstract
Since Anfinsen demonstrated that the information encoded in a protein's amino acid sequence determines its structure in 1973, solving the protein structure prediction problem has been the Holy Grail of structural biology. The goal of protein structure prediction approaches is to utilize computational modeling to determine the spatial location of every atom in a protein molecule starting from only its amino acid sequence. Depending on whether homologous structures can be found in the Protein Data Bank (PDB), structure prediction methods have been historically categorized as template-based modeling (TBM) or template-free modeling (FM) approaches. Until recently, TBM has been the most reliable approach to predicting protein structures, and in the absence of reliable templates, the modeling accuracy sharply declines. Nevertheless, the results of the most recent community-wide assessment of protein structure prediction experiment (CASP14) have demonstrated that the protein structure prediction problem can be largely solved through the use of end-to-end deep machine learning techniques, where correct folds could be built for nearly all single-domain proteins without using the PDB templates. Critically, the model quality exhibited little correlation with the quality of available template structures, as well as the number of sequence homologs detected for a given target protein. Thus, the implementation of deep-learning techniques has essentially broken through the 50-year-old modeling border between TBM and FM approaches and has made the success of high-resolution structure prediction significantly less dependent on template availability in the PDB library.
Affiliation(s)
- Robin Pearce
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan, USA
|
30
|
Reza MS, Zhang H, Hossain MT, Jin L, Feng S, Wei Y. COMTOP: Protein Residue-Residue Contact Prediction through Mixed Integer Linear Optimization. Membranes 2021; 11:membranes11070503. [PMID: 34209399 PMCID: PMC8305966 DOI: 10.3390/membranes11070503] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/25/2021] [Revised: 06/24/2021] [Accepted: 06/25/2021] [Indexed: 11/17/2022]
Abstract
Protein contact prediction helps reconstruct the tertiary structure that largely determines a protein’s function; therefore, contact prediction from sequence is an important problem. There has been exciting recent progress on this problem, but many existing methods still have low prediction accuracy. In this paper, we present a new mixed integer linear programming (MILP)-based consensus method: a Consensus scheme based On a Mixed integer linear opTimization method for prOtein contact Prediction (COMTOP). The MILP-based consensus method combines the strengths of seven selected protein contact prediction methods (CCMpred, EVfold, DeepCov, NNcon, PconsC4, plmDCA, and PSICOV) by optimizing the number of correctly predicted contacts, thereby achieving better prediction accuracy. The proposed hybrid protein residue–residue contact prediction scheme was tested on four independent test sets. For 239 highly non-redundant proteins, the method showed a prediction accuracy of 59.68%, 70.79%, 78.86%, 89.04%, 94.51%, and 97.35% for top-5L, top-3L, top-2L, top-L, top-L/2, and top-L/5 contacts, respectively. When tested on the CASP13 and CASP14 test sets, the proposed method obtained accuracies of 75.91% and 77.49% for top-L/5 predictions, respectively. COMTOP was further tested on 57 non-redundant α-helical transmembrane proteins and achieved prediction accuracies of 64.34% and 73.91% for top-L/2 and top-L/5 predictions, respectively. For all test datasets, the improvement of COMTOP in accuracy over the seven individual methods increased with the number of predicted contacts. For example, the improvement was much larger for large numbers of predicted contacts (such as top-5L and top-3L) than for small numbers (such as top-L/2 and top-L/5). The results and analysis demonstrate that COMTOP can significantly improve on the performance of the individual methods and is more robust across different types of test sets. COMTOP also showed better or comparable predictions when compared with state-of-the-art predictors.
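The top-5L to top-L/5 accuracies quoted in this abstract follow the usual contact-prediction convention: rank the predicted residue pairs by score, keep the best k·L of them (L being the sequence length), and report the fraction that are true contacts. The following is a minimal sketch of that metric, not the COMTOP evaluation code; the function name, the dictionary-based input, and the default minimum sequence separation are assumptions made for illustration.

```python
def top_k_precision(scores, true_contacts, seq_len, factor=1.0, min_sep=6):
    """Precision of the top (factor * seq_len) predicted contacts.

    scores        -- dict mapping (i, j) residue pairs (i < j) to a predicted score
    true_contacts -- set of (i, j) pairs that are contacts in the native structure
    factor        -- 5 for top-5L, 1 for top-L, 0.5 for top-L/2, 0.2 for top-L/5
    min_sep       -- minimum sequence separation j - i (assumed default)
    """
    # Discard pairs that are too close along the sequence.
    candidates = [(pair, s) for pair, s in scores.items()
                  if pair[1] - pair[0] >= min_sep]
    # Highest-scoring pairs first.
    candidates.sort(key=lambda item: item[1], reverse=True)
    n_keep = max(1, int(seq_len * factor))
    top = candidates[:n_keep]
    if not top:
        return 0.0
    hits = sum(1 for pair, _ in top if pair in true_contacts)
    return hits / len(top)


# Hypothetical toy input for a 10-residue protein.
toy_scores = {(0, 8): 0.9, (1, 9): 0.8, (0, 7): 0.7, (2, 9): 0.6, (0, 1): 0.99}
toy_native = {(0, 8), (0, 7)}
toy_precision = top_k_precision(toy_scores, toy_native, seq_len=10, factor=0.2)
```

Because fewer pairs are kept as the factor shrinks, top-L/5 precision is normally higher than top-5L precision, which matches the ordering of the percentages reported above.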
Affiliation(s)
- Md. Selim Reza
  - School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  - Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Huiling Zhang
  - School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  - Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Md. Tofazzal Hossain
  - School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  - Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Langxi Jin
  - Department of Computer Science and Technology, School of Computer Science and Technology, Harbin University of Science and Technology, 52 Xuefu Road, Nangang District, Harbin 150080, China
- Shengzhong Feng
  - Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Yanjie Wei
  - School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  - Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
  - Correspondence:
|
31
|
Pérez-Vargas J, Teppa E, Amirache F, Boson B, Pereira de Oliveira R, Combet C, Böckmann A, Fusil F, Freitas N, Carbone A, Cosset FL. A fusion peptide in preS1 and the human protein disulfide isomerase ERp57 are involved in hepatitis B virus membrane fusion process. eLife 2021; 10:64507. [PMID: 34190687 PMCID: PMC8282342 DOI: 10.7554/elife.64507] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/31/2020] [Accepted: 06/29/2021] [Indexed: 12/13/2022] Open
Abstract
Cell entry of enveloped viruses relies on the fusion between the viral and plasma or endosomal membranes, through a mechanism that is triggered by a cellular signal. Here we used a combination of computational and experimental approaches to unravel the main determinants of the hepatitis B virus (HBV) membrane fusion process. We discovered that ERp57 is a host factor critically involved in triggering HBV fusion and infection. Then, through modeling approaches, we uncovered a putative allosteric cross-strand disulfide (CSD) bond in the HBV S glycoprotein and demonstrated that its stabilization could prevent membrane fusion. Finally, we identified and characterized a potential fusion peptide in the preS1 domain of the HBV L glycoprotein. These results underscore a membrane fusion mechanism that could be triggered by ERp57, allowing a thiol/disulfide exchange reaction to occur and regulate isomerization of a critical CSD, which ultimately leads to exposure of the fusion peptide.
Affiliation(s)
- Jimena Pérez-Vargas
  - CIRI - Centre International de Recherche en Infectiologie, Univ Lyon, Université Claude Bernard Lyon 1, Inserm, U1111, CNRS, UMR5308, ENS Lyon, Lyon, France
- Elin Teppa
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB) - UMR 7238, Paris, France; Sorbonne Université, Institut des Sciences du Calcul et des Données (ISCD), Paris, France
- Fouzia Amirache
  - CIRI - Centre International de Recherche en Infectiologie, Univ Lyon, Université Claude Bernard Lyon 1, Inserm, U1111, CNRS, UMR5308, ENS Lyon, Lyon, France
- Bertrand Boson
  - CIRI - Centre International de Recherche en Infectiologie, Univ Lyon, Université Claude Bernard Lyon 1, Inserm, U1111, CNRS, UMR5308, ENS Lyon, Lyon, France
- Rémi Pereira de Oliveira
  - CIRI - Centre International de Recherche en Infectiologie, Univ Lyon, Université Claude Bernard Lyon 1, Inserm, U1111, CNRS, UMR5308, ENS Lyon, Lyon, France
- Christophe Combet
  - Cancer Research Center of Lyon (CRCL), UMR Inserm 1052 - CNRS 5286 - Université Lyon 1 - Centre Léon Bérard, Lyon, France
- Anja Böckmann
  - Molecular Microbiology and Structural Biochemistry, UMR5086 CNRS-Université Lyon 1, Lyon, France
- Floriane Fusil
  - CIRI - Centre International de Recherche en Infectiologie, Univ Lyon, Université Claude Bernard Lyon 1, Inserm, U1111, CNRS, UMR5308, ENS Lyon, Lyon, France
- Natalia Freitas
  - CIRI - Centre International de Recherche en Infectiologie, Univ Lyon, Université Claude Bernard Lyon 1, Inserm, U1111, CNRS, UMR5308, ENS Lyon, Lyon, France
- Alessandra Carbone
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB) - UMR 7238, Paris, France
- François-Loïc Cosset
  - CIRI - Centre International de Recherche en Infectiologie, Univ Lyon, Université Claude Bernard Lyon 1, Inserm, U1111, CNRS, UMR5308, ENS Lyon, Lyon, France
|
32
|
Billings WM, Morris CJ, Della Corte D. The whole is greater than its parts: ensembling improves protein contact prediction. Sci Rep 2021; 11:8039. [PMID: 33850214 PMCID: PMC8044223 DOI: 10.1038/s41598-021-87524-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/18/2021] [Accepted: 03/29/2021] [Indexed: 11/30/2022] Open
Abstract
The prediction of amino acid contacts from protein sequence is an important problem, as contact prediction is a vital step towards the prediction of folded protein structures. We propose that a powerful concept from deep learning, called ensembling, can increase the accuracy of protein contact predictions by combining the outputs of different neural network models. We show that ensembling the predictions made by different groups at the recent Critical Assessment of Protein Structure Prediction (CASP13) outperforms all individual groups. Further, we show that contacts derived from the distance predictions of three additional deep neural networks (AlphaFold, trRosetta, and ProSPr) can be substantially improved by ensembling all three networks. We also show that ensembling these recent deep neural networks with the best CASP13 group creates a superior contact prediction tool. Finally, we demonstrate that two ensembled networks can successfully differentiate between the folds of two highly homologous sequences. To build further on these findings, we propose the creation of a better protein contact benchmark set and additional open-source contact prediction methods.
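The abstract does not say how the individual predictions are combined, so the following is only a sketch of the simplest form of ensembling: an element-wise (optionally weighted) average of several L x L contact-probability maps. The function name, the nested-list representation, and the weighting scheme are assumptions for illustration and should not be read as the authors' procedure.

```python
def ensemble_contact_maps(maps, weights=None):
    """Average several L x L contact-probability maps element-wise.

    maps    -- list of square matrices (lists of lists) of equal size
    weights -- optional per-model weights; equal weighting if omitted
    """
    if not maps:
        raise ValueError("need at least one contact map")
    n = len(maps[0])
    if any(len(m) != n or any(len(row) != n for row in m) for m in maps):
        raise ValueError("all maps must be square and of the same size")
    if weights is None:
        weights = [1.0] * len(maps)
    if len(weights) != len(maps):
        raise ValueError("need exactly one weight per map")
    total = sum(weights)
    # Weighted mean of the per-model probabilities for every residue pair.
    return [[sum(w * m[i][j] for w, m in zip(weights, maps)) / total
             for j in range(n)]
            for i in range(n)]


# Two hypothetical predictors disagreeing about the single pair (0, 1).
model_a = [[0.0, 0.8], [0.8, 0.0]]
model_b = [[0.0, 0.4], [0.4, 0.0]]
consensus = ensemble_contact_maps([model_a, model_b])
```

Averaging tends to help when the models make partly independent errors; weights could be set from validation precision, which is again an assumption rather than something stated in the abstract.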
Affiliation(s)
- Wendy M Billings
  - Department of Physics and Astronomy, Brigham Young University, Provo, UT, USA
- Connor J Morris
  - Department of Physics and Astronomy, Brigham Young University, Provo, UT, USA
- Dennis Della Corte
  - Department of Physics and Astronomy, Brigham Young University, Provo, UT, USA
|
33
|
Deducing high-accuracy protein contact-maps from a triplet of coevolutionary matrices through deep residual convolutional networks. PLoS Comput Biol 2021; 17:e1008865. [PMID: 33770072 PMCID: PMC8026059 DOI: 10.1371/journal.pcbi.1008865] [Citation(s) in RCA: 55] [Impact Index Per Article: 13.8] [Received: 09/21/2020] [Revised: 04/07/2021] [Accepted: 03/10/2021] [Indexed: 12/24/2022] Open
Abstract
The topology of protein folds can be specified by the inter-residue contact-maps and accurate contact-map prediction can help ab initio structure folding. We developed TripletRes to deduce protein contact-maps from discretized distance profiles by end-to-end training of deep residual neural-networks. Compared to previous approaches, the major advantage of TripletRes is in its ability to learn and directly fuse a triplet of coevolutionary matrices extracted from the whole-genome and metagenome databases and therefore minimize the information loss during the course of contact model training. TripletRes was tested on a large set of 245 non-homologous proteins from CASP 11&12 and CAMEO experiments and outperformed other top methods from CASP12 by at least 58.4% for the CASP 11&12 targets and 44.4% for the CAMEO targets in the top-L long-range contact precision. On the 31 FM targets from the latest CASP13 challenge, TripletRes achieved the highest precision (71.6%) for the top-L/5 long-range contact predictions. It was also shown that a simple re-training of the TripletRes model with more proteins can lead to further improvement with precisions comparable to state-of-the-art methods developed after CASP13. These results demonstrate a novel efficient approach to extend the power of deep convolutional networks for high-accuracy medium- and long-range protein contact-map predictions starting from primary sequences, which are critical for constructing 3D structure of proteins that lack homologous templates in the PDB library.

Ab initio protein folding has been a major unsolved problem in computational biology for more than half a century. Recent community-wide Critical Assessment of Structure Prediction (CASP) experiments have witnessed exciting progress on ab initio structure prediction, which was mainly powered by the boosting of contact-map prediction as the latter can be used as constraints to guide ab initio folding simulations. In this work, we proposed a new open-source deep-learning architecture, TripletRes, built on the residual convolutional neural networks for high-accuracy contact prediction. The large-scale benchmark and blind test results demonstrate competitive performance of the proposed methods to other top approaches in predicting medium- and long-range contact-maps that are critical for guiding protein folding simulations. Detailed data analyses showed that the major advantage of TripletRes lies in the unique protocol to fuse multiple evolutionary feature matrices which are directly extracted from whole-genome and metagenome databases and therefore minimize the information loss during the contact model training.
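The abstract mentions deducing contact-maps from discretized distance profiles without giving details. One common way to turn a per-pair distance histogram into a contact probability, shown here as a hedged sketch rather than as the TripletRes procedure, is to sum the probability mass of all bins that fall at or below the contact cutoff. The 8 Å cutoff is the conventional contact definition in CASP-style evaluations; the bin edges and the function name are hypothetical.

```python
def contact_probability(bin_probs, bin_upper_edges, cutoff=8.0):
    """Sum the probability mass of distance bins lying at or below the cutoff.

    bin_probs       -- predicted probabilities for each distance bin (sum to 1)
    bin_upper_edges -- upper edge of each bin in angstroms (hypothetical binning)
    cutoff          -- contact distance threshold in angstroms
    """
    if len(bin_probs) != len(bin_upper_edges):
        raise ValueError("need exactly one upper edge per bin")
    return sum(p for p, edge in zip(bin_probs, bin_upper_edges) if edge <= cutoff)


# Hypothetical four-bin profile: <4, 4-6, 6-8, and >8 angstroms.
profile = [0.1, 0.3, 0.2, 0.4]
edges = [4.0, 6.0, 8.0, float("inf")]
p_contact = contact_probability(profile, edges)
```

Applying this to every residue pair yields an L x L contact-probability map that can be ranked and scored with the usual top-L and top-L/5 precision measures.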
|
34
|
Gao W, Mahajan SP, Sulam J, Gray JJ. Deep Learning in Protein Structural Modeling and Design. Patterns (New York, N.Y.) 2020; 1:100142. [PMID: 33336200 PMCID: PMC7733882 DOI: 10.1016/j.patter.2020.100142] [Citation(s) in RCA: 100] [Impact Index Per Article: 20.0] [Indexed: 12/13/2022]
Abstract
Deep learning is catalyzing a scientific revolution fueled by big data, accessible toolkits, and powerful computational resources, impacting many fields, including protein structural modeling. Protein structural modeling, such as predicting structure from amino acid sequence and evolutionary information, designing proteins toward desirable functionality, or predicting properties or behavior of a protein, is critical to understand and engineer biological systems at the molecular level. In this review, we summarize the recent advances in applying deep learning techniques to tackle problems in protein structural modeling and design. We dissect the emerging approaches using deep learning techniques for protein structural modeling and discuss advances and challenges that must be addressed. We argue for the central importance of structure, following the "sequence → structure → function" paradigm. This review is directed to help both computational biologists to gain familiarity with the deep learning methods applied in protein modeling, and computer scientists to gain perspective on the biologically meaningful problems that may benefit from deep learning techniques.
Affiliation(s)
- Wenhao Gao
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Sai Pooja Mahajan
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Jeremias Sulam
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Jeffrey J. Gray
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
|
35
|
Kamerzell TJ, Middaugh CR. Prediction Machines: Applied Machine Learning for Therapeutic Protein Design and Development. J Pharm Sci 2020; 110:665-681. [PMID: 33278409 DOI: 10.1016/j.xphs.2020.11.034] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 10/29/2020] [Revised: 11/27/2020] [Accepted: 11/27/2020] [Indexed: 12/11/2022]
Abstract
The rapid growth in technological advances and in the quantity of scientific data over the past decade has led to several challenges, including data storage and analysis. Accurate models of complex datasets were previously difficult to develop and interpret. However, improvements in machine learning algorithms have since enabled unparalleled classification and prediction capabilities. Machine learning is now applied throughout diverse industries because of its ease of use and interpretability. In this review, we describe popular machine learning algorithms and highlight their application in pharmaceutical protein development. Machine learning models have now been applied to better understand the nonlinear concentration-dependent viscosity of protein solutions, predict protein oxidation and deamidation rates, classify sub-visible particles, and compare the physical stability of proteins. We also applied several machine learning algorithms to previously published data and describe models with improved predictions and classification. The authors hope that this review can serve as a resource for others and encourage continued application of machine learning algorithms to problems in pharmaceutical protein development.
Affiliation(s)
- Tim J Kamerzell
  - Department of Pharmaceutical Chemistry, The University of Kansas, Lawrence, KS, USA; Division of Internal Medicine, HCA MidWest Health, Overland Park, KS, USA
- C Russell Middaugh
  - Department of Pharmaceutical Chemistry, The University of Kansas, Lawrence, KS, USA
|
36
|
Chasing coevolutionary signals in intrinsically disordered proteins complexes. Sci Rep 2020; 10:17962. [PMID: 33087759 PMCID: PMC7578644 DOI: 10.1038/s41598-020-74791-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/14/2020] [Accepted: 08/27/2020] [Indexed: 11/30/2022] Open
Abstract
Intrinsically disordered proteins/regions (IDPs/IDRs) are crucial components of the cell: they are highly abundant and participate ubiquitously in a wide range of biological functions, such as regulatory processes and cell signaling. Many of their important functions rely on protein interactions, by which they trigger or modulate different pathways. Sequence covariation, a powerful tool for protein contact prediction, has been applied successfully to predict protein structure and to identify protein–protein interactions, mostly of globular proteins. IDPs/IDRs also mediate a plethora of protein–protein interactions, highlighting the importance of addressing sequence covariation-based inter-protein contact prediction for this class of proteins. Despite their importance, a systematic approach to analyzing the covariation phenomena of intrinsically disordered proteins and their complexes is still missing. Here we carry out a comprehensive critical assessment of coevolution-based contact prediction in IDP/IDR complexes and detail the challenges and possible limitations that emerge from their analysis. We found that the coevolutionary signal is faint in most of the complexes of disordered proteins but positively correlates with the interface size and binding affinity between partners. In addition, we discuss the state-of-the-art methodology through biological interpretation of the results, formulate evaluation guidelines, and suggest future directions of development for the field.
|
37
|
Liu J, Zhou XG, Zhang Y, Zhang GJ. CGLFold: a contact-assisted de novo protein structure prediction using global exploration and loop perturbation sampling algorithm. Bioinformatics 2020; 36:2443-2450. [PMID: 31860059 DOI: 10.1093/bioinformatics/btz943] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 07/03/2019] [Revised: 12/10/2019] [Accepted: 12/18/2019] [Indexed: 12/27/2022] Open
Abstract
MOTIVATION: Regions that connect secondary structure elements in a protein are known as loops; a slight change in a loop can have a dramatic effect on the entire topology. This study investigates whether the accuracy of protein structure prediction can be improved using a loop-specific sampling strategy.

RESULTS: A novel de novo protein structure prediction method that combines global exploration and loop perturbation is proposed in this study. In the global exploration phase, fragment recombination and assembly are used to explore the massive conformational space and generate native-like topology. In the loop perturbation phase, a loop-specific local perturbation model is designed to improve the accuracy of the conformation and is solved by a differential evolution algorithm. These two phases enable cooperation between global exploration and local exploitation. The filtered contact information is used to construct the conformation selection model for guiding the sampling. The proposed CGLFold is tested on 145 benchmark proteins, 14 free modeling (FM) targets of CASP13 and 29 FM targets of CASP12. The experimental results show that the loop-specific local perturbation can increase the structure diversity and success rate of conformational update and gradually improve conformation accuracy. CGLFold obtains models with a template modeling score ≥ 0.5 on 95 standard test proteins, 7 FM targets of CASP13 and 9 FM targets of CASP12.

AVAILABILITY AND IMPLEMENTATION: The source code and executable versions are freely available at https://github.com/iobio-zjut/CGLFold.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jun Liu
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Xiao-Gen Zhou
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Gui-Jun Zhang
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
|
38
|
Forsberg BO, Aibara S, Howard RJ, Mortezaei N, Lindahl E. Arrangement and symmetry of the fungal E3BP-containing core of the pyruvate dehydrogenase complex. Nat Commun 2020; 11:4667. [PMID: 32938938 PMCID: PMC7494870 DOI: 10.1038/s41467-020-18401-z] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 04/09/2020] [Accepted: 08/20/2020] [Indexed: 11/21/2022] Open
Abstract
The pyruvate dehydrogenase complex (PDC) is a multienzyme complex central to aerobic respiration, connecting glycolysis to mitochondrial oxidation of pyruvate. Similar to the E3-binding protein (E3BP) of mammalian PDC, protein X (PX) selectively recruits E3 to the fungal PDC, but its divergent sequence suggests a distinct structural mechanism. Here, we report reconstructions of PDC from the filamentous fungus Neurospora crassa by cryo-electron microscopy, where we find PX interior to the PDC core as opposed to substituting E2 core subunits as in mammals. Steric occlusion limits PX binding, resulting in predominantly tetrahedral symmetry, explaining previous observations in Saccharomyces cerevisiae. The PX-binding site is conserved in (and specific to) fungi, and complements possible C-terminal binding motifs in PX that are absent in mammalian E3BP. Consideration of multiple symmetries thus reveals a differential structural basis for E3BP-like function in fungal PDC.

The pyruvate dehydrogenase complex (PDC) is a multienzyme complex connecting glycolysis to mitochondrial oxidation of pyruvate. Cryo-EM analysis of PDC from Neurospora crassa reveals localization of fungi-specific protein X (PX) and confirms that it functions like the mammalian E3BP, recruiting the E3 component of PDC.
Affiliation(s)
- B O Forsberg
  - Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, 17165, Solna, Sweden
- S Aibara
  - Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, 17165, Solna, Sweden; Department of Molecular Biology, Max Planck Institute for Biophysical Chemistry, 37077, Göttingen, Germany
- R J Howard
  - Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, 17165, Solna, Sweden
- N Mortezaei
  - Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, 17165, Solna, Sweden; Vironova AB, 11330, Stockholm, Sweden
- E Lindahl
  - Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, 17165, Solna, Sweden; Department of Applied Physics, Swedish eScience Research Center, KTH Royal Institute of Technology, 17168, Solna, Sweden
|
39
|
Augestad EH, Castelli M, Clementi N, Ströh LJ, Krey T, Burioni R, Mancini N, Bukh J, Prentoe J. Global and local envelope protein dynamics of hepatitis C virus determine broad antibody sensitivity. Science Advances 2020; 6:eabb5938. [PMID: 32923643 PMCID: PMC7449684 DOI: 10.1126/sciadv.abb5938] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 03/05/2020] [Accepted: 07/13/2020] [Indexed: 05/03/2023]
Abstract
Broad antibody sensitivity differences of hepatitis C virus (HCV) isolates and their ability to persist in the presence of neutralizing antibodies (NAbs) remain poorly understood. Here, we show that polymorphisms within glycoprotein E2, including hypervariable region 1 (HVR1) and antigenic site 412 (AS412), broadly affect NAb sensitivity by shifting global envelope protein conformation dynamics between theoretical "closed," neutralization-resistant and "open," neutralization-sensitive states. The conformational space of AS412 was skewed toward β-hairpin-like conformations in closed states, which also depended on HVR1, assigning function to these enigmatic E2 regions. Scavenger receptor class B, type I entry dependency of HCV was associated with NAb resistance and correlated perfectly with decreased virus propensity to interact with HCV co-receptor CD81, indicating that decreased NAb sensitivity resulted in a more complex entry pathway. This link between global E1/E2 states and functionally distinct AS412 conformations has important implications for targeting AS412 in rational HCV vaccine designs.
Affiliation(s)
- Elias H. Augestad
  - Copenhagen Hepatitis C Program (CO-HEP), Department of Infectious Diseases, Hvidovre Hospital, and Department of Immunology and Microbiology, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
- Matteo Castelli
  - Laboratory of Microbiology and Virology, Università “Vita-Salute” San Raffaele, Milano, 20132, Italy
- Nicola Clementi
  - Laboratory of Microbiology and Virology, Università “Vita-Salute” San Raffaele, Milano, 20132, Italy
- Luisa J. Ströh
  - Institute of Virology, Hannover Medical School, Carl-Neuberg-Str. 1, Hannover 30625, Germany
- Thomas Krey
  - Institute of Virology, Hannover Medical School, Carl-Neuberg-Str. 1, Hannover 30625, Germany
  - German Center for Infection Research (DZIF), partner sites Hannover-Braunschweig and Hamburg-Lübeck-Borstel-Riems, Germany
  - Center of Structural and Cell Biology in Medicine, Institute of Biochemistry, University of Luebeck, Ratzeburger Allee 160, 23562 Luebeck, Germany
  - Cluster of Excellence RESIST (EXC 2155), Hannover Medical School, Carl-Neuberg-Str. 1, 30625 Hannover, Germany
  - Centre for Structural Systems Biology (CSSB), Notkestraße 85, 22607 Hamburg, Germany
- Roberto Burioni
  - Laboratory of Microbiology and Virology, Università “Vita-Salute” San Raffaele, Milano, 20132, Italy
- Nicasio Mancini
  - Laboratory of Microbiology and Virology, Università “Vita-Salute” San Raffaele, Milano, 20132, Italy
- Jens Bukh
  - Copenhagen Hepatitis C Program (CO-HEP), Department of Infectious Diseases, Hvidovre Hospital, and Department of Immunology and Microbiology, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
- Jannick Prentoe
  - Copenhagen Hepatitis C Program (CO-HEP), Department of Infectious Diseases, Hvidovre Hospital, and Department of Immunology and Microbiology, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
  - Corresponding author.
|
40
|
Ge R, Feng G, Jing X, Zhang R, Wang P, Wu Q. EnACP: An Ensemble Learning Model for Identification of Anticancer Peptides. Front Genet 2020; 11:760. [PMID: 32903636 PMCID: PMC7438906 DOI: 10.3389/fgene.2020.00760] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 04/20/2020] [Accepted: 06/26/2020] [Indexed: 12/13/2022] Open
Abstract
As cancer remains one of the main threats to human life, developing efficient cancer treatments is urgent. Anticancer peptides, which could overcome the significant side effects and poor results of traditional cancer treatments, have become a promising alternative in recent years. However, identifying anticancer peptides by experimental methods is time consuming and resource intensive, so it is of great significance to develop effective computational tools that quickly and accurately identify potential anticancer peptides from amino acid sequences. For most current computational methods, feature representation plays a key role in their final success. This study proposes a novel, fast, and accurate approach to identifying anticancer peptides using diversified feature representations and an ensemble learning method. For the feature representations, the information is encoded from multidimensional feature spaces, including sequence composition, sequence order, physicochemical properties, etc. To better model the potential relationships of peptides, multiple ensemble classifiers (LightGBMs) are first applied to the different feature sets. The resulting outputs are then used as inputs to a support vector machine classifier, which effectively identifies anticancer peptides. Experimental results on cross validation and independent test sets demonstrate that our method achieves better or comparable performance compared with other state-of-the-art methods.
Affiliation(s)
- Ruiquan Ge
  - Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
- Guanwen Feng
  - Xi'an Key Laboratory of Big Data and Intelligent Vision, School of Computer Science and Technology, Xidian University, Xi'an, China
- Xiaoyang Jing
  - Toyota Technological Institute at Chicago, Chicago, IL, United States
- Renfeng Zhang
  - Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, China
- Pu Wang
  - Computer School, Hubei University of Arts and Science, Xiangyang, China
- Qing Wu
  - Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
|
41
|
Li Y, Hu J, Zhang C, Yu DJ, Zhang Y. ResPRE: high-accuracy protein contact prediction by coupling precision matrix with deep residual neural networks. Bioinformatics 2020; 35:4647-4655. [PMID: 31070716 DOI: 10.1093/bioinformatics/btz291] [Citation(s) in RCA: 109] [Impact Index Per Article: 21.8] [Received: 11/18/2018] [Revised: 03/18/2019] [Accepted: 04/17/2019] [Indexed: 12/20/2022] Open
Abstract
MOTIVATION: Contact-map of a protein sequence dictates the global topology of structural fold. Accurate prediction of the contact-map is thus essential to protein 3D structure prediction, which is particularly useful for the protein sequences that do not have close homology templates in the Protein Data Bank. RESULTS: We developed a new method, ResPRE, to predict residue-level protein contacts using inverse covariance matrix (or precision matrix) of multiple sequence alignments (MSAs) through deep residual convolutional neural network training. The approach was tested on a set of 158 non-homologous proteins collected from the CASP experiments and achieved an average accuracy of 50.6% in the top-L long-range contact prediction with L being the sequence length, which is 11.7% higher than the best of other state-of-the-art approaches ranging from coevolution coupling analysis to deep neural network training. Detailed data analyses show that the major advantage of ResPRE lies at the utilization of precision matrix that helps rule out transitional noises of contact-maps compared with the previously used covariance matrix. Meanwhile, the residual network with parallel shortcut layer connections increases the learning ability of deep neural network training. It was also found that appropriate collection of MSAs can further improve the accuracy of final contact-map predictions. The standalone package and online server of ResPRE are made freely available, which should bring important impact on protein structure and function modeling studies in particular for the distant- and non-homology protein targets. AVAILABILITY AND IMPLEMENTATION: https://zhanglab.ccmb.med.umich.edu/ResPRE and https://github.com/leeyang/ResPRE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yang Li
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Jun Hu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
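As a rough illustration of the precision-matrix idea described in the ResPRE abstract above, the sketch below one-hot encodes a toy MSA, estimates the covariance matrix, and inverts a regularized copy of it. The function names, the 21-letter alphabet ordering, the `shrinkage` value, and the Frobenius-norm summary per residue pair are all assumptions made for this example; this is not the ResPRE implementation, which feeds the precision matrix into a deep residual network rather than scoring pairs directly.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus gap (assumed ordering)
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot_msa(msa):
    """Encode an MSA (list of equal-length strings) as an (N, L*21) 0/1 matrix."""
    n_seq, length, q = len(msa), len(msa[0]), len(ALPHABET)
    x = np.zeros((n_seq, length * q))
    for s, seq in enumerate(msa):
        for pos, aa in enumerate(seq):
            # Unknown symbols are mapped to the gap state.
            x[s, pos * q + AA_INDEX.get(aa, q - 1)] = 1.0
    return x


def precision_matrix(msa, shrinkage=0.1):
    """Inverse of a ridge-regularized covariance matrix of the one-hot MSA."""
    x = one_hot_msa(msa)
    cov = np.cov(x, rowvar=False, bias=True)
    # Adding shrinkage * I keeps the matrix invertible (assumed regularizer).
    cov_reg = cov + shrinkage * np.eye(cov.shape[0])
    return np.linalg.inv(cov_reg)


def coupling_map(msa, shrinkage=0.1):
    """Summarize each 21x21 precision block by its Frobenius norm -> (L, L) map."""
    length, q = len(msa[0]), len(ALPHABET)
    prec = precision_matrix(msa, shrinkage)
    blocks = prec.reshape(length, q, length, q)
    scores = np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    np.fill_diagonal(scores, 0.0)  # ignore self-couplings
    return scores


toy_msa = ["ACDE", "ACDE", "GCEE", "GCEA"]
print(coupling_map(toy_msa).round(3))
```

In the toy alignment, positions 0 and 2 vary together while position 1 is fully conserved, so the conserved column carries no coupling signal in the resulting map.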
|
42
|
Zhang GJ, Ma LF, Wang XQ, Zhou XG. Secondary Structure and Contact Guided Differential Evolution for Protein Structure Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1068-1081. [PMID: 30295627 DOI: 10.1109/tcbb.2018.2873691] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Indexed: 06/08/2023]
Abstract
Ab initio protein tertiary structure prediction is one of the long-standing problems in structural bioinformatics. With the help of residue-residue contact and secondary structure prediction information, the accuracy of ab initio structure prediction can be enhanced. In this study, an improved differential evolution with secondary structure and residue-residue contact information referred to as SCDE is proposed for protein structure prediction. In SCDE, two score models based on secondary structure and contact information are proposed, and two selection strategies, namely, secondary structure-based selection strategy and contact-based selection strategy, are designed to guide conformation space search. A probability distribution function is designed to balance these two selection strategies. Experimental results on a benchmark dataset with 28 proteins and four free model targets in CASP12 demonstrate that the proposed SCDE is effective and efficient.
|
43
|
Feng J, Shukla D. FingerprintContacts: Predicting Alternative Conformations of Proteins from Coevolution. J Phys Chem B 2020; 124:3605-3615. [PMID: 32283936 DOI: 10.1021/acs.jpcb.9b11869] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/30/2022]
Abstract
Proteins are dynamic molecules which perform diverse molecular functions by adopting different three-dimensional structures. Recent progress in residue-residue contacts prediction opens up new avenues for the de novo protein structure prediction from sequence information. However, it is still difficult to predict more than one conformation from residue-residue contacts alone. This is due to the inability to deconvolve the complex signals of residue-residue contacts, i.e., spatial contacts relevant for protein folding, conformational diversity, and ligand binding. Here, we introduce a machine learning based method, called FingerprintContacts, for extending the capabilities of residue-residue contacts. This algorithm leverages the features of residue-residue contacts, that is, (1) a single conformation outperforms the others in the structural prediction using all the top ranking residue-residue contacts as structural constraints and (2) conformation specific contacts rank lower and constitute a small fraction of residue-residue contacts. We demonstrate the capabilities of FingerprintContacts on eight ligand binding proteins with varying conformational motions. Furthermore, FingerprintContacts identifies small clusters of residue-residue contacts which are preferentially located in the dynamically fluctuating regions. With the rapid growth in protein sequence information, we expect FingerprintContacts to be a powerful first step in structural understanding of protein functional mechanisms.
|
44
|
Liu S, Xiang X, Gao X, Liu H. Neighborhood Preference of Amino Acids in Protein Structures and its Applications in Protein Structure Assessment. Sci Rep 2020; 10:4371. [PMID: 32152349 PMCID: PMC7062742 DOI: 10.1038/s41598-020-61205-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/19/2019] [Accepted: 02/24/2020] [Indexed: 12/02/2022] Open
Abstract
Amino acids form protein 3D structures in unique manners such that the folded structure is stable and functional under physiological conditions. Non-specific and non-covalent interactions between amino acids exhibit neighborhood preferences. Based on structural information from the protein data bank, a statistical energy function was derived to quantify amino acid neighborhood preferences. The neighborhood of one amino acid is defined by its contacting residues, and the energy function is determined by the neighboring residue types and relative positions. The neighborhood preference of amino acids was exploited to facilitate structural quality assessment, which was implemented in the neighborhood preference program NEPRE. The source codes are available via https://github.com/LiuLab-CSRC/NePre.
Affiliation(s)
- Siyuan Liu
  - Complex Systems Division, Beijing Computational Science Research Center, Haidian, Beijing, 100193, China
  - School of Software Engineering, University of Science and Technology of China, Hefei, Anhui, 230026, China
- Xilun Xiang
  - Complex Systems Division, Beijing Computational Science Research Center, Haidian, Beijing, 100193, China
  - School of Software Engineering, University of Science and Technology of China, Hefei, Anhui, 230026, China
- Xiang Gao
  - Complex Systems Division, Beijing Computational Science Research Center, Haidian, Beijing, 100193, China
  - School of Software Engineering, University of Science and Technology of China, Hefei, Anhui, 230026, China
- Haiguang Liu
  - Complex Systems Division, Beijing Computational Science Research Center, Haidian, Beijing, 100193, China
  - Physics Department, Beijing Normal University, Haidian, Beijing, 100875, China
|
45
|
Gutmann B, Royan S, Schallenberg-Rüdinger M, Lenz H, Castleden IR, McDowell R, Vacher MA, Tonti-Filippini J, Bond CS, Knoop V, Small ID. The Expansion and Diversification of Pentatricopeptide Repeat RNA-Editing Factors in Plants. Molecular Plant 2020; 13:215-230. [PMID: 31760160 DOI: 10.1016/j.molp.2019.11.002] [Citation(s) in RCA: 72] [Impact Index Per Article: 14.4] [Received: 08/13/2019] [Revised: 10/10/2019] [Accepted: 11/11/2019] [Indexed: 05/08/2023]
Abstract
The RNA-binding pentatricopeptide repeat (PPR) family comprises hundreds to thousands of genes in most plants, but only a few dozen in algae, indicating massive gene expansions during land plant evolution. The nature and timing of these expansions has not been well defined due to the sparse sequence data available from early-diverging land plant lineages. In this study, we exploit the comprehensive OneKP datasets of over 1000 transcriptomes from diverse plants and algae toward establishing a clear picture of the evolution of this massive gene family, focusing on the proteins typically associated with RNA editing, which show the most spectacular variation in numbers and domain composition across the plant kingdom. We characterize over 2 250 000 PPR motifs in over 400 000 proteins. In lycophytes, polypod ferns, and hornworts, nearly 10% of expressed protein-coding genes encode putative PPR editing factors, whereas they are absent from algae and complex-thalloid liverworts. We show that rather than a single expansion, most land plant lineages with high numbers of editing factors have continued to generate novel sequence diversity. We identify sequence variations that imply functional differences between PPR proteins in seed plants versus non-seed plants and variations we propose to be linked to seed-plant-specific editing co-factors. Finally, using the sequence variations across the datasets, we develop a structural model of the catalytic DYW domain associated with C-to-U editing and identify a clade of unique DYW variants that are strong candidates as U-to-C RNA-editing factors, given their phylogenetic distribution and sequence characteristics.
Affiliation(s)
- Bernard Gutmann
  - Australian Research Council Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, Perth 6009, WA, Australia
  - School of Molecular Sciences, The University of Western Australia, Crawley, Perth 6009, WA, Australia
- Santana Royan
  - Australian Research Council Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, Perth 6009, WA, Australia
  - School of Molecular Sciences, The University of Western Australia, Crawley, Perth 6009, WA, Australia
- Mareike Schallenberg-Rüdinger
  - IZMB - Institut für Zelluläre und Molekulare Botanik, Abteilung Molekulare Evolution, Universität Bonn, Kirschallee 1, 53115 Bonn, Germany
- Henning Lenz
  - IZMB - Institut für Zelluläre und Molekulare Botanik, Abteilung Molekulare Evolution, Universität Bonn, Kirschallee 1, 53115 Bonn, Germany
- Ian R Castleden
  - Australian Research Council Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, Perth 6009, WA, Australia
  - School of Molecular Sciences, The University of Western Australia, Crawley, Perth 6009, WA, Australia
- Rose McDowell
  - Australian Research Council Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, Perth 6009, WA, Australia
  - School of Molecular Sciences, The University of Western Australia, Crawley, Perth 6009, WA, Australia
- Michael A Vacher
  - Australian Research Council Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, Perth 6009, WA, Australia
  - School of Molecular Sciences, The University of Western Australia, Crawley, Perth 6009, WA, Australia
- Julian Tonti-Filippini
  - Australian Research Council Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, Perth 6009, WA, Australia
  - School of Molecular Sciences, The University of Western Australia, Crawley, Perth 6009, WA, Australia
- Charles S Bond
  - School of Molecular Sciences, The University of Western Australia, Crawley, Perth 6009, WA, Australia
- Volker Knoop
  - IZMB - Institut für Zelluläre und Molekulare Botanik, Abteilung Molekulare Evolution, Universität Bonn, Kirschallee 1, 53115 Bonn, Germany
- Ian D Small
  - Australian Research Council Centre of Excellence in Plant Energy Biology, The University of Western Australia, Crawley, Perth 6009, WA, Australia
  - School of Molecular Sciences, The University of Western Australia, Crawley, Perth 6009, WA, Australia
|
46
|
Machine learning for protein folding and dynamics. Curr Opin Struct Biol 2020; 60:77-84. [DOI: 10.1016/j.sbi.2019.12.005] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 09/13/2019] [Revised: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 12/17/2022]
|
47
|
Fukuda H, Tomii K. DeepECA: an end-to-end learning framework for protein contact prediction from a multiple sequence alignment. BMC Bioinformatics 2020; 21:10. [PMID: 31918654 PMCID: PMC6953294 DOI: 10.1186/s12859-019-3190-x] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/16/2019] [Accepted: 11/04/2019] [Indexed: 12/30/2022] Open
Abstract
Background: Recently developed methods of protein contact prediction, a crucially important step for protein structure prediction, depend heavily on deep neural networks (DNNs) and multiple sequence alignments (MSAs) of target proteins. Protein sequences are accumulating to an increasing degree such that abundant sequences to construct an MSA of a target protein are readily obtainable. Nevertheless, many cases present different ends of the number of sequences that can be included in an MSA used for contact prediction. The abundant sequences might degrade prediction results, but opportunities remain for a limited number of sequences to construct an MSA. To resolve these persistent issues, we strove to develop a novel framework using DNNs in an end-to-end manner for contact prediction. Results: We developed neural network models to improve precision of both deep and shallow MSAs. Results show that higher prediction accuracy was achieved by assigning weights to sequences in a deep MSA. Moreover, for shallow MSAs, adding a few sequential features was useful to increase the prediction accuracy of long-range contacts in our model. Based on these models, we expanded our model to a multi-task model to achieve higher accuracy by incorporating predictions of secondary structures and solvent-accessible surface areas. Moreover, we demonstrated that ensemble averaging of our models can raise accuracy. Using past CASP target protein domains, we tested our models and demonstrated that our final model is superior to or equivalent to existing meta-predictors. Conclusions: The end-to-end learning framework we built can use information derived from either deep or shallow MSAs for contact prediction. Recently, an increasing number of protein sequences have become accessible, including metagenomic sequences, which might degrade contact prediction results. Under such circumstances, our model can provide a means to reduce noise automatically. According to results of tertiary structure prediction based on contacts and secondary structures predicted by our model, more accurate three-dimensional models of a target protein are obtainable than those from existing ECA methods, starting from its MSA. DeepECA is available from https://github.com/tomiilab/DeepECA.
Affiliation(s)
- Hiroyuki Fukuda
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba-ken, 277-8562, Japan
- Kentaro Tomii
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba-ken, 277-8562, Japan
  - Artificial Intelligence Research Center (AIRC), Biotechnology Research Institute for Drug Discovery, Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
|
48
|
Rao R, Bhattacharya N, Thomas N, Duan Y, Chen X, Canny J, Abbeel P, Song YS. Evaluating Protein Transfer Learning with TAPE. Advances in Neural Information Processing Systems 2019; 32:9689-9701. [PMID: 33390682 PMCID: PMC7774645] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
|
49
|
Shrestha R, Fajardo E, Gil N, Fidelis K, Kryshtafovych A, Monastyrskyy B, Fiser A. Assessing the accuracy of contact predictions in CASP13. Proteins 2019; 87:1058-1068. [PMID: 31587357 PMCID: PMC6851495 DOI: 10.1002/prot.25819] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Received: 04/25/2019] [Revised: 09/17/2019] [Accepted: 09/17/2019] [Indexed: 01/07/2023]
Abstract
The accuracy of sequence-based tertiary contact predictions was assessed in a blind prediction experiment at the CASP13 meeting. After 4 years of significant improvements in prediction accuracy, another dramatic advance has taken place since CASP12 was held 2 years ago. The precision of predicting the top L/5 contacts in the free modeling category, where L is the corresponding length of the protein in residues, has exceeded 70%. As a comparison, the best-performing group at CASP12 with a 47% precision would have finished below the top 1/3 of the CASP13 groups. Extensively trained deep neural network approaches dominate the top performing algorithms, which appear to efficiently integrate information on coevolving residues and interacting fragments or possibly utilize memories of sequence similarities and sometimes can deliver accurate results even in the absence of virtually any target specific evolutionary information. If the current performance is evaluated by F-score on L contacts, it stands around 24% right now, which, despite the tremendous impact and advance in improving its utility for structure modeling, also suggests that there is much room left for further improvement.
Affiliation(s)
- Rojan Shrestha
  - Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
- Eduardo Fajardo
  - Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
- Nelson Gil
  - Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
- Krzysztof Fidelis
  - Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
- Andriy Kryshtafovych
  - Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
- Bohdan Monastyrskyy
  - Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
- Andras Fiser
  - Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
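The top-L/5 precision metric mentioned in the CASP13 assessment abstract above can be sketched in a few lines. This is a minimal illustration, not the official CASP evaluation code: the function name, the input layout (triples of residue indices and a score), and the default sequence-separation cutoff of 24 residues for "long-range" pairs are assumptions chosen for the example.

```python
def top_k_precision(predicted, true_contacts, seq_len, divisor=5, min_sep=24):
    """Precision of the top L/divisor predicted contacts.

    predicted      -- iterable of (i, j, score) residue-pair predictions
    true_contacts  -- iterable of (i, j) pairs that are contacts in the structure
    seq_len        -- protein length L in residues
    min_sep        -- minimum |i - j| for a pair to be considered (assumed cutoff)
    """
    truth = {(min(i, j), max(i, j)) for i, j in true_contacts}
    candidates = [
        (score, min(i, j), max(i, j))
        for i, j, score in predicted
        if abs(i - j) >= min_sep
    ]
    # Highest-scoring predictions first.
    candidates.sort(key=lambda item: item[0], reverse=True)
    k = max(1, seq_len // divisor)
    top = candidates[:k]
    if not top:
        return 0.0
    hits = sum((i, j) in truth for _, i, j in top)
    # Divide by the number of pairs actually evaluated, which may be below k.
    return hits / len(top)


# Toy example: L = 10, so the top L/5 = 2 predictions are evaluated.
preds = [(0, 5, 0.9), (1, 8, 0.8), (2, 3, 0.99), (0, 9, 0.1)]
native = [(0, 5), (0, 9)]
print(top_k_precision(preds, native, seq_len=10, divisor=5, min_sep=3))
```

In the toy call, the pair (2, 3) is dropped by the separation filter despite its high score, and one of the two remaining top predictions is a true contact.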
|
50
|
Zhang H, Zhang Q, Ju F, Zhu J, Gao Y, Xie Z, Deng M, Sun S, Zheng WM, Bu D. Predicting protein inter-residue contacts using composite likelihood maximization and deep learning. BMC Bioinformatics 2019; 20:537. [PMID: 31664895 PMCID: PMC6821021 DOI: 10.1186/s12859-019-3051-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 01/14/2019] [Accepted: 08/22/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Accurate prediction of inter-residue contacts of a protein is important to calculating its tertiary structure. Analysis of co-evolutionary events among residues has been proved effective in inferring inter-residue contacts. The Markov random field (MRF) technique, although being widely used for contact prediction, suffers from the following dilemma: the actual likelihood function of MRF is accurate but time-consuming to calculate; in contrast, approximations to the actual likelihood, say pseudo-likelihood, are efficient to calculate but inaccurate. Thus, how to achieve both accuracy and efficiency simultaneously remains a challenge. RESULTS: In this study, we present such an approach (called clmDCA) for contact prediction. Unlike plmDCA using pseudo-likelihood, i.e., the product of conditional probability of individual residues, our approach uses composite-likelihood, i.e., the product of conditional probability of all residue pairs. Composite likelihood has been theoretically proved as a better approximation to the actual likelihood function than pseudo-likelihood. Meanwhile, composite likelihood is still efficient to maximize, thus ensuring the efficiency of clmDCA. We present comprehensive experiments on popular benchmark datasets, including PSICOV dataset and CASP-11 dataset, to show that: i) clmDCA alone outperforms the existing MRF-based approaches in prediction accuracy. ii) When equipped with deep learning technique for refinement, the prediction accuracy of clmDCA was further significantly improved, suggesting the suitability of clmDCA for subsequent refinement procedure. We further present a successful application of the predicted contacts to accurately build tertiary structures for proteins in the PSICOV dataset. CONCLUSIONS: Composite likelihood maximization algorithm can efficiently estimate the parameters of Markov Random Fields and can improve the prediction accuracy of protein inter-residue contacts.
Affiliation(s)
- Haicang Zhang
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Qi Zhang
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Fusong Ju
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Jianwei Zhu
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Yujuan Gao
  - Center for Quantitative Biology, School of Mathematical Sciences, Center for Statistical Sciences, Peking University, Beijing, China
- Ziwei Xie
  - College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Minghua Deng
  - Center for Quantitative Biology, School of Mathematical Sciences, Center for Statistical Sciences, Peking University, Beijing, China
- Shiwei Sun
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- Wei-Mou Zheng
  - Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing, China
- Dongbo Bu
  - Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
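The distinction the clmDCA abstract above draws between pseudo-likelihood and composite likelihood can be written out as follows. The notation is assumed for illustration and is not taken from the paper: $x = (x_1, \dots, x_L)$ is one aligned sequence, $x_{\setminus i}$ denotes all positions except $i$, and $\theta = (h, e)$ collects the single-site and pairwise parameters of the Markov random field.

```latex
% Full MRF likelihood: the partition function Z(\theta) sums over all sequences.
P(x \mid \theta) = \frac{1}{Z(\theta)}
  \exp\Big( \sum_{i} h_i(x_i) + \sum_{i<j} e_{ij}(x_i, x_j) \Big)

% Pseudo-likelihood (as used by plmDCA): one conditional per residue.
\mathrm{PL}(\theta) = \prod_{i=1}^{L} P\big(x_i \mid x_{\setminus i}, \theta\big)

% Composite likelihood (as used by clmDCA): one conditional per residue pair.
\mathrm{CL}(\theta) = \prod_{i<j} P\big(x_i, x_j \mid x_{\setminus \{i,j\}}, \theta\big)
```

With $q$ possible states per position, each pseudo-likelihood factor normalizes over $q$ states and each composite-likelihood factor over $q^2$ joint states, whereas $Z(\theta)$ sums over $q^L$ sequences; this is why both approximations remain tractable while the exact likelihood does not.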
|