101
|
Abdul-Khalek N, Wimmer R, Overgaard MT, Gregersen Echers S. Insight on physicochemical properties governing peptide MS1 response in HPLC-ESI-MS/MS: A deep learning approach. Comput Struct Biotechnol J 2023; 21:3715-3727. [PMID: 37560124 PMCID: PMC10407266 DOI: 10.1016/j.csbj.2023.07.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2023] [Revised: 07/13/2023] [Accepted: 07/19/2023] [Indexed: 08/11/2023] Open
Abstract
Accurate and absolute quantification of peptides in complex mixtures using quantitative mass spectrometry (MS)-based methods requires foreground knowledge and isotopically labeled standards, thereby increasing analytical expenses, time consumption, and labor, thus limiting the number of peptides that can be accurately quantified. This originates from differential ionization efficiency between peptides and thus, understanding the physicochemical properties that influence the ionization and response in MS analysis is essential for developing less restrictive label-free quantitative methods. Here, we used equimolar peptide pool repository data to develop a deep learning model capable of identifying amino acids influencing the MS1 response. By using an encoder-decoder with an attention mechanism and correlating attention weights with amino acid physicochemical properties, we obtain insight on properties governing the peptide-level MS1 response within the datasets. While the problem cannot be described by one single set of amino acids and properties, distinct patterns were reproducibly obtained. Properties are grouped in three main categories related to peptide hydrophobicity, charge, and structural propensities. Moreover, our model can predict MS1 intensity output under defined conditions based solely on peptide sequence input. Using a refined training dataset, the model predicted log-transformed peptide MS1 intensities with an average error of 9.7 ± 0.5% based on 5-fold cross validation, and outperformed random forest and ridge regression models on both log-transformed and real scale data. This work demonstrates how deep learning can facilitate identification of physicochemical properties influencing peptide MS1 responses, but also illustrates how sequence-based response prediction and label-free peptide-level quantification may impact future workflows within quantitative proteomics.
Collapse
Affiliation(s)
- Naim Abdul-Khalek
- Department of Chemistry and Bioscience, Aalborg University, Aalborg 9220, Denmark
| | - Reinhard Wimmer
- Department of Chemistry and Bioscience, Aalborg University, Aalborg 9220, Denmark
| | | | | |
Collapse
|
102
|
Kabir MWU, Alawad DM, Mishra A, Hoque MT. TAFPred: Torsion Angle Fluctuations Prediction from Protein Sequences. BIOLOGY 2023; 12:1020. [PMID: 37508449 PMCID: PMC10376102 DOI: 10.3390/biology12071020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/19/2023] [Revised: 07/15/2023] [Accepted: 07/17/2023] [Indexed: 07/30/2023]
Abstract
Protein molecules show varying degrees of flexibility throughout their three-dimensional structures. The flexibility is determined by the fluctuations in torsion angles, specifically phi (φ) and psi (ψ), which define the protein backbone. These angle fluctuations are derived from variations in backbone torsion angles observed in different models. By analyzing the fluctuations in Cartesian coordinate space, we can understand the structural flexibility of proteins. Predicting torsion angle fluctuations is valuable for determining protein function and structure when these angles act as constraints. In this study, a machine learning method called TAFPred is developed to predict torsion angle fluctuations using protein sequences directly. The method incorporates various features, such as disorder probability, position-specific scoring matrix profiles, secondary structure probabilities, and more. TAFPred, employing an optimized Light Gradient Boosting Machine Regressor (LightGBM), achieved high accuracy with correlation coefficients of 0.746 and 0.737 and mean absolute errors of 0.114 and 0.123 for the φ and ψ angles, respectively. Compared to the state-of-the-art method, TAFPred demonstrated significant improvements of 10.08% in MAE and 24.83% in PCC for the phi angle and 9.93% in MAE, and 22.37% in PCC for the psi angle.
Collapse
Affiliation(s)
- Md Wasi Ul Kabir
- Computer Science Department, University of New Orleans, New Orleans, LA 70148, USA
| | - Duaa Mohammad Alawad
- Computer Science Department, University of New Orleans, New Orleans, LA 70148, USA
| | - Avdesh Mishra
- Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX 78363, USA
| | - Md Tamjidul Hoque
- Computer Science Department, University of New Orleans, New Orleans, LA 70148, USA
| |
Collapse
|
103
|
Chandra O, Sharma M, Pandey N, Jha IP, Mishra S, Kong SL, Kumar V. Patterns of transcription factor binding and epigenome at promoters allow interpretable predictability of multiple functions of non-coding and coding genes. Comput Struct Biotechnol J 2023; 21:3590-3603. [PMID: 37520281 PMCID: PMC10371796 DOI: 10.1016/j.csbj.2023.07.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2023] [Revised: 07/05/2023] [Accepted: 07/11/2023] [Indexed: 08/01/2023] Open
Abstract
Understanding the biological roles of all genes only through experimental methods is challenging. A computational approach with reliable interpretability is needed to infer the function of genes, particularly for non-coding RNAs. We have analyzed genomic features that are present across both coding and non-coding genes like transcription factor (TF) and cofactor ChIP-seq (823), histone modifications ChIP-seq (n = 621), cap analysis gene expression (CAGE) tags (n = 255), and DNase hypersensitivity profiles (n = 255) to predict ontology-based functions of genes. Our approach for gene function prediction was reliable (>90% balanced accuracy) for 486 gene-sets. PubMed abstract mining and CRISPR screens supported the inferred association of genes with biological functions, for which our method had high accuracy. Further analysis revealed that TF-binding patterns at promoters have high predictive strength for multiple functions. TF-binding patterns at the promoter add an unexplored dimension of explainable regulatory aspects of genes and their functions. Therefore, we performed a comprehensive analysis for the functional-specificity of TF-binding patterns at promoters and used them for clustering functions to reveal many latent groups of gene-sets involved in common major cellular processes. We also showed how our approach could be used to infer the functions of non-coding genes using the CRISPR screens of coding genes, which were validated using a long non-coding RNA CRISPR screen. Thus our results demonstrated the generality of our approach by using gene-sets from CRISPR screens. Overall, our approach opens an avenue for predicting the involvement of non-coding genes in various functions.
Collapse
Affiliation(s)
- Omkar Chandra
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
| | - Madhu Sharma
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
| | - Neetesh Pandey
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
| | - Indra Prakash Jha
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
| | - Shreya Mishra
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
| | - Say Li Kong
- Genome Institute of Singapore, Agency for Science Technology and Research, Singapore, Singapore
| | - Vibhor Kumar
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Ph-III, New Delhi, India
| |
Collapse
|
104
|
Zheng R, Huang Z, Deng L. Large-scale predicting protein functions through heterogeneous feature fusion. Brief Bioinform 2023:bbad243. [PMID: 37401369 DOI: 10.1093/bib/bbad243] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2023] [Revised: 05/18/2023] [Accepted: 06/12/2023] [Indexed: 07/05/2023] Open
Abstract
As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined. Automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods are typically based on expanding the relatively small number of experimentally determined functions to large collections of proteins with various clues, including sequence homology, protein-protein interaction, gene co-expression, etc. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. The improvement of coverage is because the number of structures predicted by AlphaFold is greatly increased, and on the other hand, PredGO can extensively use non-structural information for functional prediction. Moreover, we show that over 205 000 ($\sim $100%) entries in UniProt for human are annotated by PredGO, over 186 000 ($\sim $90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.
Collapse
Affiliation(s)
- Rongtao Zheng
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
| | - Zhijian Huang
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
| | - Lei Deng
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
| |
Collapse
|
105
|
Boadu F, Cao H, Cheng J. Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. Bioinformatics 2023; 39:i318-i325. [PMID: 37387145 DOI: 10.1093/bioinformatics/btad208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, much fewer methods leverage protein structures in protein function prediction because there was lack of accurate protein structures for most proteins until recently. RESULTS We developed TransFun-a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy. AVAILABILITY AND IMPLEMENTATION The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.
Collapse
Affiliation(s)
- Frimpong Boadu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, United States
| | - Hongyuan Cao
- Department of Statistics, Florida State University, Tallahassee, FL 32306, Unites States
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, United States
| |
Collapse
|
106
|
Adhav V, Saikrishnan K. The Realm of Unconventional Noncovalent Interactions in Proteins: Their Significance in Structure and Function. ACS OMEGA 2023; 8:22268-22284. [PMID: 37396257 PMCID: PMC10308531 DOI: 10.1021/acsomega.3c00205] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/11/2023] [Accepted: 05/22/2023] [Indexed: 07/04/2023]
Abstract
Proteins and their assemblies are fundamental for living cells to function. Their complex three-dimensional architecture and its stability are attributed to the combined effect of various noncovalent interactions. It is critical to scrutinize these noncovalent interactions to understand their role in the energy landscape in folding, catalysis, and molecular recognition. This Review presents a comprehensive summary of unconventional noncovalent interactions, beyond conventional hydrogen bonds and hydrophobic interactions, which have gained prominence over the past decade. The noncovalent interactions discussed include low-barrier hydrogen bonds, C5 hydrogen bonds, C-H···π interactions, sulfur-mediated hydrogen bonds, n → π* interactions, London dispersion interactions, halogen bonds, chalcogen bonds, and tetrel bonds. This Review focuses on their chemical nature, interaction strength, and geometrical parameters obtained from X-ray crystallography, spectroscopy, bioinformatics, and computational chemistry. Also highlighted are their occurrence in proteins or their complexes and recent advances made toward understanding their role in biomolecular structure and function. Probing the chemical diversity of these interactions, we determined that the variable frequency of occurrence in proteins and the ability to synergize with one another are important not only for ab initio structure prediction but also to design proteins with new functionalities. A better understanding of these interactions will promote their utilization in designing and engineering ligands with potential therapeutic value.
Collapse
Affiliation(s)
- Vishal
Annasaheb Adhav
- Department of Biology, Indian Institute of Science Education and Research, Pune 411008, India
| | - Kayarat Saikrishnan
- Department of Biology, Indian Institute of Science Education and Research, Pune 411008, India
| |
Collapse
|
107
|
Nordquist E, Zhang G, Barethiya S, Ji N, White KM, Han L, Jia Z, Shi J, Cui J, Chen J. Incorporating physics to overcome data scarcity in predictive modeling of protein function: a case study of BK channels. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.06.24.546384. [PMID: 37425916 PMCID: PMC10327070 DOI: 10.1101/2023.06.24.546384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/11/2023]
Abstract
Machine learning has played transformative roles in numerous chemical and biophysical problems such as protein folding where large amount of data exists. Nonetheless, many important problems remain challenging for data-driven machine learning approaches due to the limitation of data scarcity. One approach to overcome data scarcity is to incorporate physical principles such as through molecular modeling and simulation. Here, we focus on the big potassium (BK) channels that play important roles in cardiovascular and neural systems. Many mutants of BK channel are associated with various neurological and cardiovascular diseases, but the molecular effects are unknown. The voltage gating properties of BK channels have been characterized for 473 site-specific mutations experimentally over the last three decades; yet, these functional data by themselves remain far too sparse to derive a predictive model of BK channel voltage gating. Using physics-based modeling, we quantify the energetic effects of all single mutations on both open and closed states of the channel. Together with dynamic properties derived from atomistic simulations, these physical descriptors allow the training of random forest models that could reproduce unseen experimentally measured shifts in gating voltage, ΔV 1/2 , with a RMSE ∼ 32 mV and correlation coefficient of R ∼ 0.7. Importantly, the model appears capable of uncovering nontrivial physical principles underlying the gating of the channel, including a central role of hydrophobic gating. The model was further evaluated using four novel mutations of L235 and V236 on the S5 helix, mutations of which are predicted to have opposing effects on V 1/2 and suggest a key role of S5 in mediating voltage sensor-pore coupling. The measured ΔV 1/2 agree quantitatively with prediction for all four mutations, with a high correlation of R = 0.92 and RMSE = 18 mV. Therefore, the model can capture nontrivial voltage gating properties in regions where few mutations are known. The success of predictive modeling of BK voltage gating demonstrates the potential of combining physics and statistical learning for overcoming data scarcity in nontrivial protein function prediction. Author Summary Deep machine learning has brought many exciting breakthroughs in chemistry, physics and biology. These models require large amount of training data and struggle when the data is scarce. The latter is true for predictive modeling of the function of complex proteins such as ion channels, where only hundreds of mutational data may be available. Using the big potassium (BK) channel as a biologically important model system, we demonstrate that a reliable predictive model of its voltage gating property could be derived from only 473 mutational data by incorporating physics-derived features, which include dynamic properties from molecular dynamics simulations and energetic quantities from Rosetta mutation calculations. We show that the final random forest model captures key trends and hotspots in mutational effects of BK voltage gating, such as the important role of pore hydrophobicity. A particularly curious prediction is that mutations of two adjacent residues on the S5 helix would always have opposite effects on the gating voltage, which was confirmed by experimental characterization of four novel mutations. The current work demonstrates the importance and effectiveness of incorporating physics in predictive modeling of protein function with scarce data.
Collapse
Affiliation(s)
- Erik Nordquist
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | - Guohui Zhang
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Shrishti Barethiya
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | - Nathan Ji
- Department of Biology, Boston College, Chestnut Hill, Massachusetts, USA
| | - Kelli M White
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Lu Han
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Zhiguang Jia
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | - Jingyi Shi
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Jianmin Cui
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Jianhan Chen
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| |
Collapse
|
108
|
Li H, Liu B. BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo. PLoS Comput Biol 2023; 19:e1011214. [PMID: 37339155 DOI: 10.1371/journal.pcbi.1011214] [Citation(s) in RCA: 55] [Impact Index Per Article: 27.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2022] [Accepted: 05/24/2023] [Indexed: 06/22/2023] Open
Abstract
As the key for biological sequence structure and function prediction, disease diagnosis and treatment, biological sequence similarity analysis has attracted more and more attentions. However, the exiting computational methods failed to accurately analyse the biological sequence similarities because of the various data types (DNA, RNA, protein, disease, etc) and their low sequence similarities (remote homology). Therefore, new concepts and techniques are desired to solve this challenging problem. Biological sequences (DNA, RNA and protein sequences) can be considered as the sentences of "the book of life", and their similarities can be considered as the biological language semantics (BLS). In this study, we are seeking the semantics analysis techniques derived from the natural language processing (NLP) to comprehensively and accurately analyse the biological sequence similarities. 27 semantics analysis methods derived from NLP were introduced to analyse biological sequence similarities, bringing new concepts and techniques to biological sequence similarity analysis. Experimental results show that these semantics analysis methods are able to facilitate the development of protein remote homology detection, circRNA-disease associations identification and protein function annotation, achieving better performance than the other state-of-the-art predictors in the related fields. Based on these semantics analysis methods, a platform called BioSeq-Diabolo has been constructed, which is named after a popular traditional sport in China. The users only need to input the embeddings of the biological sequence data. BioSeq-Diabolo will intelligently identify the task, and then accurately analyse the biological sequence similarities based on biological language semantics. BioSeq-Diabolo will integrate different biological sequence similarities in a supervised manner by using Learning to Rank (LTR), and the performance of the constructed methods will be evaluated and analysed so as to recommend the best methods for the users. The web server and stand-alone package of BioSeq-Diabolo can be accessed at http://bliulab.net/BioSeq-Diabolo/server/.
Collapse
Affiliation(s)
- Hongliang Li
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
109
|
Oliveira GB, Pedrini H, Dias Z. TEMPROT: protein function annotation using transformers embeddings and homology search. BMC Bioinformatics 2023; 24:242. [PMID: 37291492 DOI: 10.1186/s12859-023-05375-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2022] [Accepted: 06/02/2023] [Indexed: 06/10/2023] Open
Abstract
BACKGROUND Although the development of sequencing technologies has provided a large number of protein sequences, the analysis of functions that each one plays is still difficult due to the efforts of laboratorial methods, making necessary the usage of computational methods to decrease this gap. As the main source of information available about proteins is their sequences, approaches that can use this information, such as classification based on the patterns of the amino acids and the inference based on sequence similarity using alignment tools, are able to predict a large collection of proteins. The methods available in the literature that use this type of feature can achieve good results, however, they present restrictions of protein length as input to their models. In this work, we present a new method, called TEMPROT, based on the fine-tuning and extraction of embeddings from an available architecture pre-trained on protein sequences. We also describe TEMPROT+, an ensemble between TEMPROT and BLASTp, a local alignment tool that analyzes sequence similarity, which improves the results of our former approach. RESULTS The evaluation of our proposed classifiers with the literature approaches has been conducted on our dataset, which was derived from CAFA3 challenge database. Both TEMPROT and TEMPROT+ achieved competitive results on [Formula: see text], [Formula: see text], AuPRC and IAuPRC metrics on Biological Process (BP), Cellular Component (CC) and Molecular Function (MF) ontologies compared to state-of-the-art models, with the main results equal to 0.581, 0.692 and 0.662 of [Formula: see text] on BP, CC and MF, respectively. CONCLUSIONS The comparison with the literature showed that our model presented competitive results compared the state-of-the-art approaches considering the amino acid sequence pattern recognition and homology analysis. Our model also presented improvements related to the input size that the model can use to train compared to the literature methods.
Collapse
Affiliation(s)
| | - Helio Pedrini
- Institute of Computing, University of Campinas, Campinas, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
| |
Collapse
|
110
|
Suwanchaikasem P, Nie S, Idnurm A, Selby‐Pham J, Walker R, Boughton BA. Effects of chitin and chitosan on root growth, biochemical defense response and exudate proteome of Cannabis sativa. PLANT-ENVIRONMENT INTERACTIONS (HOBOKEN, N.J.) 2023; 4:115-133. [PMID: 37362423 PMCID: PMC10290428 DOI: 10.1002/pei3.10106] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/15/2022] [Revised: 03/09/2023] [Accepted: 03/19/2023] [Indexed: 06/28/2023]
Abstract
Fungal pathogens pose a major threat to Cannabis sativa production, requiring safe and effective management procedures to control disease. Chitin and chitosan are natural molecules that elicit plant defense responses. Investigation of their effects on C. sativa will advance understanding of plant responses towards elicitors and provide a potential pathway to enhance plant resistance against diseases. Plants were grown in the in vitro Root-TRAPR system and treated with colloidal chitin and chitosan. Plant morphology was monitored, then plant tissues and exudates were collected for enzymatic activity assays, phytohormone quantification, qPCR analysis and proteomics profiling. Chitosan treatments showed increased total chitinase activity and expression of pathogenesis-related (PR) genes by 3-5 times in the root tissues. In the exudates, total peroxidase and chitinase activities and levels of defense proteins such as PR protein 1 and endochitinase 2 were increased. Shoot development was unaffected, but root development was inhibited after chitosan exposure. In contrast, chitin treatments had no significant impact on any defense parameters, including enzymatic activities, hormone quantities, gene expression levels and root secreted proteins. These results indicate that colloidal chitosan, significantly enhancing defense responses in C. sativa root system, could be used as a potential elicitor, particularly in hydroponic scenarios to manage crop diseases.
Collapse
Affiliation(s)
| | - Shuai Nie
- Mass Spectrometry and Proteomics Facility, Bio21 Molecular Science and Biotechnology InstituteUniversity of MelbourneMelbourneVictoria3052Australia
| | - Alexander Idnurm
- School of BioSciencesUniversity of MelbourneMelbourneVictoria3010Australia
| | - Jamie Selby‐Pham
- School of BioSciencesUniversity of MelbourneMelbourneVictoria3010Australia
- Cannabis and Biostimulants Research Group Pty LtdMelbourneVictoria3020Australia
| | - Robert Walker
- School of BioSciencesUniversity of MelbourneMelbourneVictoria3010Australia
| | - Berin A. Boughton
- School of BioSciencesUniversity of MelbourneMelbourneVictoria3010Australia
- Australian National Phenome CentreMurdoch UniversityPerthWestern Australia6150Australia
| |
Collapse
|
111
|
Watanabe N, Kuriya Y, Murata M, Yamamoto M, Shimizu M, Araki M. Different Recognition of Protein Features Depending on Deep Learning Models: A Case Study of Aromatic Decarboxylase UbiD. BIOLOGY 2023; 12:795. [PMID: 37372080 DOI: 10.3390/biology12060795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/21/2023] [Revised: 05/17/2023] [Accepted: 05/29/2023] [Indexed: 06/29/2023]
Abstract
The number of unannotated protein sequences is explosively increasing due to genome sequence technology. A more comprehensive understanding of protein functions for protein annotation requires the discovery of new features that cannot be captured from conventional methods. Deep learning can extract important features from input data and predict protein functions based on the features. Here, protein feature vectors generated by 3 deep learning models are analyzed using Integrated Gradients to explore important features of amino acid sites. As a case study, prediction and feature extraction models for UbiD enzymes were built using these models. The important amino acid residues extracted from the models were different from secondary structures, conserved regions and active sites of known UbiD information. Interestingly, the different amino acid residues within UbiD sequences were regarded as important factors depending on the type of models and sequences. The Transformer models focused on more specific regions than the other models. These results suggest that each deep learning model understands protein features with different aspects from existing knowledge and has the potential to discover new laws of protein functions. This study will help to extract new protein features for the other protein annotations.
Collapse
Affiliation(s)
- Naoki Watanabe
- Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition, 3-17 Senrioka-shinmachi, Settsu 566-0002, Japan
| | - Yuki Kuriya
- Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition, 3-17 Senrioka-shinmachi, Settsu 566-0002, Japan
| | - Masahiro Murata
- Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai, Nada-Ku, Kobe 657-8501, Japan
| | - Masaki Yamamoto
- Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition, 3-17 Senrioka-shinmachi, Settsu 566-0002, Japan
| | - Masayuki Shimizu
- Bacchus Bio Innovation Co., Ltd., 6-3-7 Minatojima minami-machi, Kobe 650-0047, Japan
| | - Michihiro Araki
- Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition, 3-17 Senrioka-shinmachi, Settsu 566-0002, Japan
- Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai, Nada-Ku, Kobe 657-8501, Japan
- Graduate School of Medicine, Kyoto University, 54 Shogoin-Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan
- National Cerebral and Cardiovascular Center, 6-1 Kishibe-Shinmachi, Suita 564-8565, Japan
| |
Collapse
|
112
|
Wang Z, Deng Z, Zhang W, Lou Q, Choi KS, Wei Z, Wang L, Wu J. MMSMAPlus: a multi-view multi-scale multi-attention embedding model for protein function prediction. Brief Bioinform 2023:7187109. [PMID: 37258453 DOI: 10.1093/bib/bbad201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2022] [Revised: 04/16/2023] [Accepted: 05/08/2023] [Indexed: 06/02/2023] Open
Abstract
Protein is the most important component in organisms and plays an indispensable role in life activities. In recent years, a large number of intelligent methods have been proposed to predict protein function. These methods obtain different types of protein information, including sequence, structure and interaction network. Among them, protein sequences have gained significant attention where methods are investigated to extract the information from different views of features. However, how to fully exploit the views for effective protein sequence analysis remains a challenge. In this regard, we propose a multi-view, multi-scale and multi-attention deep neural model (MMSMA) for protein function prediction. First, MMSMA extracts multi-view features from protein sequences, including one-hot encoding features, evolutionary information features, deep semantic features and overlapping property features based on physiochemistry. Second, a specific multi-scale multi-attention deep network model (MSMA) is built for each view to realize the deep feature learning and preliminary classification. In MSMA, both multi-scale local patterns and long-range dependence from protein sequences can be captured. Third, a multi-view adaptive decision mechanism is developed to make a comprehensive decision based on the classification results of all the views. To further improve the prediction performance, an extended version of MMSMA, MMSMAPlus, is proposed to integrate homology-based protein prediction under the framework of multi-view deep neural model. Experimental results show that the MMSMAPlus has promising performance and is significantly superior to the state-of-the-art methods. The source code can be found at https://github.com/wzy-2020/MMSMAPlus.
Collapse
Affiliation(s)
- Zhongyu Wang
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Zhaohong Deng
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Wei Zhang
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Qiongdan Lou
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | | | - Zhisheng Wei
- National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China
| | - Lei Wang
- National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China
| | - Jing Wu
- National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China
| |
Collapse
|
113
|
Sun S, Wang M, Xiang J, Shao Y, Li L, Sedjoah RCAA, Wu G, Zhou J, Xin Z. BON domain-containing protein-mediated co-selection of antibiotic and heavy metal resistance in bacteria. Int J Biol Macromol 2023; 238:124062. [PMID: 36933600 DOI: 10.1016/j.ijbiomac.2023.124062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2023] [Revised: 03/12/2023] [Accepted: 03/13/2023] [Indexed: 03/17/2023]
Abstract
The widespread antibiotic resistance of bacteria has become one of the most severe threats to public health. However, the mechanisms that allow microbial acquisition of resistance are still poorly understood. In the present study, a novel BON domain-containing protein was heterologously expressed in Escherichia coli. It functions as an efflux pump-like to confer resistance to various antibiotics, especially for ceftazidime, with a >32-fold increase in minimum inhibitory concentration (MIC). The fluorescence spectroscopy experiment indicated that BON protein could interact with several metal ions, such as copper and silver, which has been associated with the induced co-regulation of antibiotic and heavy metal resistance in bacteria. Furthermore, the BON protein was demonstrated to spontaneously self-assemble into a trimer and generate a central pore-like architecture for antibiotic transporting. A WXG motif as a molecular switch is essential for forming the transmembrane oligomeric pores and controls the interaction between BON protein and cell membrane. Based on these findings, a mechanism termed "one-in, one-out", was proposed for the first time. The present study provides new insights into the structure and function of BON protein and a previously unidentified antibiotic resistance mechanism, filling the knowledge gap in understanding BON protein-mediated intrinsic antibiotic resistance.
Collapse
Affiliation(s)
- Shengwei Sun
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
| | - Mengxi Wang
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
| | - Jiahui Xiang
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
| | - Yuting Shao
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
| | - Longxiang Li
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
| | - Rita-Cindy Aye-Ayire Sedjoah
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
| | - Guojun Wu
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
| | - Jingjie Zhou
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China
| | - Zhihong Xin
- Key Laboratory of Food Processing and Quality Control, College of Food Science and Technology, Nanjing Agricultural University, Nanjing 210095, PR China.
| |
Collapse
|
114
|
Ibtehaz N, Sourav SMSH, Bayzid MS, Rahman MS. Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis. Protein J 2023; 42:135-146. [PMID: 36977849 DOI: 10.1007/s10930-023-10096-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/13/2023] [Indexed: 03/29/2023]
Abstract
The inception of next generations sequencing technologies have exponentially increased the volume of biological sequence data. Protein sequences, being quoted as the 'language of life', has been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, in recent years there have been a number of breakthroughs in the domain of Natural Language Processing. Since these methods are capable of performing different tasks when trained with a sufficient amount of data, off-the-shelf models are used to perform various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and made an attempt to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping the similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram aids modeling and training deep learning models better. Our experiments with a simple baseline LSTM model and a much complex CNN model of DeepGoPlus shows the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
Collapse
|
115
|
Wu Z, Guo M, Jin X, Chen J, Liu B. CFAGO: cross-fusion of network and attributes based on attention mechanism for protein function prediction. Bioinformatics 2023; 39:7072461. [PMID: 36883697 PMCID: PMC10032634 DOI: 10.1093/bioinformatics/btad123] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2022] [Revised: 02/28/2023] [Accepted: 03/05/2023] [Indexed: 03/09/2023] Open
Abstract
MOTIVATION Protein function annotation is fundamental to understanding biological mechanisms. The abundant genome-scale protein-protein interaction (PPI) networks, together with other protein biological attributes, provide rich information for annotating protein functions. As PPI networks and biological attributes describe protein functions from different perspectives, it is highly challenging to cross-fuse them for protein function prediction. Recently, several methods combine the PPI networks and protein attributes via the graph neural networks (GNNs). However, GNNs may inherit or even magnify the bias caused by noisy edges in PPI networks. Besides, GNNs with stacking of many layers may cause the over-smoothing problem of node representations. RESULTS We develop a novel protein function prediction method, CFAGO, to integrate single-species PPI networks and protein biological attributes via a multi-head attention mechanism. CFAGO is first pre-trained with an encoder-decoder architecture to capture the universal protein representation of the two sources. It is then fine-tuned to learn more effective protein representations for protein function prediction. Benchmark experiments on human and mouse datasets show CFAGO outperforms state-of-the-art single-species network-based methods by at least 7.59%, 6.90%, 11.68% in terms of m-AUPR, M-AUPR, and Fmax, respectively, demonstrating cross-fusion by multi-head attention mechanism can greatly improve the protein function prediction. We further evaluate the quality of captured protein representations in terms of Davies Bouldin Score, whose results show that cross-fused protein representations by multi-head attention mechanism are at least 2.7% better than that of original and concatenated representations. We believe CFAGO is an effective tool for protein function prediction. AVAILABILITY AND IMPLEMENTATION The source code of CFAGO and experiments data are available at: http://bliulab.net/CFAGO/.
Collapse
Affiliation(s)
- Zhourun Wu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Mingyue Guo
- School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
| | - Xiaopeng Jin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, Guangdong 518118, China
| | - Junjie Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
| |
Collapse
|
116
|
Li M, Shi W, Zhang F, Zeng M, Li Y. A Deep Learning Framework for Predicting Protein Functions With Co-Occurrence of GO Terms. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:833-842. [PMID: 35476573 DOI: 10.1109/tcbb.2022.3170719] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
The understanding of protein functions is critical to many biological problems such as the development of new drugs and new crops. To reduce the huge gap between the increase of protein sequences and annotations of protein functions, many methods have been proposed to deal with this problem. These methods use Gene Ontology (GO) to classify the functions of proteins and consider one GO term as a class label. However, they ignore the co-occurrence of GO terms that is helpful for protein function prediction. We propose a new deep learning model, named DeepPFP-CO, which uses Graph Convolutional Network (GCN) to explore and capture the co-occurrence of GO terms to improve the protein function prediction performance. In this way, we can further deduce the protein functions by fusing the predicted propensity of the center function and its co-occurrence functions. We use Fmax and AUPR to evaluate the performance of DeepPFP-CO and compare DeepPFP-CO with state-of-the-art methods such as DeepGOPlus and DeepGOA. The computational results show that DeepPFP-CO outperforms DeepGOPlus and other methods. Moreover, we further analyze our model at the protein level. The results have demonstrated that DeepPFP-CO improves the performance of protein function prediction. DeepPFP-CO is available at https://csuligroup.com/DeepPFP/.
Collapse
|
117
|
Ranjan A, Fahad MS, Fernandez-Baca D, Tripathi S, Deepak A. MCWS-Transformers: Towards an Efficient Modeling of Protein Sequences via Multi Context-Window Based Scaled Self-Attention. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1188-1199. [PMID: 35536815 DOI: 10.1109/tcbb.2022.3173789] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
This paper advances the self-attention mechanism in the standard transformer network specific to the modeling of the protein sequences. We introduce a novel context-window based scaled self-attention mechanism for processing protein sequences that is based on the notion of (i) local context and (ii) large contextual pattern. Both notions are essential to building a good representation for protein sequences. The proposed context-window based scaled self-attention mechanism is further used to build the multi context-window based scaled (MCWS) transformer network for the protein function prediction task at the protein sub-sequence level. Overall, the proposed MCWS transformer network produced improved predictive performances, outperforming existing state-of-the-art approaches by substantial margins. With respect to the standard transformer network, the proposed network produced improvements in F1-score of +2.30% and +2.08% on the biological process (BP) and molecular function (MF) datasets, respectively. The corresponding improvements over the state-of-the-art ProtVecGen-Plus+ProtVecGen-Ensemble approach are +3.38% (BP) and +2.86% (MF). Equally important, robust performances were obtained across protein sequences of different lengths.
Collapse
|
118
|
Di Persia L, Lopez T, Arce A, Milone DH, Stegmayer G. exp2GO: Improving Prediction of Functions in the Gene Ontology With Expression Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:999-1008. [PMID: 35417352 DOI: 10.1109/tcbb.2022.3167245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
The computational methods for the prediction of gene function annotations aim to automatically find associations between a gene and a set of Gene Ontology (GO) terms describing its functions. Since the hand-made curation process of novel annotations and the corresponding wet experiments validations are very time-consuming and costly procedures, there is a need for computational tools that can reliably predict likely annotations and boost the discovery of new gene functions. This work proposes a novel method for predicting annotations based on the inference of GO similarities from expression similarities. The novel method was benchmarked against other methods on several public biological datasets, obtaining the best comparative results. exp2GO effectively improved the prediction of GO annotations in comparison to state-of-the-art methods. Furthermore, the proposal was validated with a full genome case where it was capable of predicting relevant and accurate biological functions. The repository of this project withh full data and code is available at https://github.com/sinc-lab/exp2GO.
Collapse
|
119
|
MultiScale-CNN-4mCPred: a multi-scale CNN and adaptive embedding-based method for mouse genome DNA N4-methylcytosine prediction. BMC Bioinformatics 2023; 24:21. [PMID: 36653789 PMCID: PMC9847203 DOI: 10.1186/s12859-023-05135-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2022] [Accepted: 01/04/2023] [Indexed: 01/19/2023] Open
Abstract
N4-methylcytosine (4mC) is an important epigenetic mechanism, which regulates many cellular processes such as cell differentiation and gene expression. The knowledge about the 4mC sites is a key foundation to exploring its roles. Due to the limitation of techniques, precise detection of 4mC is still a challenging task. In this paper, we presented a multi-scale convolution neural network (CNN) and adaptive embedding-based computational method for predicting 4mC sites in mouse genome, which was referred to as MultiScale-CNN-4mCPred. The MultiScale-CNN-4mCPred used adaptive embedding to encode nucleotides, and then utilized multi-scale CNNs as well as long short-term memory to extract more in-depth local properties and contextual semantics in the sequences. The MultiScale-CNN-4mCPred is an end-to-end learning method, which requires no sophisticated feature design. The MultiScale-CNN-4mCPred reached an accuracy of 81.66% in the 10-fold cross-validation, and an accuracy of 84.69% in the independent test, outperforming state-of-the-art methods. We implemented the proposed method into a user-friendly web application which is freely available at: http://www.biolscience.cn/MultiScale-CNN-4mCPred/ .
Collapse
|
120
|
Ardern Z, Chakraborty S, Lenk F, Kaster AK. Elucidating the functional roles of prokaryotic proteins using big data and artificial intelligence. FEMS Microbiol Rev 2023; 47:fuad003. [PMID: 36725215 PMCID: PMC9960493 DOI: 10.1093/femsre/fuad003] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2022] [Revised: 01/11/2023] [Accepted: 01/31/2023] [Indexed: 02/03/2023] Open
Abstract
Annotating protein sequences according to their biological functions is one of the key steps in understanding microbial diversity, metabolic potentials, and evolutionary histories. However, even in the best-studied prokaryotic genomes, not all proteins can be characterized by classical in vivo, in vitro, and/or in silico methods-a challenge rapidly growing alongside the advent of next-generation sequencing technologies and their enormous extension of 'omics' data in public databases. These so-called hypothetical proteins (HPs) represent a huge knowledge gap and hidden potential for biotechnological applications. Opportunities for leveraging the available 'Big Data' have recently proliferated with the use of artificial intelligence (AI). Here, we review the aims and methods of protein annotation and explain the different principles behind machine and deep learning algorithms including recent research examples, in order to assist both biologists wishing to apply AI tools in developing comprehensive genome annotations and computer scientists who want to contribute to this leading edge of biological research.
Collapse
Affiliation(s)
- Zachary Ardern
- Institute for Biological Interfaces 5 (Institut für Biologische Grenzflächen IBG 5), Karlsruhe Institute of Technology (KIT), 76344 Eggenstein-Leopoldshafen, Germany
- Wellcome Trust Sanger Institute, Hinxton, Saffron Walden CB10 1RQ, United Kingdom
| | - Sagarika Chakraborty
- Institute for Biological Interfaces 5 (Institut für Biologische Grenzflächen IBG 5), Karlsruhe Institute of Technology (KIT), 76344 Eggenstein-Leopoldshafen, Germany
| | - Florian Lenk
- Institute for Biological Interfaces 5 (Institut für Biologische Grenzflächen IBG 5), Karlsruhe Institute of Technology (KIT), 76344 Eggenstein-Leopoldshafen, Germany
| | - Anne-Kristin Kaster
- Institute for Biological Interfaces 5 (Institut für Biologische Grenzflächen IBG 5), Karlsruhe Institute of Technology (KIT), 76344 Eggenstein-Leopoldshafen, Germany
| |
Collapse
|
121
|
Ranjan A, Tiwari A, Deepak A. A Sub-Sequence Based Approach to Protein Function Prediction via Multi-Attention Based Multi-Aspect Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:94-105. [PMID: 34826296 DOI: 10.1109/tcbb.2021.3130923] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Inferring the protein function(s) via the protein sub-sequence classification is often obstructed due to lack of knowledge about function(s) of sub-sequences in the protein sequence. In this regard, we develop a novel "multi-aspect" paradigm to perform the sub-sequence classification in an efficient way by utilizing the information of the parent sequence. The aspects are: (1) Multi-label: independent labelling of sub-sequences with more than one functions of the parent sequence, and (ii) Label-relevance: scoring the parent functions to highlight the relevance of performing a given function by the sub-sequence. The multi-aspect paradigm is used to propose the "Multi-Attention Based Multi-Aspect Network" for classifying the protein sub-sequences, where multi-attention is a novel approach to process sub-sequences at word-level. Next, the proposed Global-ProtEnc method is a sub-sequence based approach to encoding protein sequences for protein function prediction task, which is finally used to develop as ensemble methods, Global-ProtEnc-Plus. Evaluations of both the Global-ProtEnc and the Global-ProtEnc-Plus methods on the benchmark CAFA3 dataset delivered a outstanding performances. Compared to the state-of-the-art DeepGOPlus, the improvements in Fmax with the Global-ProtEnc-Plus for the biological process is +6.50 percent and cellular component is +1.90 percent.
Collapse
|
122
|
Tharmakulasingam M, Gardner B, La Ragione R, Fernando A. Rectified Classifier Chains for Prediction of Antibiotic Resistance From Multi-Labelled Data With Missing Labels. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:625-636. [PMID: 35130168 DOI: 10.1109/tcbb.2022.3148577] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Predicting Antimicrobial Resistance (AMR) from genomic data has important implications for human and animal healthcare, and especially given its potential for more rapid diagnostics and informed treatment choices. With the recent advances in sequencing technologies, applying machine learning techniques for AMR prediction have indicated promising results. Despite this, there are shortcomings in the literature concerning methodologies suitable for multi-drug AMR prediction and especially where samples with missing labels exist. To address this shortcoming, we introduce a Rectified Classifier Chain (RCC) method for predicting multi-drug resistance. This RCC method was tested using annotated features of genomics sequences and compared with similar multi-label classification methodologies. We found that applying the eXtreme Gradient Boosting (XGBoost) base model to our RCC model outperformed the second-best model, XGBoost based binary relevance model, by 3.3% in Hamming accuracy and 7.8% in F1-score. Additionally, we note that in the literature machine learning models applied to AMR prediction typically are unsuitable for identifying biomarkers informative of their decisions; in this study, we show that biomarkers contributing to AMR prediction can also be identified using the proposed RCC method. We expect this can facilitate genome annotation and pave the path towards identifying new biomarkers indicative of AMR.
Collapse
|
123
|
Sharma L, Deepak A, Ranjan A, Krishnasamy G. A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction. Stat Appl Genet Mol Biol 2023; 22:sagmb-2022-0057. [PMID: 37658681 DOI: 10.1515/sagmb-2022-0057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2022] [Accepted: 04/20/2023] [Indexed: 09/03/2023]
Abstract
Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNN is good at capturing short-term relationships, GRU and LSTM can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein Language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years processing the protein sequences. In this paper, we propose a hybrid CNN + BiGRU - Attention based model with protein language model embedding that effectively combines the output of CNN with the output of BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. The proposed hybrid model improves the Fmax value over the state-of-the-art model SDN2GO for the cellular component prediction task by 1.9 %, for the molecular function prediction task by 3.8 % and for the biological process prediction task by 0.6 % for human dataset and for yeast dataset the cellular component prediction task by 2.4 %, for the molecular function prediction task by 5.2 % and for the biological process prediction task by 1.2 %.
Collapse
Affiliation(s)
- Lavkush Sharma
- Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
| | - Akshay Deepak
- Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
| | - Ashish Ranjan
- Department of Computer Science and Engineering, ITER, Siksha 'O' Anusandhan University (Deemed to be University), Bhubaneswar, Odisha, India
| | | |
Collapse
|
124
|
Kabir A, Shehu A. GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction. Biomolecules 2022; 12:1709. [PMID: 36421723 PMCID: PMC9687818 DOI: 10.3390/biom12111709] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2022] [Revised: 11/14/2022] [Accepted: 11/15/2022] [Indexed: 09/19/2023] Open
Abstract
Protein Language Models (PLMs) are shown to be capable of learning sequence representations useful for various prediction tasks, from subcellular localization, evolutionary relationships, family membership, and more. They have yet to be demonstrated useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open. This paper makes two key contributions. It debuts a novel method that leverages the transformer architecture in two ways. A sequence transformer encodes protein sequences in a task-agnostic feature space. A graph transformer learns a representation of GO terms while respecting their hierarchical relationships. The learned sequence and GO terms representations are combined and utilized for multi-label classification, with the labels corresponding to GO terms. The method is shown superior over recent representative GO prediction methods. The second major contribution in this paper is a deep investigation of different ways of constructing training and testing datasets. The paper shows that existing approaches under- or over-estimate the generalization power of a model. A novel approach is proposed to address these issues, resulting in a new benchmark dataset to rigorously evaluate and compare methods and advance the state-of-the-art.
Collapse
Affiliation(s)
- Anowarul Kabir
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
- Center for Advancing Human-Machine Partnerships, George Mason University, Fairfax, VA 22030, USA
- Department of Bioengineering, George Mason University, Fairfax, VA 22030, USA
- School of Systems Biology, George Mason University, Fairfax, VA 22030, USA
| |
Collapse
|
125
|
Berger J, Legendre F, Zelosko KM, Harrison MC, Grandcolas P, Bornberg-Bauer E, Fouks B. Eusocial Transition in Blattodea: Transposable Elements and Shifts of Gene Expression. Genes (Basel) 2022; 13:1948. [PMID: 36360186 PMCID: PMC9689775 DOI: 10.3390/genes13111948] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2022] [Revised: 10/19/2022] [Accepted: 10/20/2022] [Indexed: 11/07/2023] Open
Abstract
(1) Unravelling the molecular basis underlying major evolutionary transitions can shed light on how complex phenotypes arise. The evolution of eusociality, a major evolutionary transition, has been demonstrated to be accompanied by enhanced gene regulation. Numerous pieces of evidence suggest the major impact of transposon insertion on gene regulation and its role in adaptive evolution. Transposons have been shown to be play a role in gene duplication involved in the eusocial transition in termites. However, evidence of the molecular basis underlying the eusocial transition in Blattodea remains scarce. Could transposons have facilitated the eusocial transition in termites through shifts of gene expression? (2) Using available cockroach and termite genomes and transcriptomes, we investigated if transposons insert more frequently in genes with differential expression in queens and workers and if those genes could be linked to specific functions essential for eusocial transition. (3) The insertion rate of transposons differs among differentially expressed genes and displays opposite trends between termites and cockroaches. The functions of termite transposon-rich queen- and worker-biased genes are related to reproduction and ageing and behaviour and gene expression, respectively. (4) Our study provides further evidence on the role of transposons in the evolution of eusociality, potentially through shifts in gene expression.
Collapse
Affiliation(s)
- Juliette Berger
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum National d’Histoire Naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, CP50, 57 rue Cuvier, 75005 Paris, France
| | - Frédéric Legendre
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum National d’Histoire Naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, CP50, 57 rue Cuvier, 75005 Paris, France
| | - Kevin-Markus Zelosko
- Institute for Evolution and Biodiversity, Molecular Evolution and Bioinformatics, Westfälische Wilhelms-Universität, Hüfferstrasse 1, 48149 Münster, Germany
| | - Mark C. Harrison
- Institute for Evolution and Biodiversity, Molecular Evolution and Bioinformatics, Westfälische Wilhelms-Universität, Hüfferstrasse 1, 48149 Münster, Germany
| | - Philippe Grandcolas
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum National d’Histoire Naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, CP50, 57 rue Cuvier, 75005 Paris, France
| | - Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, Molecular Evolution and Bioinformatics, Westfälische Wilhelms-Universität, Hüfferstrasse 1, 48149 Münster, Germany
- Department of Protein Evolution, Max Planck Institute for Developmental Biology, Max-Planck-Ring 5, 72076 Tübingen, Germany
| | - Bertrand Fouks
- Institute for Evolution and Biodiversity, Molecular Evolution and Bioinformatics, Westfälische Wilhelms-Universität, Hüfferstrasse 1, 48149 Münster, Germany
| |
Collapse
|
126
|
Sengupta K, Saha S, Halder AK, Chatterjee P, Nasipuri M, Basu S, Plewczynski D. PFP-GO: Integrating protein sequence, domain and protein-protein interaction information for protein function prediction using ranked GO terms. Front Genet 2022; 13:969915. [PMID: 36246645 PMCID: PMC9556876 DOI: 10.3389/fgene.2022.969915] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2022] [Accepted: 08/31/2022] [Indexed: 11/13/2022] Open
Abstract
Protein function prediction is gradually emerging as an essential field in biological and computational studies. Though the latter has clinched a significant footprint, it has been observed that the application of computational information gathered from multiple sources has more significant influence than the one derived from a single source. Considering this fact, a methodology, PFP-GO, is proposed where heterogeneous sources like Protein Sequence, Protein Domain, and Protein-Protein Interaction Network have been processed separately for ranking each individual functional GO term. Based on this ranking, GO terms are propagated to the target proteins. While Protein sequence enriches the sequence-based information, Protein Domain and Protein-Protein Interaction Networks embed structural/functional and topological based information, respectively, during the phase of GO ranking. Performance analysis of PFP-GO is also based on Precision, Recall, and F-Score. The same was found to perform reasonably better when compared to the other existing state-of-art. PFP-GO has achieved an overall Precision, Recall, and F-Score of 0.67, 0.58, and 0.62, respectively. Furthermore, we check some of the top-ranked GO terms predicted by PFP-GO through multilayer network propagation that affect the 3D structure of the genome. The complete source code of PFP-GO is freely available at https://sites.google.com/view/pfp-go/.
Collapse
Affiliation(s)
- Kaustav Sengupta
- Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
| | - Sovan Saha
- Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, West Bengal, India
| | - Anup Kumar Halder
- Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
| | - Piyali Chatterjee
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
| | - Mita Nasipuri
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
| | - Subhadip Basu
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- *Correspondence: Subhadip Basu, Dariusz Plewczynski,
| | - Dariusz Plewczynski
- Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- *Correspondence: Subhadip Basu, Dariusz Plewczynski,
| |
Collapse
|
127
|
Panis F, Rompel A. The Novel Role of Tyrosinase Enzymes in the Storage of Globally Significant Amounts of Carbon in Wetland Ecosystems. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2022; 56:11952-11968. [PMID: 35944157 PMCID: PMC9454253 DOI: 10.1021/acs.est.2c03770] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/25/2022] [Revised: 07/01/2022] [Accepted: 07/05/2022] [Indexed: 05/30/2023]
Abstract
Over the last millennia, wetlands have been sequestering carbon from the atmosphere via photosynthesis at a higher rate than releasing it and, therefore, have globally accumulated 550 × 1015 g of carbon, which is equivalent to 73% of the atmospheric carbon pool. The accumulation of organic carbon in wetlands is effectuated by phenolic compounds, which suppress the degradation of soil organic matter by inhibiting the activity of organic-matter-degrading enzymes. The enzymatic removal of phenolic compounds by bacterial tyrosinases has historically been blocked by anoxic conditions in wetland soils, resulting from waterlogging. Bacterial tyrosinases are a subgroup of oxidoreductases that oxidatively remove phenolic compounds, coupled to the reduction of molecular oxygen to water. The biochemical properties of bacterial tyrosinases have been investigated thoroughly in vitro within recent decades, while investigations focused on carbon fluxes in wetlands on a macroscopic level have remained a thriving yet separated research area so far. In the wake of climate change, however, anoxic conditions in wetland soils are threatened by reduced rainfall and prolonged summer drought. This potentially allows tyrosinase enzymes to reduce the concentration of phenolic compounds, which in turn will increase the release of stored carbon back into the atmosphere. To offer compelling evidence for the novel concept that bacterial tyrosinases are among the key enzymes influencing carbon cycling in wetland ecosystems first, bacterial organisms indigenous to wetland ecosystems that harbor a TYR gene within their respective genome (tyr+) have been identified, which revealed a phylogenetically diverse community of tyr+ bacteria indigenous to wetlands based on genomic sequencing data. Bacterial TYR host organisms covering seven phyla (Acidobacteria, Actinobacteria, Bacteroidetes, Firmicutes, Nitrospirae, Planctomycetes, and Proteobacteria) have been identified within various wetland ecosystems (peatlands, marshes, mangrove forests, bogs, and alkaline soda lakes) which cover a climatic continuum ranging from high arctic to tropic ecosystems. Second, it is demonstrated that (in vitro) bacterial TYR activity is commonly observed at pH values characteristic for wetland ecosystems (ranging from pH 3.5 in peatlands and freshwater swamps to pH 9.0 in soda lakes and freshwater marshes) and toward phenolic compounds naturally present within wetland environments (p-coumaric acid, gallic acid, protocatechuic acid, p-hydroxybenzoic acid, caffeic acid, catechin, and epicatechin). Third, analyzing the available data confirmed that bacterial host organisms tend to exhibit in vitro growth optima at pH values similar to their respective wetland habitats. Based on these findings, it is concluded that, following increased aeration of previously anoxic wetland soils due to climate change, TYRs are among the enzymes capable of reducing the concentration of phenolic compounds present within wetland ecosystems, which will potentially destabilize vast amounts of carbon stored in these ecosystems. Finally, promising approaches to mitigate the detrimental effects of increased TYR activity in wetland ecosystems and the requirement of future investigations of the abundance and activity of TYRs in an environmental setting are presented.
Collapse
|
128
|
Watanabe N, Yamamoto M, Murata M, Vavricka CJ, Ogino C, Kondo A, Araki M. Comprehensive Machine Learning Prediction of Extensive Enzymatic Reactions. J Phys Chem B 2022; 126:6762-6770. [PMID: 36053051 DOI: 10.1021/acs.jpcb.2c03287] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
Abstract
New enzyme functions exist within the increasing number of unannotated protein sequences. Novel enzyme discovery is necessary to expand the pathways that can be accessed by metabolic engineering for the biosynthesis of functional compounds. Accordingly, various machine learning models have been developed to predict enzymatic reactions. However, the ability to predict unknown reactions that are not included in the training data has not been clarified. In order to cover uncertain and unknown reactions, a wider range of reaction types must be demonstrated by the models. Here, we establish 16 expanded enzymatic reaction prediction models developed using various machine learning algorithms, including deep neural network. Improvements in prediction performances over that of our previous study indicate that the updated methods are more effective for the prediction of enzymatic reactions. Overall, the deep neural network model trained with combined substrate-enzyme-product information exhibits the highest prediction accuracy with Macro F1 scores up to 0.966 and with robust prediction of unknown enzymatic reactions that are not included in the training data. This model can predict more extensive enzymatic reactions in comparison to previously reported models. This study will facilitate the discovery of new enzymes for the production of useful substances.
Collapse
Affiliation(s)
- Naoki Watanabe
- Department of Chemical Science and Engineering Graduate School of Engineering, Kobe University, 1-1 Rokkodai-cho, Nada, Kobe, Hyogo 657-8501, Japan
| | - Masaki Yamamoto
- Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin Sakyo-ku, Kyoto 606-8507, Japan
| | - Masahiro Murata
- Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin Sakyo-ku, Kyoto 606-8507, Japan
| | - Christopher J Vavricka
- Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan
| | - Chiaki Ogino
- Department of Chemical Science and Engineering Graduate School of Engineering, Kobe University, 1-1 Rokkodai-cho, Nada, Kobe, Hyogo 657-8501, Japan
| | - Akihiko Kondo
- Department of Chemical Science and Engineering Graduate School of Engineering, Kobe University, 1-1 Rokkodai-cho, Nada, Kobe, Hyogo 657-8501, Japan.,Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan
| | - Michihiro Araki
- Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin Sakyo-ku, Kyoto 606-8507, Japan.,Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan.,National Institutes of Biomedical Innovation, Health and Nutrition, National Institute of Health and Nutrition, 1-23-1 Toyama, Shinjuku-ku, Tokyo 162-8638, Japan.,National Cerebral and Cardiovascular Center, 6-1 Kishibe-Shinmachi, Suita, Osaka 564-8565, Japan
| |
Collapse
|
129
|
Ma W, Zhang S, Li Z, Jiang M, Wang S, Lu W, Bi X, Jiang H, Zhang H, Wei Z. Enhancing Protein Function Prediction Performance by Utilizing AlphaFold-Predicted Protein Structures. J Chem Inf Model 2022; 62:4008-4017. [PMID: 36006049 DOI: 10.1021/acs.jcim.2c00885] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
The structure of a protein is of great importance in determining its functionality, and this characteristic can be leveraged to train data-driven prediction models. However, the limited number of available protein structures severely limits the performance of these models. AlphaFold2 and its open-source data set of predicted protein structures have provided a promising solution to this problem, and these predicted structures are expected to benefit the model performance by increasing the number of training samples. In this work, we constructed a new data set that acted as a benchmark and implemented a state-of-the-art structure-based approach for determining whether the performance of the function prediction model can be improved by putting additional AlphaFold-predicted structures into the training set and further compared the performance differences between two models separately trained with real structures only and AlphaFold-predicted structures only. Experimental results indicated that structure-based protein function prediction models could benefit from virtual training data consisting of AlphaFold-predicted structures. First, model performances were improved in all three categories of Gene Ontology terms (GO terms) after adding predicted structures as training samples. Second, the model trained only on AlphaFold-predicted virtual samples achieved comparable performances to the model based on experimentally solved real structures, suggesting that predicted structures were almost equally effective in predicting protein functionality.
Collapse
Affiliation(s)
- Wenjian Ma
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
| | - Shugang Zhang
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China.,High Performance Computing Center, Pilot National Laboratory for Marine Science and Technology (Qingdao), Qingdao 266237, China
| | - Zhen Li
- College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
| | - Mingjian Jiang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266033, China
| | - Shuang Wang
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
| | - Weigang Lu
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
| | - Xiangpeng Bi
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
| | - Huasen Jiang
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
| | - Henggui Zhang
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China.,High Performance Computing Center, Pilot National Laboratory for Marine Science and Technology (Qingdao), Qingdao 266237, China.,Biological Physics Group, School of Physics and Astronomy, University of Manchester, Manchester M13 9PL, U.K
| | - Zhiqiang Wei
- College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China.,High Performance Computing Center, Pilot National Laboratory for Marine Science and Technology (Qingdao), Qingdao 266237, China
| |
Collapse
|
130
|
Mansoor M, Nauman M, Rehman HU, Omar M. Gene Ontology Capsule GAN: an improved architecture for protein function prediction. PeerJ Comput Sci 2022; 8:e1014. [PMID: 36092003 PMCID: PMC9454774 DOI: 10.7717/peerj-cs.1014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2022] [Accepted: 05/31/2022] [Indexed: 06/15/2023]
Abstract
Proteins are the core of all functions pertaining to living things. They consist of an extended amino acid chain folding into a three-dimensional shape that dictates their behavior. Currently, convolutional neural networks (CNNs) have been pivotal in predicting protein functions based on protein sequences. While it is a technology crucial to the niche, the computation cost and translational invariance associated with CNN make it impossible to detect spatial hierarchies between complex and simpler objects. Therefore, this research utilizes capsule networks to capture spatial information as opposed to CNNs. Since capsule networks focus on hierarchical links, they have a lot of potential for solving structural biology challenges. In comparison to the standard CNNs, our results exhibit an improvement in accuracy. Gene Ontology Capsule GAN (GOCAPGAN) achieved an F1 score of 82.6%, a precision score of 90.4% and recall score of 76.1%.
Collapse
Affiliation(s)
- Musadaq Mansoor
- National University of Computer and Emerging Sciences, Islamabad, Peshawar, KPK, Pakistan
| | - Mohammad Nauman
- National University of Computer and Emerging Sciences, Islamabad, Peshawar, KPK, Pakistan
| | - Hafeez Ur Rehman
- National University of Computer and Emerging Sciences, Islamabad, Peshawar, KPK, Pakistan
| | - Maryam Omar
- National University of Computer and Emerging Sciences, Islamabad, Peshawar, KPK, Pakistan
| |
Collapse
|
131
|
Integration of Human Protein Sequence and Protein-Protein Interaction Data by Graph Autoencoder to Identify Novel Protein-Abnormal Phenotype Associations. Cells 2022; 11:cells11162485. [PMID: 36010562 PMCID: PMC9406402 DOI: 10.3390/cells11162485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2022] [Revised: 07/31/2022] [Accepted: 08/05/2022] [Indexed: 11/18/2022] Open
Abstract
Understanding gene functions and their associated abnormal phenotypes is crucial in the prevention, diagnosis and treatment against diseases. The Human Phenotype Ontology (HPO) is a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. However, the current HPO annotations are far from completion, and only a small fraction of human protein-coding genes has HPO annotations. Thus, it is necessary to predict protein-phenotype associations using computational methods. Protein sequences can indicate the structure and function of the proteins, and interacting proteins are more likely to have same function. It is promising to integrate these features for predicting HPO annotations of human protein. We developed GraphPheno, a semi-supervised method based on graph autoencoders, which does not require feature engineering to capture deep features from protein sequences, while also taking into account the topological properties in the protein–protein interaction network to predict the relationships between human genes/proteins and abnormal phenotypes. Cross validation and independent dataset tests show that GraphPheno has satisfactory prediction performance. The algorithm is further confirmed on automatic HPO annotation for no-knowledge proteins under the benchmark of the second Critical Assessment of Functional Annotation, 2013–2014 (CAFA2), where GraphPheno surpasses most existing methods. Further bioinformatics analysis shows that predicted certain phenotype-associated genes using GraphPheno share similar biological properties with known ones. In a case study on the phenotype of abnormality of mitochondrial respiratory chain, top prioritized genes are validated by recent papers. We believe that GraphPheno will help to reveal more associations between genes and phenotypes, and contribute to the discovery of drug targets.
Collapse
|
132
|
Wang S, Atkinson GRS, Hayes WB. SANA: cross-species prediction of Gene Ontology GO annotations via topological network alignment. NPJ Syst Biol Appl 2022; 8:25. [PMID: 35859153 PMCID: PMC9300714 DOI: 10.1038/s41540-022-00232-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2019] [Accepted: 05/20/2022] [Indexed: 12/31/2022] Open
Abstract
Topological network alignment aims to align two networks node-wise in order to maximize the observed common connection (edge) topology between them. The topological alignment of two protein-protein interaction (PPI) networks should thus expose protein pairs with similar interaction partners allowing, for example, the prediction of common Gene Ontology (GO) terms. Unfortunately, no network alignment algorithm based on topology alone has been able to achieve this aim, though those that include sequence similarity have seen some success. We argue that this failure of topology alone is due to the sparsity and incompleteness of the PPI network data of almost all species, which provides the network topology with a small signal-to-noise ratio that is effectively swamped when sequence information is added to the mix. Here we show that the weak signal can be detected using multiple stochastic samples of "good" topological network alignments, which allows us to observe regions of the two networks that are robustly aligned across multiple samples. The resulting network alignment frequency (NAF) strongly correlates with GO-based Resnik semantic similarity and enables the first successful cross-species predictions of GO terms based on topology-only network alignments. Our best predictions have an AUPR of about 0.4, which is competitive with state-of-the-art algorithms, even when there is no observable sequence similarity and no known homology relationship. While our results provide only a "proof of concept" on existing network data, we hypothesize that predicting GO terms from topology-only network alignments will become increasingly practical as the volume and quality of PPI network data increase.
Collapse
Affiliation(s)
- Siyue Wang
- Department of Computer Science, University of California, Irvine, CA, 92697-3435, USA
| | - Giles R S Atkinson
- Department of Computer Science, University of California, Irvine, CA, 92697-3435, USA
| | - Wayne B Hayes
- Department of Computer Science, University of California, Irvine, CA, 92697-3435, USA.
| |
Collapse
|
133
|
Balaji A, Kille B, Kappell AD, Godbold GD, Diep M, Elworth RAL, Qian Z, Albin D, Nasko DJ, Shah N, Pop M, Segarra S, Ternus KL, Treangen TJ. SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning. Genome Biol 2022; 23:133. [PMID: 35725628 PMCID: PMC9208262 DOI: 10.1186/s13059-022-02695-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2021] [Accepted: 05/25/2022] [Indexed: 11/10/2022] Open
Abstract
The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, available for download at www.gitlab.com/treangenlab/seqscreen .
Collapse
Affiliation(s)
- Advait Balaji
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Bryce Kille
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Anthony D Kappell
- Signature Science, LLC, 8329 North Mopac Expressway, Austin, TX, USA
| | - Gene D Godbold
- Signature Science, LLC, 1670 Discovery Drive, Charlottesville, VA, USA
| | - Madeline Diep
- Fraunhofer USA Center Mid-Atlantic CMA, Riverdale, MD, USA
| | - R A Leo Elworth
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Zhiqin Qian
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Dreycey Albin
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Daniel J Nasko
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Nidhi Shah
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Mihai Pop
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Santiago Segarra
- Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
| | - Krista L Ternus
- Signature Science, LLC, 8329 North Mopac Expressway, Austin, TX, USA.
| | - Todd J Treangen
- Department of Computer Science, Rice University, Houston, TX, USA.
| |
Collapse
|
134
|
Odrzywolek K, Karwowska Z, Majta J, Byrski A, Milanowska-Zabel K, Kosciolek T. Deep embeddings to comprehend and visualize microbiome protein space. Sci Rep 2022; 12:10332. [PMID: 35725732 PMCID: PMC9209496 DOI: 10.1038/s41598-022-14055-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2022] [Accepted: 05/31/2022] [Indexed: 12/13/2022] Open
Abstract
Understanding the function of microbial proteins is essential to reveal the clinical potential of the microbiome. The application of high-throughput sequencing technologies allows for fast and increasingly cheaper acquisition of data from microbial communities. However, many of the inferred protein sequences are novel and not catalogued, hence the possibility of predicting their function through conventional homology-based approaches is limited, which indicates the need for further research on alignment-free methods. Here, we leverage a deep-learning-based representation of proteins to assess its utility in alignment-free analysis of microbial proteins. We trained a language model on the Unified Human Gastrointestinal Protein catalogue and validated the resulting protein representation on the bacterial part of the SwissProt database. Finally, we present a use case on proteins involved in SCFA metabolism. Results indicate that the deep learning model manages to accurately represent features related to protein structure and function, allowing for alignment-free protein analyses. Technologies that contextualize metagenomic data are a promising direction to deeply understand the microbiome.
Collapse
Affiliation(s)
- Krzysztof Odrzywolek
- Ardigen, Podole 76, 30-394, Krakow, Poland
- Institute of Computer Science, Faculty of Computer Science, Electronics and Telecommunications, AGH University of Science and Technology, Mickiewicza 30, 30-059, Krakow, Poland
| | - Zuzanna Karwowska
- Malopolska Centre of Biotechnology, Jagiellonian University, Gronostajowa 7A, 30-387, Krakow, Poland
| | - Jan Majta
- Ardigen, Podole 76, 30-394, Krakow, Poland
- Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, 30-387, Krakow, Poland
| | - Aleksander Byrski
- Institute of Computer Science, Faculty of Computer Science, Electronics and Telecommunications, AGH University of Science and Technology, Mickiewicza 30, 30-059, Krakow, Poland
| | | | - Tomasz Kosciolek
- Malopolska Centre of Biotechnology, Jagiellonian University, Gronostajowa 7A, 30-387, Krakow, Poland.
| |
Collapse
|
135
|
Kagaya Y, Flannery ST, Jain A, Kihara D. ContactPFP: Protein Function Prediction Using Predicted Contact Information. FRONTIERS IN BIOINFORMATICS 2022; 2. [PMID: 35875419 PMCID: PMC9302406 DOI: 10.3389/fbinf.2022.896295] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Computational function prediction is one of the most important problems in bioinformatics as elucidating the function of genes is a central task in molecular biology and genomics. Most of the existing function prediction methods use protein sequences as the primary source of input information because the sequence is the most available information for query proteins. There are attempts to consider other attributes of query proteins. Among these attributes, the three-dimensional (3D) structure of proteins is known to be very useful in identifying the evolutionary relationship of proteins, from which functional similarity can be inferred. Here, we report a novel protein function prediction method, ContactPFP, which uses predicted residue-residue contact maps as input structural features of query proteins. Although 3D structure information is known to be useful, it has not been routinely used in function prediction because the 3D structure is not experimentally determined for many proteins. In ContactPFP, we overcome this limitation by using residue-residue contact prediction, which has become increasingly accurate due to rapid development in the protein structure prediction field. ContactPFP takes a query protein sequence as input and uses predicted residue-residue contact as a proxy for the 3D protein structure. To characterize how predicted contacts contribute to function prediction accuracy, we compared the performance of ContactPFP with several well-established sequence-based function prediction methods. The comparative study revealed the advantages and weaknesses of ContactPFP compared to contemporary sequence-based methods. There were many cases where it showed higher prediction accuracy. We examined factors that affected the accuracy of ContactPFP using several illustrative cases that highlight the strength of our method.
Collapse
Affiliation(s)
- Yuki Kagaya
- Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
| | - Sean T. Flannery
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
| | - Aashish Jain
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
| | - Daisuke Kihara
- Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
- Department of Computer Science, Purdue University, West Lafayette, IN, United States
- *Correspondence: Daisuke Kihara,
| |
Collapse
|
136
|
Cong H, Liu H, Cao Y, Chen Y, Liang C. Multiple Protein Subcellular Locations Prediction Based on Deep Convolutional Neural Networks with Self-Attention Mechanism. Interdiscip Sci 2022; 14:421-438. [PMID: 35066812 DOI: 10.1007/s12539-021-00496-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2021] [Revised: 12/06/2021] [Accepted: 12/13/2021] [Indexed: 12/12/2022]
Abstract
As an important research field in bioinformatics, protein subcellular location prediction is critical to reveal the protein functions and provide insightful information for disease diagnosis and drug development. Predicting protein subcellular locations remains a challenging task due to the difficulty of finding representative features and robust classifiers. Many feature fusion methods have been widely applied to tackle the above issues. However, they still suffer from accuracy loss due to feature redundancy. Furthermore, multiple protein subcellular locations prediction is more complicated since it is fundamentally a multi-label classification problem. The traditional binary classifiers or even multi-class classifiers cannot achieve satisfactory results. This paper proposes a novel method for protein subcellular location prediction with both single and multiple sites based on deep convolutional neural networks. Specifically, we first obtain the integrated features by simultaneously considering the pseudo amino acid, amino acid index distribution, and physicochemical property. We then adopt deep convolutional neural networks to extract high-dimensional features from the fused feature, removing the redundant preliminary features and gaining better representations of the raw sequences. Moreover, we use the self-attention mechanism and a customized loss function to ensure that the model is more inclined to positive data. In addition, we use random k-label sets to reduce the number of prediction labels. Meanwhile, we employ a hybrid strategy of over-sampling and under-sampling to tackle the data imbalance problem. We compare our model with three representative classification alternatives. The experiment results show that our model achieves the best performance in terms of accuracy, demonstrating the efficacy of the proposed model.
Collapse
Affiliation(s)
- Hanhan Cong
- School of Information Science and Engineering, Shandong Normal University, Jinan, China
- Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan, China
| | - Hong Liu
- School of Information Science and Engineering, Shandong Normal University, Jinan, China.
- Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan, China.
| | - Yi Cao
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent, Computing University of Jinan, Jinan, China
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent, Computing University of Jinan, Jinan, China
| | - Cheng Liang
- School of Information Science and Engineering, Shandong Normal University, Jinan, China
| |
Collapse
|
137
|
Smalley S, Hellmann H. Review: Exploring possible approaches using ubiquitylation and sumoylation pathways in modifying plant stress tolerance. PLANT SCIENCE : AN INTERNATIONAL JOURNAL OF EXPERIMENTAL PLANT BIOLOGY 2022; 319:111275. [PMID: 35487671 DOI: 10.1016/j.plantsci.2022.111275] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/14/2021] [Revised: 03/24/2022] [Accepted: 03/25/2022] [Indexed: 06/14/2023]
Abstract
Ubiquitin and similar proteins, such as SUMO, are utilized by plants to modify target proteins to rapidly change their stability and activity in cells. This review will provide an overview of these crucial protein interactions with a focus on ubiquitylation and sumoylation in plants and how they contribute to stress tolerance. The work will also explore possibilities to use these highly conserved pathways for novel approaches to generate more robust crop plants better fit to cope with abiotic and biotic stress situations.
Collapse
Affiliation(s)
- Samuel Smalley
- Washington State University, Pullman, WA 99164, United States
| | - Hanjo Hellmann
- Washington State University, Pullman, WA 99164, United States.
| |
Collapse
|
138
|
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomic, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogenous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, they require rigorous optimizations to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Collapse
Affiliation(s)
- Mohsen Hesami
- Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
| | - Milad Alizadeh
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | | | - Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada. .,Institut de Biologie Intégrative Et Des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada.
| |
Collapse
|
139
|
Sapoval N, Aghazadeh A, Nute MG, Antunes DA, Balaji A, Baraniuk R, Barberan CJ, Dannenfelser R, Dun C, Edrisi M, Elworth RAL, Kille B, Kyrillidis A, Nakhleh L, Wolfe CR, Yan Z, Yao V, Treangen TJ. Current progress and open challenges for applying deep learning across the biosciences. Nat Commun 2022; 13:1728. [PMID: 35365602 PMCID: PMC8976012 DOI: 10.1038/s41467-022-29268-7] [Citation(s) in RCA: 112] [Impact Index Per Article: 37.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2021] [Accepted: 03/09/2022] [Indexed: 11/19/2022] Open
Abstract
Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL on five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.
Collapse
Affiliation(s)
- Nicolae Sapoval
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Amirali Aghazadeh
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Michael G Nute
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Dinler A Antunes
- Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
| | - Advait Balaji
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Richard Baraniuk
- Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
| | - C J Barberan
- Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
| | | | - Chen Dun
- Department of Computer Science, Rice University, Houston, TX, USA
| | | | - R A Leo Elworth
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Bryce Kille
- Department of Computer Science, Rice University, Houston, TX, USA
| | | | - Luay Nakhleh
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Cameron R Wolfe
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Zhi Yan
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Vicky Yao
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Todd J Treangen
- Department of Computer Science, Rice University, Houston, TX, USA.
- Department of Bioengineering, Rice University, Houston, TX, USA.
| |
Collapse
|
140
|
Machine learning discovery of missing links that mediate alternative branches to plant alkaloids. Nat Commun 2022; 13:1405. [PMID: 35296652 PMCID: PMC8927377 DOI: 10.1038/s41467-022-28883-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2021] [Accepted: 02/16/2022] [Indexed: 01/12/2023] Open
Abstract
Engineering the microbial production of secondary metabolites is limited by the known reactions of correctly annotated enzymes. Therefore, the machine learning discovery of specialized enzymes offers great potential to expand the range of biosynthesis pathways. Benzylisoquinoline alkaloid production is a model example of metabolic engineering with potential to revolutionize the paradigm of sustainable biomanufacturing. Existing bacterial studies utilize a norlaudanosoline pathway, whereas plants contain a more stable norcoclaurine pathway, which is exploited in yeast. However, committed aromatic precursors are still produced using microbial enzymes that remain elusive in plants, and additional downstream missing links remain hidden within highly duplicated plant gene families. In the current study, machine learning is applied to predict and select plant missing link enzymes from homologous candidate sequences. Metabolomics-based characterization of the selected sequences reveals potential aromatic acetaldehyde synthases and phenylpyruvate decarboxylases in reconstructed plant gene-only benzylisoquinoline alkaloid pathways from tyrosine. Synergistic application of the aryl acetaldehyde producing enzymes results in enhanced benzylisoquinoline alkaloid production through hybrid norcoclaurine and norlaudanosoline pathways. Producing plant secondary metabolites by microbes is limited by the known enzymatic reactions. Here, the authors apply machine learning to predict missing link enzymes of benzylisoquinoline alkaloid (BIA) biosynthesis in Papaver somniferum, and validate the specialized activities through heterologous production.
Collapse
|
141
|
Ilgisonis EV, Pogodin PV, Kiseleva OI, Tarbeeva SN, Ponomarenko EA. Evolution of Protein Functional Annotation: Text Mining Study. J Pers Med 2022; 12:jpm12030479. [PMID: 35330478 PMCID: PMC8952229 DOI: 10.3390/jpm12030479] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2022] [Revised: 03/07/2022] [Accepted: 03/08/2022] [Indexed: 11/23/2022] Open
Abstract
Within the Human Proteome Project initiative framework for creating functional annotations of uPE1 proteins, the neXt-CP50 Challenge was launched in 2018. In analogy with the missing-protein challenge, each command deciphers the functional features of the proteins in the chromosome-centric mode. However, the neXt-CP50 Challenge is more complicated than the missing-protein challenge: the approaches and methods for solving the problem are clear, but neither the concept of protein function nor specific experimental and/or bioinformatics protocols have been standardized to address it. We proposed using a retrospective analysis of the key HPP repository, the neXtProt database, to identify the most frequently used experimental and bioinformatic methods for analyzing protein functions, and the dynamics of accumulation of functional annotations. It has been shown that the dynamics of the increase in the number of proteins with known functions are greater than the progress made in the experimental confirmation of the existence of questionable proteins in the framework of the missing-protein challenge. At the same time, the functional annotation is based on the guilty-by-association postulate, according to which, based on large-scale experiments on API-MS and Y2H, proteins with unknown functions are most likely mapped through “handshakes” to biochemical processes.
Collapse
|
142
|
Pasternak Z, Chapnik N, Yosef R, Kopelman NM, Jurkevitch E, Segev E. Identifying protein function and functional links based on large-scale co-occurrence patterns. PLoS One 2022; 17:e0264765. [PMID: 35239724 PMCID: PMC8893610 DOI: 10.1371/journal.pone.0264765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2021] [Accepted: 02/16/2022] [Indexed: 11/23/2022] Open
Abstract
Objective The vast majority of known proteins have not been experimentally tested even at the level of measuring their expression, and the function of many proteins remains unknown. In order to decipher protein function and examine functional associations, we developed "Cliquely", a software tool based on the exploration of co-occurrence patterns. Computational model Using a set of more than 23 million proteins divided into 404,947 orthologous clusters, we explored the co-occurrence graph of 4,742 fully sequenced genomes from the three domains of life. Edge weights in this graph represent co-occurrence probabilities. We use the Bron–Kerbosch algorithm to detect maximal cliques in this graph, fully-connected subgraphs that represent meaningful biological networks from different functional categories. Main results We demonstrate that Cliquely can successfully identify known networks from various pathways, including nitrogen fixation, glycolysis, methanogenesis, mevalonate and ribosome proteins. Identifying the virulence-associated type III secretion system (T3SS) network, Cliquely also added 13 previously uncharacterized novel proteins to the T3SS network, demonstrating the strength of this approach. Cliquely is freely available and open source. Users can employ the tool to explore co-occurrence networks using a protein of interest and a customizable level of stringency, either for the entire dataset or for a one of the three domains—Archaea, Bacteria, or Eukarya.
Collapse
Affiliation(s)
- Zohar Pasternak
- Division of Identification and Forensic Science, Israel Police, Jerusalem, Israel
- Faculty of Management of Technology, Holon Institute of Technology, Holon, Israel
| | - Noam Chapnik
- Faculty of Management of Technology, Holon Institute of Technology, Holon, Israel
| | - Roy Yosef
- Faculty of Management of Technology, Holon Institute of Technology, Holon, Israel
| | - Naama M. Kopelman
- Faculty of Science, Holon Institute of Technology, Holon, Israel
- * E-mail:
| | - Edouard Jurkevitch
- Department of Plant Pathology and Microbiology, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Elad Segev
- Faculty of Science, Holon Institute of Technology, Holon, Israel
| |
Collapse
|
143
|
Long short term memory based functional characterization model for unknown protein sequences using ensemble of shallow and deep features. Neural Comput Appl 2022. [DOI: 10.1007/s00521-021-06674-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
144
|
Reijnders MJ. Wei2GO: weighted sequence similarity-based protein function prediction. PeerJ 2022; 10:e12931. [PMID: 35186498 PMCID: PMC8855713 DOI: 10.7717/peerj.12931] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2020] [Accepted: 01/21/2022] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND Protein function prediction is an important part of bioinformatics and genomics studies. There are many different predictors available, however most of these are in the form of web-servers instead of open-source locally installable versions. Such local versions are necessary to perform large scale genomics studies due to the presence of limitations imposed by web servers such as queues, prediction speed, and updatability of databases. METHODS This paper describes Wei2GO: a weighted sequence similarity and python-based open-source protein function prediction software. It uses DIAMOND and HMMScan sequence alignment searches against the UniProtKB and Pfam databases respectively, transfers Gene Ontology terms from the reference protein to the query protein, and uses a weighing algorithm to calculate a score for the Gene Ontology annotations. RESULTS Wei2GO is compared against the Argot2 and Argot2.5 web servers, which use a similar concept, and DeepGOPlus which acts as a reference. Wei2GO shows an increase in performance according to precision and recall curves, Fmax scores, and Smin scores for biological process and molecular function ontologies. Computational time compared to Argot2 and Argot2.5 is decreased from several hours to several minutes. AVAILABILITY Wei2GO is written in Python 3, and can be found at https://gitlab.com/mreijnders/Wei2GO.
Collapse
Affiliation(s)
- Maarten J.M.F. Reijnders
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| |
Collapse
|
145
|
Zhao C, Liu T, Wang Z. PANDA2: protein function prediction using graph neural networks. NAR Genom Bioinform 2022; 4:lqac004. [PMID: 35118378 PMCID: PMC8808544 DOI: 10.1093/nargab/lqac004] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2021] [Revised: 11/20/2021] [Accepted: 01/05/2022] [Indexed: 12/13/2022] Open
Abstract
High-throughput sequencing technologies have generated massive protein sequences, but the annotations of protein sequences highly rely on the low-throughput and expensive biological experiments. Therefore, accurate and fast computational alternatives are needed to infer functional knowledge from protein sequences. The gene ontology (GO) directed acyclic graph (DAG) contains the hierarchical relationships between GO terms but is hard to be integrated into machine learning algorithms for functional predictions. We developed a deep learning system named PANDA2 to predict protein functions, which used the cutting-edge graph neural network to model the topology of the GO DAG and integrated the features generated by transformer protein language models. Compared with the top 10 methods in CAFA3, PANDA2 ranked first in cellular component ontology (CCO), tied first in biological process ontology (BPO) but had a higher coverage rate, and second in molecular function ontology (MFO). Compared with other recently-developed cutting-edge predictors DeepGOPlus, GOLabeler, and DeepText2GO, and benchmarked on another independent dataset, PANDA2 ranked first in CCO, first in BPO, and second in MFO. PANDA2 can be freely accessed from http://dna.cs.miami.edu/PANDA2/.
Collapse
Affiliation(s)
- Chenguang Zhao
- Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL 33124, USA
| | - Tong Liu
- Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL 33124, USA
| | - Zheng Wang
- Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL 33124, USA
| |
Collapse
|
146
|
Zancolli G, Reijnders M, Waterhouse RM, Robinson-Rechavi M. Convergent evolution of venom gland transcriptomes across Metazoa. Proc Natl Acad Sci U S A 2022; 119:e2111392119. [PMID: 34983844 PMCID: PMC8740685 DOI: 10.1073/pnas.2111392119] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/10/2021] [Indexed: 12/13/2022] Open
Abstract
Animals have repeatedly evolved specialized organs and anatomical structures to produce and deliver a mixture of potent bioactive molecules to subdue prey or predators-venom. This makes it one of the most widespread, convergent functions in the animal kingdom. Whether animals have adopted the same genetic toolkit to evolved venom systems is a fascinating question that still eludes us. Here, we performed a comparative analysis of venom gland transcriptomes from 20 venomous species spanning the main Metazoan lineages to test whether different animals have independently adopted similar molecular mechanisms to perform the same function. We found a strong convergence in gene expression profiles, with venom glands being more similar to each other than to any other tissue from the same species, and their differences closely mirroring the species phylogeny. Although venom glands secrete some of the fastest evolving molecules (toxins), their gene expression does not evolve faster than evolutionarily older tissues. We found 15 venom gland-specific gene modules enriched in endoplasmic reticulum stress and unfolded protein response pathways, indicating that animals have independently adopted stress response mechanisms to cope with mass production of toxins. This, in turn, activates regulatory networks for epithelial development, cell turnover, and maintenance, which seem composed of both convergent and lineage-specific factors, possibly reflecting the different developmental origins of venom glands. This study represents a first step toward an understanding of the molecular mechanisms underlying the repeated evolution of one of the most successful adaptive traits in the animal kingdom.
Collapse
Affiliation(s)
- Giulia Zancolli
- Department of Ecology and Evolution, University of Lausanne, Lausanne 1015, Switzerland;
- Evolutionary Bioinformatics Group, Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
| | - Maarten Reijnders
- Department of Ecology and Evolution, University of Lausanne, Lausanne 1015, Switzerland
- Evolutionary-Functional Genomics Group, Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
| | - Robert M Waterhouse
- Department of Ecology and Evolution, University of Lausanne, Lausanne 1015, Switzerland
- Evolutionary-Functional Genomics Group, Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
| | - Marc Robinson-Rechavi
- Department of Ecology and Evolution, University of Lausanne, Lausanne 1015, Switzerland
- Evolutionary Bioinformatics Group, Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
| |
Collapse
|
147
|
Dugé de Bernonville T, Amor Stander E, Dugé de Bernonville G, Besseau S, Courdavault V. Predicting Monoterpene Indole Alkaloid-Related Genes from Expression Data with Artificial Neural Networks. Methods Mol Biol 2022; 2505:131-140. [PMID: 35732942 DOI: 10.1007/978-1-0716-2349-7_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Elucidation of biological pathways leading to specialized metabolites remains a complex task. It is however a mandatory step to allow bioproduction into heterologous hosts. Many steps have already been identified using conventional approaches, enlarging the space of known possible chemical steps. In the recent past years, identification of missing steps has been fueled by the generation of genomic and transcriptomic data for nonmodel species. The analysis of gene expression profiles has revealed that in many cases, genes encoding enzymes involved in the same biosynthetic pathways are coexpressed across different tissue types and environmental conditions. Hence, coexpressed studies, either in the form of differential gene expression, gene coexpression network, or unsupervised clustering methods, have helped deciphering missing steps to complete knowledge on biosynthetic pathways. Already identified biosynthetic steps can be used as baits to capture the remaining unknown steps. The present protocol shows how supervised machine learning in the form of artificial neural networks (ANNs) can efficiently classify genes as specialized metabolism related or not according to their expression levels. Using Catharanthus roseus as an example, we show that ANN trained on a minimal set of bait genes results in many true positives (correctly predicted genes) while keeping false positives low (containing possible candidate genes).
Collapse
Affiliation(s)
| | - Emily Amor Stander
- EA2106 Biomolécules et Biotechnologies Végétales, Université de Tours, Tours, France
| | | | - Sébastien Besseau
- EA2106 Biomolécules et Biotechnologies Végétales, Université de Tours, Tours, France
| | - Vincent Courdavault
- EA2106 Biomolécules et Biotechnologies Végétales, Université de Tours, Tours, France
| |
Collapse
|
148
|
Dutta D, Banerjee S, Pal M, Gangopadhyay G. Validation of determinate ( dt) gene-based DNA marker in inter-specific hybrid sesame and in-silico analysis of the predicted dt protein structures. PHYSIOLOGY AND MOLECULAR BIOLOGY OF PLANTS : AN INTERNATIONAL JOURNAL OF FUNCTIONAL PLANT BIOLOGY 2022; 28:139-152. [PMID: 35221576 PMCID: PMC8847511 DOI: 10.1007/s12298-022-01135-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/13/2021] [Revised: 12/04/2021] [Accepted: 01/14/2022] [Indexed: 06/14/2023]
Abstract
UNLABELLED Determinacy is a desirable trait in sesame, an important oilseed crop. We have developed an inter-specific hybrid between basally branched indeterminate cultivated Sesamum indicum genotype and wild S. prostratum with no branching yet synchronous pods on the shoot. The hybrid and a few exotic sesame germplasms were successfully screened with a determinacy (dt) gene-based DNA marker. In-silico translation of the partial coding sequences of the dt gene from the two contrasting parent genotypes revealed an SNP (V159A) in S. prostratum. The predicted cytoplasmic dt protein showed a high resemblance with flowering protein centroradialis. SUPPLEMENTARY INFORMATION The online version contains supplementary material available at 10.1007/s12298-022-01135-1.
Collapse
Affiliation(s)
- Debabrata Dutta
- Division of Plant Biology, Bose Institute (Main Campus), 93/1 APC Road, Kolkata, 700009 India
| | - Saptadipa Banerjee
- Division of Plant Biology, Bose Institute (Main Campus), 93/1 APC Road, Kolkata, 700009 India
| | - Manisha Pal
- Division of Plant Biology, Bose Institute (Main Campus), 93/1 APC Road, Kolkata, 700009 India
| | - Gaurab Gangopadhyay
- Division of Plant Biology, Bose Institute (Main Campus), 93/1 APC Road, Kolkata, 700009 India
| |
Collapse
|
149
|
Zhang Z, Zhao Y, Wang J, Guo M. DeepRCI: predicting ATP-binding proteins using the residue-residue contact information. IEEE J Biomed Health Inform 2021; 26:2822-2829. [PMID: 34941538 DOI: 10.1109/jbhi.2021.3137840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Adenine-5'-triphosphate (ATP) is a direct energy source for various activities of tissues and cells in the body. The release of ATP energies requires the assistance of ATP-binding proteins. Therefore, the identification of ATP-binding proteins is of great significance for the research on organisms. So far, there are several methods for predicting ATP-binding proteins. However, the accuracies of these methods are so low that the predicted proteins are inaccurate. Here, we designed a novel method, called as DeepRCI (based on Deep convolutional neural network and Residue-residue Contact Information), for predicting ATP-binding proteins. DeepRCI achieved an accuracy of 93.61\% on the test set which was a significant improvement over the state-of-the-art methods.
Collapse
|
150
|
Insights of the Neofusicoccum parvum- Liquidambar styraciflua Interaction and Identification of New Cysteine-Rich Proteins in Both Species. J Fungi (Basel) 2021; 7:jof7121027. [PMID: 34947009 PMCID: PMC8707630 DOI: 10.3390/jof7121027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2021] [Revised: 11/26/2021] [Accepted: 11/26/2021] [Indexed: 11/16/2022] Open
Abstract
Neofusicoccum parvum belongs to the Botryosphaeriaceae family, which contains endophytes and pathogens of woody plants. In this study, we isolated 11 strains from diseased tissue of Liquidambar styraciflua. Testing with Koch's postulates-followed by a molecular approach-revealed that N. parvum was the most pathogenic strain. We established an in vitro pathosystem (L. styraciflua foliar tissue-N. parvum) in order to characterize the infection process during the first 16 days. New CysRPs were identified for both organisms using public transcriptomic and genomic databases, while mRNA expression of CysRPs was analyzed by RT-qPCR. The results showed that N. parvum caused disease symptoms after 24 h that intensified over time. Through in silico analysis, 5 CysRPs were identified for each organism, revealing that all of the proteins are potentially secreted and novel, including two of N. parvum proteins containing the CFEM domain. Interestingly, the levels of the CysRPs mRNAs change during the interaction. This study reports N. parvum as a pathogen of L. styraciflua for the first time and highlights the potential involvement of CysRPs in both organisms during this interaction.
Collapse
|