1
Monjot A, Rousseau J, Bittner L, Lepère C. Metatranscriptomes-based sequence similarity networks uncover genetic signatures within parasitic freshwater microbial eukaryotes. Microbiome 2025; 13:43. [PMID: 39915863 PMCID: PMC11800578 DOI: 10.1186/s40168-024-02027-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2024] [Accepted: 12/31/2024] [Indexed: 02/09/2025]
Abstract
BACKGROUND Microbial eukaryotes play a crucial role in biochemical cycles and aquatic trophic food webs. Their taxonomic and functional diversity are increasingly well described due to recent advances in sequencing technologies. However, the vast amount of data produced by -omics approaches requires data-driven methodologies to make predictions about these microorganisms' role within ecosystems. Using metatranscriptomics data, we employed a sequence similarity network-based approach to explore the metabolic specificities of microbial eukaryotes with different trophic modes in a freshwater ecosystem (Lake Pavin, France). RESULTS A total of 2,165,106 proteins were clustered in connected components, enabling analysis of a great number of sequences without any references in public databases. This approach, coupled with the use of an in-house trophic modes database, improved the number of proteins considered by 42%. Our study confirmed the versatility of mixotrophic metabolisms with a large number of shared protein families among mixotrophic and phototrophic microorganisms as well as mixotrophic and heterotrophic microorganisms. Genetic similarities in proteins of saprotrophs and parasites also suggest that fungi-like organisms from Lake Pavin, such as Chytridiomycota and Oomycetes, exhibit a wide range of lifestyles, influenced by their degree of dependence on a host. This plasticity may occur at a fine taxonomic level (e.g., species level) and likely within a single organism in response to environmental parameters. While we observed a relative functional redundancy of primary metabolisms (e.g., amino acid and carbohydrate metabolism), nearly 130,000 protein families appeared to be trophic mode-specific. We found a particular specificity in obligate parasite-related Specific Protein Clusters, underscoring a high degree of specialization in these organisms.
CONCLUSIONS Although no universal marker for parasitism was identified, candidate genes can be proposed at a fine taxonomic scale. We notably provide several protein families that could serve as keys to understanding host-parasite interactions representing pathogenicity factors (e.g., involved in hijacking host resources, or associated with immune evasion mechanisms). All these protein families could offer valuable insights for developing antiparasitic treatments in health and economic contexts.
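The clustering step described in this abstract (proteins grouped into connected components of a sequence similarity network, then screened for trophic-mode specificity) can be illustrated with a minimal sketch. It is not the authors' pipeline: the edge list is assumed to come from an all-against-all similarity search that has already been filtered by some threshold, and the names `connected_components`, `trophic_specific`, and the toy protein IDs are hypothetical.

```python
from collections import defaultdict


def connected_components(edges):
    """Group protein IDs into connected components using union-find.

    `edges` is an iterable of (protein_a, protein_b) pairs, assumed to be
    similarity hits that already passed a filtering threshold.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def trophic_specific(components, trophic_mode):
    """Keep components whose annotated members all share one trophic mode.

    `trophic_mode` maps protein ID -> trophic mode label (hypothetical
    stand-in for a trophic modes database); unannotated proteins are ignored.
    """
    specific = []
    for comp in components:
        modes = {trophic_mode[p] for p in comp if p in trophic_mode}
        if len(modes) == 1:
            specific.append((modes.pop(), comp))
    return specific


# Toy data, invented for illustration only.
edges = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
modes = {"p1": "parasite", "p2": "parasite",
         "p4": "phototroph", "p5": "mixotroph"}

components = connected_components(edges)
specific = trophic_specific(components, modes)
```

In this toy network, `p3` has no annotation of its own but joins the parasite-only component through `p2`, which mirrors how network clustering can bring unreferenced sequences into an analysis.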
Affiliation(s)
- Arthur Monjot
- CNRS, Laboratoire Microorganismes: Génome Et Environnement, Université Clermont Auvergne, Clermont-Ferrand, 63000, France.
- Jérémy Rousseau
- Institut de Systématique, Evolution, Biodiversité (ISYEB), Muséum National d'Histoire Naturelle, CNRS, Sorbonne Université, EPHE, Université Des Antilles, Paris, France
- Lucie Bittner
- Institut de Systématique, Evolution, Biodiversité (ISYEB), Muséum National d'Histoire Naturelle, CNRS, Sorbonne Université, EPHE, Université Des Antilles, Paris, France
- Institut Universitaire de France, Paris, France
- Cécile Lepère
- CNRS, Laboratoire Microorganismes: Génome Et Environnement, Université Clermont Auvergne, Clermont-Ferrand, 63000, France.
2
Svedberg D, Winiger RR, Berg A, Sharma H, Tellgren-Roth C, Debrunner-Vossbrinck BA, Vossbrinck CR, Barandun J. Functional annotation of a divergent genome using sequence and structure-based similarity. BMC Genomics 2024; 25:6. [PMID: 38166563 PMCID: PMC10759460 DOI: 10.1186/s12864-023-09924-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2023] [Accepted: 12/18/2023] [Indexed: 01/04/2024]
Abstract
BACKGROUND Microsporidia are a large taxon of intracellular pathogens characterized by extraordinarily streamlined genomes with unusually high sequence divergence and many species-specific adaptations. These unique factors pose challenges for traditional genome annotation methods based on sequence similarity. As a result, many of the microsporidian genomes sequenced to date contain numerous genes of unknown function. Recent innovations in rapid and accurate structure prediction and comparison, together with the growing amount of data in structural databases, provide new opportunities to assist in the functional annotation of newly sequenced genomes. RESULTS In this study, we established a workflow that combines sequence and structure-based functional gene annotation approaches employing a ChimeraX plugin named ANNOTEX (Annotation Extension for ChimeraX), allowing for visual inspection and manual curation. We employed this workflow on a high-quality telomere-to-telomere sequenced tetraploid genome of Vairimorpha necatrix. First, the 3080 predicted protein-coding DNA sequences, of which 89% were confirmed with RNA sequencing data, were used as input. Next, ColabFold was used to create protein structure predictions, followed by a Foldseek search for structural matching to the PDB and AlphaFold databases. The subsequent manual curation, using sequence and structure-based hits, increased the accuracy and quality of the functional genome annotation compared to results using only traditional annotation tools. Our workflow resulted in a comprehensive description of the V. necatrix genome, along with a structural summary of the most prevalent protein groups, such as the ricin B lectin family. In addition, and to test our tool, we identified the functions of several previously uncharacterized Encephalitozoon cuniculi genes. 
CONCLUSION We provide a new functional annotation tool for divergent organisms and employ it on a newly sequenced, high-quality microsporidian genome to shed light on this uncharacterized intracellular pathogen of Lepidoptera. The addition of a structure-based annotation approach can serve as a valuable template for studying other microsporidian or similarly divergent species.
Affiliation(s)
- Dennis Svedberg
- Department of Molecular Biology, The Laboratory for Molecular Infection Medicine Sweden (MIMS), Science for Life Laboratory, Umeå Centre for Microbial Research (UCMR), Umeå University, Umeå, 90187, Sweden
- Department of Medical Biochemistry and Biophysics, Umeå University, Umeå, 90736, Sweden
- Rahel R Winiger
- Department of Molecular Biology, The Laboratory for Molecular Infection Medicine Sweden (MIMS), Science for Life Laboratory, Umeå Centre for Microbial Research (UCMR), Umeå University, Umeå, 90187, Sweden
- Alexandra Berg
- Department of Molecular Biology, The Laboratory for Molecular Infection Medicine Sweden (MIMS), Science for Life Laboratory, Umeå Centre for Microbial Research (UCMR), Umeå University, Umeå, 90187, Sweden
- Department of Medical Biochemistry and Biophysics, Umeå University, Umeå, 90736, Sweden
- Himanshu Sharma
- Department of Molecular Biology, The Laboratory for Molecular Infection Medicine Sweden (MIMS), Science for Life Laboratory, Umeå Centre for Microbial Research (UCMR), Umeå University, Umeå, 90187, Sweden
- Department of Medical Biochemistry and Biophysics, Umeå University, Umeå, 90736, Sweden
- Christian Tellgren-Roth
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Charles R Vossbrinck
- Department of Environmental Science, Connecticut Agricultural Experiment Station, New Haven, CT, 06504, USA
- Jonas Barandun
- Department of Molecular Biology, The Laboratory for Molecular Infection Medicine Sweden (MIMS), Science for Life Laboratory, Umeå Centre for Microbial Research (UCMR), Umeå University, Umeå, 90187, Sweden.
3
Wang J, Chen C, Yao G, Ding J, Wang L, Jiang H. Intelligent Protein Design and Molecular Characterization Techniques: A Comprehensive Review. Molecules 2023; 28:7865. [PMID: 38067593 PMCID: PMC10707872 DOI: 10.3390/molecules28237865] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 11/13/2023] [Accepted: 11/23/2023] [Indexed: 12/18/2023]
Abstract
In recent years, the widespread application of artificial intelligence algorithms in protein structure, function prediction, and de novo protein design has significantly accelerated the process of intelligent protein design and led to many noteworthy achievements. This advancement in protein intelligent design holds great potential to accelerate the development of new drugs, enhance the efficiency of biocatalysts, and even create entirely new biomaterials. Protein characterization is the key to the performance of intelligent protein design. However, there is no consensus on the most suitable characterization method for intelligent protein design tasks. This review describes the methods, characteristics, and representative applications of traditional descriptors, sequence-based and structure-based protein characterization. It discusses their advantages, disadvantages, and scope of application. It is hoped that this could help researchers to better understand the limitations and application scenarios of these methods, and provide valuable references for choosing appropriate protein characterization techniques for related research in the field, so as to better carry out protein research.
Affiliation(s)
- Junjie Ding
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
- Liangliang Wang
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
- Hui Jiang
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
4
Zheng R, Huang Z, Deng L. Large-scale predicting protein functions through heterogeneous feature fusion. Brief Bioinform 2023:bbad243. [PMID: 37401369 DOI: 10.1093/bib/bbad243] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/12/2023] [Revised: 05/18/2023] [Accepted: 06/12/2023] [Indexed: 07/05/2023]
Abstract
As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined. Automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods are typically based on expanding the relatively small number of experimentally determined functions to large collections of proteins with various clues, including sequence homology, protein-protein interaction, gene co-expression, etc. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. The improvement of coverage is because the number of structures predicted by AlphaFold is greatly increased, and on the other hand, PredGO can extensively use non-structural information for functional prediction. Moreover, we show that over 205 000 (~100%) entries in UniProt for human are annotated by PredGO, over 186 000 (~90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.
Affiliation(s)
- Rongtao Zheng
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
- Zhijian Huang
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
- Lei Deng
- School of Computer Science and Engineering, Central South University, 410000 Changsha, China
5
Xu W, Zhao Z, Zhang H, Hu M, Yang N, Wang H, Wang C, Jiao J, Gu L. Deep neural learning based protein function prediction. Mathematical Biosciences and Engineering 2022; 19:2471-2488. [PMID: 35240793 DOI: 10.3934/mbe.2022114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Protein function prediction is vital for annotating uncharacterized proteins. At present, Deep Neural Network (DNN) based protein function prediction is mainly carried out on small-scale protein datasets or Gene Ontology, and usually explores the relationship between a single protein feature and function tags. Practical methods for large-scale, multi-feature protein function prediction still need to be studied in depth. This paper proposes IGP-DNN, a DNN-based protein function prediction approach. The method uses a protein function module extraction algorithm based on the Grasshopper Optimization Algorithm (GOA) and Intuitionistic Fuzzy c-Means clustering (IFCM) to extract the features of protein modules, applies Kernel Principal Component Analysis (KPCA) to reduce the dimensionality of the protein attribute information, and integrates the module features and attribute features. The integrated data are then fed into a DNN with multiple hidden layers to classify proteins and predict protein functions. In the experiments, the F-measure value of IGP-DNN on the DIP dataset reaches 0.4436, which shows better performance.
Affiliation(s)
- Wenjun Xu
- Key Laboratory of Agricultural Electronic Commerce, Ministry of Agriculture, Hefei 230036, China
- Institute of Intelligent Agriculture, Anhui Agricultural University, Hefei 230036, China
- School of Life Sciences, Anhui Agricultural University, Hefei 230036, China
- Zihao Zhao
- School of Information and Computer, Anhui Agricultural University, Hefei 230036, China
- Key Laboratory of Agricultural Electronic Commerce, Ministry of Agriculture, Hefei 230036, China
- Institute of Intelligent Agriculture, Anhui Agricultural University, Hefei 230036, China
- Hongwei Zhang
- School of Information and Computer, Anhui Agricultural University, Hefei 230036, China
- Key Laboratory of Agricultural Electronic Commerce, Ministry of Agriculture, Hefei 230036, China
- Institute of Intelligent Agriculture, Anhui Agricultural University, Hefei 230036, China
- Minglei Hu
- School of Information and Computer, Anhui Agricultural University, Hefei 230036, China
- Key Laboratory of Agricultural Electronic Commerce, Ministry of Agriculture, Hefei 230036, China
- Institute of Intelligent Agriculture, Anhui Agricultural University, Hefei 230036, China
- Ning Yang
- School of Information and Computer, Anhui Agricultural University, Hefei 230036, China
- Key Laboratory of Agricultural Electronic Commerce, Ministry of Agriculture, Hefei 230036, China
- Institute of Intelligent Agriculture, Anhui Agricultural University, Hefei 230036, China
- Hui Wang
- School of Information and Computer, Anhui Agricultural University, Hefei 230036, China
- Key Laboratory of Agricultural Electronic Commerce, Ministry of Agriculture, Hefei 230036, China
- Institute of Intelligent Agriculture, Anhui Agricultural University, Hefei 230036, China
- Chao Wang
- School of Information and Computer, Anhui Agricultural University, Hefei 230036, China
- Key Laboratory of Agricultural Electronic Commerce, Ministry of Agriculture, Hefei 230036, China
- Institute of Intelligent Agriculture, Anhui Agricultural University, Hefei 230036, China
- Jun Jiao
- School of Information and Computer, Anhui Agricultural University, Hefei 230036, China
- Key Laboratory of Agricultural Electronic Commerce, Ministry of Agriculture, Hefei 230036, China
- Institute of Intelligent Agriculture, Anhui Agricultural University, Hefei 230036, China
- Lichuan Gu
- School of Information and Computer, Anhui Agricultural University, Hefei 230036, China
- Key Laboratory of Agricultural Electronic Commerce, Ministry of Agriculture, Hefei 230036, China
- Institute of Intelligent Agriculture, Anhui Agricultural University, Hefei 230036, China
- School of Life Sciences, Anhui Agricultural University, Hefei 230036, China
6
Hubert CB, de Carvalho LPS. Metabolomic approaches for enzyme function and pathway discovery in bacteria. Methods Enzymol 2022; 665:29-47. [DOI: 10.1016/bs.mie.2021.12.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
7
Elhaj-Abdou MEM, El-Dib H, El-Helw A, El-Habrouk M. Deep_CNN_LSTM_GO: Protein function prediction from amino-acid sequences. Comput Biol Chem 2021; 95:107584. [PMID: 34601431 DOI: 10.1016/j.compbiolchem.2021.107584] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/05/2021] [Revised: 09/08/2021] [Accepted: 09/21/2021] [Indexed: 11/15/2022]
Abstract
Protein amino acid sequences can be used to determine the functions of the protein. However, determining the function of a single protein requires many resources and a tremendous amount of time. Computational intelligence methods such as deep learning have been shown to predict protein functions. This paper proposes a hybrid deep neural network model, named Deep_CNN_LSTM_GO, to predict an unknown protein's functions from its sequence. Deep_CNN_LSTM_GO integrates a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) neural network to learn features from amino acid sequences and outputs predictions for the three Gene Ontology (GO) sub-ontologies. The gene ontology represents protein functions in three sub-ontologies: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). The proposed model has been trained and tested using the UniProt-SwissProt dataset. Another test has been done using the Computational Assessment of Function Annotation (CAFA) on the three sub-ontologies. The proposed model outperforms different methods proposed in the field on three evaluation metrics (Fmax, Smin, and AUPR) in the three sub-ontologies (MF, BP, CC).
Affiliation(s)
- Mohamed E M Elhaj-Abdou
- Faculty of Engineering, Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt.
- Hassan El-Dib
- Faculty of Engineering, Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt.
- Amr El-Helw
- Faculty of Engineering, Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt.
8
Mota APZ, Fernandez D, Arraes FBM, Petitot AS, de Melo BP, de Sa MEL, Grynberg P, Saraiva MAP, Guimaraes PM, Brasileiro ACM, Albuquerque EVS, Danchin EGJ, Grossi-de-Sa MF. Evolutionarily conserved plant genes responsive to root-knot nematodes identified by comparative genomics. Mol Genet Genomics 2020; 295:1063-1078. [PMID: 32333171 DOI: 10.1007/s00438-020-01677-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 11/13/2019] [Accepted: 04/04/2020] [Indexed: 01/11/2023]
Abstract
Root-knot nematodes (RKNs, genus Meloidogyne) affect a large number of crops causing severe yield losses worldwide, more specifically in tropical and sub-tropical regions. Several plant species display high resistance levels to Meloidogyne, but a general view of the plant immune molecular responses underlying resistance to RKNs is still lacking. Combining comparative genomics with differential gene expression analysis may allow the identification of widely conserved plant genes involved in RKN resistance. To identify genes that are evolutionarily conserved across plant species, we used OrthoFinder to compare the predicted proteome of 22 plant species, including important crops, spanning 214 Myr of plant evolution. Overall, we identified 35,238 protein orthogroups, of which 6,132 were evolutionarily conserved and universal to all the 22 plant species (PLAnts Common Orthogroups-PLACO). To identify host genes responsive to RKN infection, we analyzed the RNA-seq transcriptome data from RKN-resistant genotypes of a peanut wild relative (Arachis stenosperma), coffee (Coffea arabica L.), soybean (Glycine max L.), and African rice (Oryza glaberrima Steud.) challenged by Meloidogyne spp. using EdgeR and DESeq tools, and we found 2,597 (O. glaberrima), 743 (C. arabica), 665 (A. stenosperma), and 653 (G. max) differentially expressed genes (DEGs) during the resistance response to the nematode. DEGs' classification into the previously characterized 35,238 protein orthogroups allowed identifying 17 orthogroups containing at least one DEG of each resistant Arachis, coffee, soybean, and rice genotype analyzed. Orthogroups contain 364 DEGs related to signaling, secondary metabolite production, cell wall-related functions, peptide transport, transcription regulation, and plant defense, thus revealing evolutionarily conserved RKN-responsive genes.
Interestingly, the 17 DEGs-containing orthogroups (belonging to the PLACO) were also universal to the 22 plant species studied, suggesting that these core genes may be involved in ancestrally conserved immune responses triggered by RKN infection. The comparative genomic approach that we used here represents a promising predictive tool for the identification of other core plant defense-related genes of broad interest that are involved in different plant-pathogen interactions.
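The selection logic in this abstract (keep only orthogroups that contain at least one DEG from every resistant genotype) amounts to a set intersection per orthogroup. The sketch below is illustrative only: the function name `shared_orthogroups`, the input layout, and the toy gene IDs are assumptions, not OrthoFinder output formats or the authors' code.

```python
def shared_orthogroups(orthogroups, degs_by_species):
    """Return orthogroups holding at least one DEG from every species.

    orthogroups:      dict mapping orthogroup ID -> set of gene IDs
    degs_by_species:  dict mapping species name -> set of DEG gene IDs
    (both layouts are hypothetical, chosen for this illustration)
    """
    shared = {}
    for og_id, members in orthogroups.items():
        # DEGs of each species that fall inside this orthogroup
        hits = {sp: members & degs for sp, degs in degs_by_species.items()}
        # keep the orthogroup only if no species came up empty
        if all(hits.values()):
            shared[og_id] = hits
    return shared


# Toy data, invented for illustration only.
orthogroups = {
    "OG1": {"a1", "c1", "g1", "o1", "x9"},
    "OG2": {"a2", "c2"},
    "OG3": {"a3", "c3", "g3", "o3"},
}
degs_by_species = {
    "A. stenosperma": {"a1", "a2"},
    "C. arabica": {"c1", "c2", "c3"},
    "G. max": {"g1", "g3"},
    "O. glaberrima": {"o1", "o3"},
}

shared = shared_orthogroups(orthogroups, degs_by_species)
```

Here only `OG1` survives: `OG2` has no soybean or rice DEG, and `OG3` contains `a3`, which is not differentially expressed in the toy data.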
Affiliation(s)
- Ana Paula Zotta Mota
- EMBRAPA Recursos Genéticos e Biotecnologia, Brasília-DF, Brazil
- Departamento de Biologia Celular e Molecular, UFRGS, Porto Alegre-RS, Brazil
- Diana Fernandez
- EMBRAPA Recursos Genéticos e Biotecnologia, Brasília-DF, Brazil
- IRD, Cirad, Univ Montpellier, IPME, 911, Montpellier, France
- Fabricio B M Arraes
- EMBRAPA Recursos Genéticos e Biotecnologia, Brasília-DF, Brazil
- Departamento de Biologia Celular e Molecular, UFRGS, Porto Alegre-RS, Brazil
- Bruno Paes de Melo
- EMBRAPA Recursos Genéticos e Biotecnologia, Brasília-DF, Brazil
- Departamento de Bioquímica e Biologia Molecular/Bioagro, UFV, Viçosa-MG, Brazil
- Maria E Lisei de Sa
- EMBRAPA Recursos Genéticos e Biotecnologia, Brasília-DF, Brazil
- Empresa de Pesquisa Agropecuária de Minas Gerais, EPAMIG, Uberaba-MG, Brazil
- Maria Fatima Grossi-de-Sa
- EMBRAPA Recursos Genéticos e Biotecnologia, Brasília-DF, Brazil.
- Universidade Católica de Brasília, Brasília-DF, Brazil.
9
Investigation of machine learning techniques on proteomics: A comprehensive survey. Progress in Biophysics and Molecular Biology 2019; 149:54-69. [PMID: 31568792 DOI: 10.1016/j.pbiomolbio.2019.09.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 06/05/2019] [Revised: 09/16/2019] [Accepted: 09/23/2019] [Indexed: 11/21/2022]
Abstract
Proteomics is the large-scale study of proteins, and it has enabled the identification of ever-increasing numbers of proteins. Proteins are a necessary part of living organisms, with numerous functions. The proteome is the complete set of proteins that are produced or modified by an organism or system. The proteome varies with time and with the distinct requirements, or stresses, that a cell or organism experiences. Proteomics is an interdisciplinary area that has benefited from the genetic information of various genome projects. Much proteomics data is gathered with the help of high-throughput techniques such as mass spectrometry and microarrays. It would typically take weeks or months to analyze the data and perform comparisons by hand. Therefore, biologists and chemists are collaborating with computer scientists and mathematicians to create programs and pipelines to computationally analyze the protein data. Using bioinformatics techniques, researchers are capable of faster analysis and protein data storage. The goal of this paper is to briefly review machine learning techniques and their application in the field of proteomics.
10
Saha S, Chatterjee P, Basu S, Nasipuri M, Plewczynski D. FunPred 3.0: improved protein function prediction using protein interaction network. PeerJ 2019; 7:e6830. [PMID: 31198622 PMCID: PMC6535044 DOI: 10.7717/peerj.6830] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 07/20/2018] [Accepted: 03/21/2019] [Indexed: 11/23/2022]
Abstract
Proteins are the most versatile macromolecules in living systems and perform crucial biological functions. In the advent of the post-genomic era, the next generation sequencing is done routinely at the population scale for a variety of species. The challenging problem is to massively determine the functions of proteins that are yet not characterized by detailed experimental studies. Identification of protein functions experimentally is a laborious and time-consuming task involving many resources. We therefore propose the automated protein function prediction methodology using in silico algorithms trained on carefully curated experimental datasets. We present the improved protein function prediction tool FunPred 3.0, an extended version of our previous methodology FunPred 2, which exploits neighborhood properties in protein–protein interaction network (PPIN) and physicochemical properties of amino acids. Our method is validated using the available functional annotations in the PPIN network of Saccharomyces cerevisiae in the latest Munich information center for protein (MIPS) dataset. The PPIN data of S. cerevisiae in MIPS dataset includes 4,554 unique proteins in 13,528 protein–protein interactions after the elimination of the self-replicating and the self-interacting protein pairs. Using the developed FunPred 3.0 tool, we are able to achieve the mean precision, the recall and the F-score values of 0.55, 0.82 and 0.66, respectively. FunPred 3.0 is then used to predict the functions of unpredicted protein pairs (incomplete and missing functional annotations) in MIPS dataset of S. cerevisiae. The method is also capable of predicting the subcellular localization of proteins along with its corresponding functions. The code and the complete prediction results are available freely at: https://github.com/SovanSaha/FunPred-3.0.git.
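As a quick consistency check on the figures reported in this abstract, the F-score can be recomputed from the stated precision and recall. This is a sketch using the standard F-beta formula; the abstract reports mean values, and a mean of per-item F-scores need not equal the F-score of the mean precision and recall, so the agreement here is only approximate by construction. The function name `f_score` is hypothetical.

```python
def f_score(precision, recall, beta=1.0):
    """Standard F-beta score; beta=1 gives the harmonic mean of P and R."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Mean precision and recall as quoted in the abstract above.
f1 = f_score(0.55, 0.82)
print(round(f1, 2))  # 0.66, matching the reported F-score
```

The calculation is 2 × 0.55 × 0.82 / (0.55 + 0.82) = 0.902 / 1.37 ≈ 0.658, which rounds to the reported 0.66.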
Affiliation(s)
- Sovan Saha
- Department of Computer Science and Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, Kolkata, West Bengal, India
- Piyali Chatterjee
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
- Subhadip Basu
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
- Mita Nasipuri
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
- Dariusz Plewczynski
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
- Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland