1
Tariq TB, Munir F, Jabeen I, Gul A, Amir R. Molecular modelling and gene expression analysis to probe the GT-γ trihelix transcription factors in Solanum tuberosum under drought stress. Sci Rep 2025; 15:12471. [PMID: 40216884 PMCID: PMC11992213 DOI: 10.1038/s41598-025-96485-7] [Received: 07/30/2024] [Accepted: 03/28/2025] [Indexed: 04/14/2025] Open
Abstract
GT-γ transcription factors, a subfamily known for their involvement in stress responses, remain uncharacterized in Solanum tuberosum under drought stress. This study employed in-silico approaches and in-vitro expression profiling in different tissues to investigate the potential roles of StGTγ-1, StGTγ-2, StGTγ-3, and StGTγ-4 in the potato's drought tolerance mechanisms. Analysis of cis-regulatory elements showed complex networks controlling stress response. Alpha helices were prevalent in their structures, possibly aiding protein stability and interaction. Additionally, intrinsically disordered regions were observed in some StGT-γ proteins, suggesting their role in stress adaptation through flexibility. Protein structure modeling and validation revealed structural diversity within the GT-γ family, potentially reflecting variations in functionalities. Physicochemical analysis highlighted differences in protein properties that could influence their nuclear function. Post-translational modifications further diversified their functionalities. Subcellular localization prediction and topology analysis confirmed their nuclear localization, aligning with the anticipated role in transcriptional regulation. GT-γ proteins likely regulate genes due to structural variations. This is based on the presence of DNA-binding domains and functional annotation suggesting roles in metabolism, gene expression, and stress response. Molecular docking predicted partners involved in drought response, indicating the role of GT-γ proteins in drought tolerance networks. The identified StGT-γ genes were highly expressed in leaves after 14 days of drought stress, indicating their key role in protecting this vulnerable tissue during drought. This study enhances understanding of GT-γ factors and provides a foundation for the functional characterization and in-depth exploration of the role and regulatory mechanisms of GT-γ genes in the potato's response to drought stress.
Affiliation(s)
- Tayyaba Bint Tariq
  - Department of Agricultural Sciences and Technology, Atta-Ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Faiza Munir
  - Department of Agricultural Sciences and Technology, Atta-Ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Ishrat Jabeen
  - School of Interdisciplinary Engineering and Sciences (SINES), National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Alvina Gul
  - Department of Agricultural Sciences and Technology, Atta-Ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Rabia Amir
  - Department of Agricultural Sciences and Technology, Atta-Ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), Islamabad, Pakistan
2
Li J, Chen X, Huang H, Zeng M, Yu J, Gong X, Ye Q. $\mathcal{S}$able: bridging the gap in protein structure understanding with an empowering and versatile pre-training paradigm. Brief Bioinform 2025; 26:bbaf120. [PMID: 40163822 PMCID: PMC11957296 DOI: 10.1093/bib/bbaf120] [Received: 10/29/2024] [Revised: 01/23/2025] [Accepted: 02/23/2025] [Indexed: 04/02/2025] Open
Abstract
Protein pre-training has emerged as a transformative approach for solving diverse biological tasks. While many contemporary methods focus on sequence-based language models, recent findings highlight that protein sequences alone are insufficient to capture the extensive information inherent in protein structures. Recognizing the crucial role of protein structure in defining function and interactions, we introduce $\mathcal{S}$able, a versatile pre-training model designed to comprehensively understand protein structures. $\mathcal{S}$able incorporates a novel structural encoding mechanism that enhances inter-atomic information exchange and spatial awareness, combined with robust pre-training strategies and lightweight decoders optimized for specific downstream tasks. This approach enables $\mathcal{S}$able to consistently outperform existing methods in tasks such as generation, classification, and regression, demonstrating its superior capability in protein structure representation. The code and models can be accessed via GitHub repository at https://github.com/baaihealth/Sable.
Affiliation(s)
- Jiashan Li
  - Institute for Mathematical Sciences, Renmin University of China, 59 Zhongguancun Street, Beijing 100872, China
- Xi Chen
  - Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
- He Huang
  - Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
- Mingliang Zeng
  - Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
- Jingcheng Yu
  - Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
- Xinqi Gong
  - Institute for Mathematical Sciences, Renmin University of China, 59 Zhongguancun Street, Beijing 100872, China
- Qiwei Ye
  - Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
3
Yadav DK, Srivastava GP, Singh A, Singh M, Yadav N, Tuteja N. Proteome-wide analysis reveals G protein-coupled receptor-like proteins in rice (Oryza sativa). Plant Signaling & Behavior 2024; 19:2365572. [PMID: 38904257 PMCID: PMC11195488 DOI: 10.1080/15592324.2024.2365572] [Received: 04/29/2024] [Accepted: 06/04/2024] [Indexed: 06/22/2024]
Abstract
G protein-coupled receptors (GPCRs) constitute the largest family of transmembrane proteins in metazoans that mediate the regulation of various physiological responses to discrete ligands through heterotrimeric G protein subunits. The existence of GPCRs in plants is contentious, but their comparable crucial role in various signaling pathways necessitates the identification of novel remote GPCR-like proteins that essentially interact with the plant G protein α subunit and facilitate the transduction of various stimuli. In this study, we identified three putative GPCR-like proteins (OsGPCRLPs) (LOC_Os06g09930.1, LOC_Os04g36630.1, and LOC_Os01g54784.1) in the rice proteome using a stringent bioinformatics workflow. The identified OsGPCRLPs exhibited a canonical GPCR 'type I' 7TM topology, patterns, and biologically significant sites for membrane anchorage and desensitization. Cluster-based interactome mapping revealed that the identified proteins interact with the G protein α subunit, which is a characteristic feature of GPCRs. Computational results showing the interaction of the identified GPCR-like proteins with the G protein α subunit, and their further validation by the membrane yeast-two-hybrid assay, strongly suggest the presence of GPCR-like 7TM proteins in the rice proteome. The absence of a regulator of G protein signaling (RGS) box in the C-terminal domain and the presence of signature motifs of canonical GPCRs in the identified OsGPCRLPs strongly suggest that the rice proteome contains GPCR-like proteins that might be involved in signal transduction.
Affiliation(s)
- Dinesh K. Yadav
  - Plant Molecular Biology and Genetic Engineering Laboratory, Department of Botany, University of Allahabad, Prayagraj, India
- Gyan Prakash Srivastava
  - Plant Molecular Biology and Genetic Engineering Laboratory, Department of Botany, University of Allahabad, Prayagraj, India
- Ananya Singh
  - Plant Molecular Biology and Genetic Engineering Laboratory, Department of Botany, University of Allahabad, Prayagraj, India
- Madhavi Singh
  - Plant Molecular Biology and Genetic Engineering Laboratory, Department of Botany, University of Allahabad, Prayagraj, India
- Neelam Yadav
  - Plant Molecular Biology and Genetic Engineering Laboratory, Department of Botany, University of Allahabad, Prayagraj, India
- Narendra Tuteja
  - Plant Molecular Biology, International Centre for Genetic Engineering and Biotechnology, New Delhi, India
4
Ahmad EM, Abdelsamad A, El-Shabrawi HM, El-Awady MAM, Aly MAM, El-Soda M. In-silico identification of putatively functional intergenic small open reading frames in the cucumber genome and their predicted response to biotic and abiotic stresses. Plant, Cell & Environment 2024; 47:5330-5342. [PMID: 39189930 DOI: 10.1111/pce.15104] [Received: 01/07/2024] [Revised: 07/13/2024] [Accepted: 08/10/2024] [Indexed: 08/28/2024]
Abstract
The availability of high-throughput sequencing technologies has increased our understanding of different genomes. However, the genomes of all living organisms still have many unidentified coding sequences. The increased number of missing small open reading frames (sORFs) is due to the length threshold used in most gene identification tools, which is true in the genic and, more importantly and surprisingly, in the intergenic regions. Scanning the cucumber genome intergenic regions revealed 420 723 sORFs. We excluded 3850 sORFs with similarities to annotated cucumber proteins. To propose the functionality of the remaining 416 873 sORFs, we calculated their codon adaptation index (CAI). We found 398 937 novel sORFs (nsORFs) with CAI ≥ 0.7 that were further used for downstream analysis. Searching against the Rfam database revealed 109 nsORFs similar to multiple RNA families. Using SignalP-5.0 and NLS, we identified 11 592 signal peptides. Five predicted proteins interacting with Meloidogyne incognita and powdery mildew proteins were selected using published transcriptome data of host-pathogen interactions. Gene ontology enrichment interpreted the function of those proteins, illustrating that nsORFs' expression could contribute to the cucumber's response to biotic and abiotic stresses. This research highlights the importance of previously overlooked nsORFs in the cucumber genome and provides novel insights into their potential functions.
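The CAI filtering step in this abstract can be illustrated with a minimal sketch. The abstract does not say which reference gene set or software was used, so everything below is an assumption: the classic definition of CAI as the geometric mean of per-codon relative adaptiveness weights, a toy reference set, a pseudocount of 0.5 for unobserved codons, and the hypothetical helper names `relative_adaptiveness` and `cai`.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a DNA sequence into complete codons, ignoring trailing bases."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs):
    """Weight of each codon = its count / count of the most used synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in codons(s) if c in CODON_TABLE)
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:  # skip stop codons, Met and Trp
            continue
        top = max(counts[c] for c in family)
        if top == 0:  # amino acid absent from the reference set
            continue
        for c in family:
            # Assumed pseudocount of 0.5 for unobserved codons avoids log(0).
            weights[c] = max(counts[c], 0.5) / top
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of all scored codons in seq."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set (hypothetical): alanine is encoded by GCT three times, GCC once.
weights = relative_adaptiveness(["GCTGCTGCTGCC"])
candidates = {"orf_a": "GCTGCT", "orf_b": "GCTGCC"}
kept = [name for name, seq in candidates.items() if cai(seq, weights) >= 0.7]
```

With this toy reference, `orf_a` uses only the preferred codon and passes the 0.7 threshold, while `orf_b` does not. A real analysis would derive the weights from a curated set of highly expressed genes of the organism.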
Affiliation(s)
- Esraa M Ahmad
  - Department of Genetics, Faculty of Agriculture, Cairo University, Giza, Egypt
- Ahmed Abdelsamad
  - Department of Genetics, Faculty of Agriculture, Cairo University, Giza, Egypt
- Hattem M El-Shabrawi
  - Plant Biotechnology Department, Genetic Engineering & Biotechnology Division, National Research Center, Giza, Egypt
- Mohammed A M Aly
  - Department of Genetics, Faculty of Agriculture, Cairo University, Giza, Egypt
- Mohamed El-Soda
  - Department of Genetics, Faculty of Agriculture, Cairo University, Giza, Egypt
5
Li C, Luo Y, Xie Y, Zhang Z, Liu Y, Zou L, Xiao F. Structural and functional prediction, evaluation, and validation in the post-sequencing era. Comput Struct Biotechnol J 2024; 23:446-451. [PMID: 38223342 PMCID: PMC10787220 DOI: 10.1016/j.csbj.2023.12.031] [Received: 10/20/2023] [Revised: 12/20/2023] [Accepted: 12/22/2023] [Indexed: 01/16/2024] Open
Abstract
The surge of genome sequencing data has underlined substantial genetic variants of uncertain significance (VUS). The decryption of VUS discovered by sequencing poses a major challenge in the post-sequencing era. Although experimental assays have progressed in classifying VUS, only a tiny fraction of the human genes have been explored experimentally. Thus, it is urgently needed to generate state-of-the-art functional predictors of VUS in silico. Artificial intelligence (AI) is an invaluable tool to assist in the identification of VUS with high efficiency and accuracy. An increasing number of studies indicate that AI has brought an exciting acceleration in the interpretation of VUS, and our group has already used AI to develop protein structure-based prediction models. In this review, we provide an overview of the previous research on AI-based prediction of missense variants, and elucidate the challenges and opportunities for protein structure-based variant prediction in the post-sequencing era.
Affiliation(s)
- Chang Li
  - Clinical Biobank, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Yixuan Luo
  - Beijing Normal University, Beijing, China
- Yibo Xie
  - Information Center, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Zaifeng Zhang
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Ye Liu
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Lihui Zou
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Fei Xiao
  - Clinical Biobank, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
  - Beijing Normal University, Beijing, China
6
Breimann S, Kamp F, Steiner H, Frishman D. AAontology: An Ontology of Amino Acid Scales for Interpretable Machine Learning. J Mol Biol 2024; 436:168717. [PMID: 39053689 DOI: 10.1016/j.jmb.2024.168717] [Received: 06/03/2024] [Revised: 07/15/2024] [Accepted: 07/19/2024] [Indexed: 07/27/2024]
Abstract
Amino acid scales are crucial for protein prediction tasks, and many of them are curated in the AAindex database. Despite various clustering attempts to organize them and to better understand their relationships, these approaches lack the fine-grained classification necessary for satisfactory interpretability in many protein prediction problems. To address this issue, we developed AAontology, a two-level classification for 586 amino acid scales (mainly from AAindex) together with an in-depth analysis of their relations, using bag-of-word-based classification, clustering, and manual refinement over multiple iterations. AAontology organizes physicochemical scales into 8 categories and 67 subcategories, enhancing the interpretability of scale-based machine learning methods in protein bioinformatics. It thereby enables researchers to gain deeper biological insight. We anticipate that AAontology will be a building block to link amino acid properties with protein function and dysfunction, as well as aid informed decision-making in mutation analysis or protein drug design.
Affiliation(s)
- Stephan Breimann
  - Department of Bioinformatics, School of Life Sciences, Technical University of Munich, Freising, Germany
  - Ludwig-Maximilians-University Munich, Biomedical Center, Division of Metabolic Biochemistry, Munich, Germany
  - German Center for Neurodegenerative Diseases (DZNE), Munich, Germany
- Frits Kamp
  - Ludwig-Maximilians-University Munich, Biomedical Center, Division of Metabolic Biochemistry, Munich, Germany
- Harald Steiner
  - Ludwig-Maximilians-University Munich, Biomedical Center, Division of Metabolic Biochemistry, Munich, Germany
  - German Center for Neurodegenerative Diseases (DZNE), Munich, Germany
- Dmitrij Frishman
  - Department of Bioinformatics, School of Life Sciences, Technical University of Munich, Freising, Germany
7
Buchan DWA, Moffat L, Lau A, Kandathil S, Jones D. Deep learning for the PSIPRED Protein Analysis Workbench. Nucleic Acids Res 2024; 52:W287-W293. [PMID: 38747351 PMCID: PMC11223827 DOI: 10.1093/nar/gkae328] [Received: 02/26/2024] [Revised: 04/08/2024] [Accepted: 04/24/2024] [Indexed: 07/06/2024] Open
Abstract
The PSIPRED Workbench is a long-established and popular bioinformatics web service offering a wide range of machine learning-based analyses for characterizing protein structure and function. In this paper we provide an update on recent additions and developments to the webserver, with a focus on new deep learning-based methods. We briefly discuss some trends in server usage since the publication of AlphaFold2 and give an overview of some upcoming developments for the service. The PSIPRED Workbench is available at http://bioinf.cs.ucl.ac.uk/psipred.
Affiliation(s)
- Daniel W A Buchan
  - UCL Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, UK
- Lewis Moffat
  - UCL Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, UK
- Andy Lau
  - UCL Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, UK
- Shaun M Kandathil
  - UCL Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, UK
- David T Jones
  - UCL Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, UK
8
Ye B, Liang J. Predicting Functional Surface Topographies Combining Topological Data Analysis and Deep Learning Across the Human Protein Universe. Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference 2024; 2024:1-4. [PMID: 40039158 DOI: 10.1109/embc53108.2024.10782681] [Indexed: 03/06/2025]
Abstract
Characterizing geometric and topological properties of protein structures encompassing surface pockets, interior cavities, and cross channels is important for understanding their functions. Our knowledge of protein structures has been greatly advanced by AI-powered structure prediction tools, with AlphaFold2 (AF2) providing accurate 3D structure predictions for most protein sequences. Nonetheless, there is a substantial lack of function annotations and corresponding functional surface topographical information. We develop a method to predict functional pockets, along with their associated Gene Ontology (GO) terms and Enzyme Commission (EC) numbers, for a set of 65,013 AF2-predicted human non-singleton representative structures, which can be mapped to 186,095 "non-fragment" AF2-predicted human protein structures. The identification of functional pockets, along with their respective GO terms and EC numbers, is achieved by combining topological data analysis and the deep learning method of DeepFRI. All predicted functional pockets for these 65,013 AF2-predicted human representative structures are accessible at: https://cfold.bme.uic.edu/castpfold.
9
Duo H, Chhabra R, Muthusamy V, Zunjare RU, Hossain F. Assessing sequence variation, haplotype analysis and molecular characterisation of aspartate kinase2 (ask2) gene regulating methionine biosynthesis in diverse maize inbreds. Mol Genet Genomics 2024; 299:7. [PMID: 38349549 DOI: 10.1007/s00438-024-02096-8] [Received: 03/05/2023] [Accepted: 11/02/2023] [Indexed: 02/15/2024]
Abstract
Traditional maize grain is deficient in methionine, an essential amino acid required for proper growth and development in humans and poultry birds. Thus, development of high methionine maize (HMM) assumes great significance in alleviating malnutrition through a sustainable and cost-effective approach. Of the various genetic loci, the aspartate kinase2 (ask2) gene plays a pivotal role in regulating methionine accumulation in maize. Here, we sequenced the entire ask2 gene of 5394 bp with 13 exons in five wild and five mutant maize inbreds to understand variation at the nucleotide level. Sequence analysis revealed that an SNP in exon-13 caused a thymine to adenine transversion, giving rise to a favourable mutant allele associated with a leucine to glutamine substitution in the mutant ASK2 protein. Gene-based diversity analysis with 11 InDel markers grouped 48 diverse inbreds into three major clusters with an average genetic dissimilarity of 0.570 (range, 0.0-0.9). The average major allele frequency, gene diversity and PIC were 0.693, 0.408 and 0.341, respectively. A total of 45 haplotypes of the ask2 gene were identified among the maize inbreds. Evolutionary relationship analysis performed among 22 orthologues grouped them into five major clusters. The number of exons varied from 7 to 17, with length varying from 12 to 495 bp among orthologues. The ASK2 protein with 565 amino acids was predicted to be in a homo-dimeric state with lysine and tartaric acid as binding ligands. Amino acid kinase and ACT domains were found to be conserved in maize and orthologues. The study depicted the presence of enough genetic diversity in the ask2 gene in maize, and development of HMM can be accelerated through introgression of the favourable allele of ask2 into the parental lines of elite hybrids using molecular breeding.
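The three marker statistics reported in this abstract (major allele frequency, gene diversity and PIC) can be sketched for a single marker as follows. The abstract does not state which software or formulas were used, so this assumes the standard definitions: gene diversity as expected heterozygosity, 1 − Σp², and PIC following Botstein et al., 1 − Σp² − Σ over pairs of 2·pᵢ²·pⱼ². The function name `marker_stats` and the example allele calls are hypothetical.

```python
from collections import Counter
from itertools import combinations


def marker_stats(alleles):
    """Return (major allele frequency, gene diversity, PIC) for one marker.

    `alleles` is a list of allele calls, one per inbred line (homozygous
    inbreds assumed, so one call per line).
    """
    counts = Counter(alleles)
    total = sum(counts.values())
    freqs = [n / total for n in counts.values()]

    major_allele_frequency = max(freqs)
    sum_sq = sum(p * p for p in freqs)
    gene_diversity = 1.0 - sum_sq
    # PIC subtracts the share of uninformative matings from gene diversity.
    pic = gene_diversity - sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return major_allele_frequency, gene_diversity, pic


# Hypothetical InDel marker scored in four inbreds: three carry allele "A", one carries "B".
maf, h, pic = marker_stats(["A", "A", "A", "B"])
```

Averaging these per-marker values over all markers would give study-level summaries of the kind quoted in the abstract. PIC is never larger than gene diversity, which matches the ordering of the reported averages (0.408 and 0.341).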
Affiliation(s)
- Hriipulou Duo
  - ICAR-Indian Agricultural Research Institute, New Delhi, India
- Rashmi Chhabra
  - ICAR-Indian Agricultural Research Institute, New Delhi, India
- Firoz Hossain
  - ICAR-Indian Agricultural Research Institute, New Delhi, India
10
Hannon Bozorgmehr J. Four classic "de novo" genes all have plausible homologs and likely evolved from retro-duplicated or pseudogenic sequences. Mol Genet Genomics 2024; 299:6. [PMID: 38315248 DOI: 10.1007/s00438-023-02090-6] [Received: 05/27/2023] [Accepted: 10/15/2023] [Indexed: 02/07/2024]
Abstract
Despite being previously regarded as extremely unlikely, the idea that entirely novel protein-coding genes can emerge from non-coding sequences has gradually become accepted over the past two decades. Examples of "de novo origination", resulting in lineage-specific "orphan" genes, lacking coding orthologs, are now produced every year. However, many are likely cases of duplicates that are difficult to recognize. Here, I re-examine the claims and show that four very well-known examples of genes alleged to have emerged completely "from scratch" (FLJ33706 in humans, Goddard in fruit flies, BSC4 in baker's yeast and AFGP2 in codfish) may have plausible evolutionary ancestors in pre-existing genes. The first two are likely highly diverged retrogenes coding for regulatory proteins that have been misidentified as orphans. The antifreeze glycoprotein, moreover, may not have evolved from repetitive non-genic sequences but, as in several other related cases, from an apolipoprotein that could have become pseudogenized before later being reactivated. These findings detract from various claims made about de novo gene birth and show there has been a tendency not to invest the necessary effort in searching for homologs outside of a very limited syntenic or phylostratigraphic methodology. A robust approach is used for improving detection that draws upon similarities, not just in terms of statistical sequence analysis, but also relating to biochemistry and function, to obviate notable failures to identify homologs.
11
Dagher SF, Vaishnav A, Stanley CB, Meilleur F, Edwards BFP, Bruno-Bárcena JM. Structural analysis and functional evaluation of the disordered β-hexosyltransferase region from Hamamotoa (Sporobolomyces) singularis. Front Bioeng Biotechnol 2023; 11:1291245. [PMID: 38162180 PMCID: PMC10755861 DOI: 10.3389/fbioe.2023.1291245] [Received: 09/08/2023] [Accepted: 11/16/2023] [Indexed: 01/03/2024] Open
Abstract
Hamamotoa (Sporobolomyces) singularis codes for an industrially important membrane-bound β-hexosyltransferase (BHT) (BglA, UniProtKB: Q564N5) that has applications in the production of natural fibers such as galacto-oligosaccharides (GOS) and natural sugars found in human milk. When heterologously expressed by Komagataella phaffii GS115, BHT is found both membrane-bound and as a soluble form secreted into the culture medium. In silico structural predictions and crystal structures support a glycosylated homodimeric enzyme and the presence of an intrinsically disordered region (IDR) with membrane-binding potential within its novel N-terminal region (1-110 amino acids). Additional in silico analysis showed that the IDR may not be essential for stable homodimerization. Thus, we performed progressive deletion analyses targeting segments within the suspected disordered region to determine the N-terminal disordered region's impact on the ratio of membrane-bound to secreted soluble enzyme and its contribution to enzyme activity. The ratio of the soluble secreted to membrane-bound enzyme shifted from 40% to 53% after the disordered N-terminal region was completely removed, while the specific activity was unaffected. Furthermore, functional analysis of each glycosylation site found within the C-terminal domain revealed that total secreted protein activity was reduced by 58%-97% in both the presence and absence of the IDR, indicating that glycosylation at all four locations is required by the host for the secretion of active enzyme and is independent of the removed disordered N-terminal region. Overall, the data provide evidence that the disordered region only partially influences the secretion and membrane localization of BHT.
Affiliation(s)
- Suzanne F. Dagher
  - Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, United States
- Asmita Vaishnav
  - Department of Biochemistry, Microbiology and Immunology, Wayne State University, Detroit, MI, United States
- Flora Meilleur
  - Neutron Sciences Directorate, Oak Ridge National Laboratory, Oak Ridge, TN, United States
  - Department of Molecular and Structural Biochemistry, North Carolina State University, Raleigh, NC, United States
- Brian F. P. Edwards
  - Department of Biochemistry, Microbiology and Immunology, Wayne State University, Detroit, MI, United States
- José M. Bruno-Bárcena
  - Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, United States
12
Krishnamoorthy S, Muruganantham B, Yu JR, Park WY, Muthusami S. Exploring the utility of FTS as a bonafide binding partner for EGFR: A potential drug target for cervical cancer. Comput Biol Med 2023; 167:107592. [PMID: 37976824 DOI: 10.1016/j.compbiomed.2023.107592] [Received: 07/28/2023] [Revised: 09/25/2023] [Accepted: 10/17/2023] [Indexed: 11/19/2023]
Abstract
Establishment of human papilloma virus (HPV) infection and its progression to cervical cancer (CC) requires the participation of epidermal growth factor (EGF) receptor (EGFR) and fused toes homolog (FTS). This review is an attempt to understand the structure-function relationship between FTS and EGFR as a tool for the development of newer CC drugs. Motif analysis was performed using the National Center for Biotechnology Information (NCBI), Kyoto Encyclopedia of Genes and Genomes (KEGG), Simple Modular Architecture Research Tool (SMART) and Multiple Expectation Maximizations for Motif Elicitation (MEME) databases. The secondary and tertiary structure prediction of FTS was performed using DISOPRED3 and threading assembly, respectively. A positive correlation was found between the transcript levels of FTS and EGFR. Amino acids responsible for the interaction between EGFR and FTS were determined. The nine micro-RNAs (miRNAs) that regulate the expression of FTS were predicted using the Network Analyst 3.0 database. hsa-miR-629-5p and hsa-miR-615-3p were identified as significant positive and negative regulators of FTS gene expression. This review opens up new avenues for the development of CC drugs that interfere with the interaction between FTS and EGFR.
Affiliation(s)
- Sneha Krishnamoorthy
  - Department of Biochemistry, Karpagam Academy of Higher Education, Coimbatore, 641021, Tamil Nadu, India
- Bharathi Muruganantham
  - Centre for Cancer Research, Karpagam Academy of Higher Education, Coimbatore, 641021, Tamil Nadu, India
- Jae-Ran Yu
  - Department of Environmental and Tropical Medicine, Konkuk University College of Medicine, Chungju, South Korea
- Woo-Yoon Park
  - Department of Radiation Oncology Hospital, College of Medicine, Chungbuk National University, Cheongju, South Korea
- Sridhar Muthusami
  - Department of Biochemistry, Karpagam Academy of Higher Education, Coimbatore, 641021, Tamil Nadu, India
  - Centre for Cancer Research, Karpagam Academy of Higher Education, Coimbatore, 641021, Tamil Nadu, India
13
Chen J, Gu Z, Lai L, Pei J. In silico protein function prediction: the rise of machine learning-based approaches. Medical Review (2021) 2023; 3:487-510. [PMID: 38282798 PMCID: PMC10808870 DOI: 10.1515/mr-2023-0038] [Received: 08/14/2023] [Accepted: 10/11/2023] [Indexed: 01/30/2024]
Abstract
Proteins function as integral actors in essential life processes, rendering the realm of protein research a fundamental domain that possesses the potential to propel advancements in pharmaceuticals and disease investigation. Within the context of protein research, an imperious demand arises to uncover protein functionalities and untangle intricate mechanistic underpinnings. Due to the exorbitant costs and limited throughput inherent in experimental investigations, computational models offer a promising alternative to accelerate protein function annotation. In recent years, protein pre-training models have exhibited noteworthy advancement across multiple prediction tasks. This advancement highlights a notable prospect for effectively tackling the intricate downstream task associated with protein function prediction. In this review, we elucidate the historical evolution and research paradigms of computational methods for predicting protein function. Subsequently, we summarize the progress in protein and molecule representation as well as feature extraction techniques. Furthermore, we assess the performance of machine learning-based algorithms across various objectives in protein function prediction, thereby offering a comprehensive perspective on the progress within this field.
Affiliation(s)
- Jiaxiao Chen
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Zhonghui Gu
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Luhua Lai
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
  - Research Unit of Drug Design Method, Chinese Academy of Medical Sciences (2021RU014), Beijing, China
- Jianfeng Pei
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - Research Unit of Drug Design Method, Chinese Academy of Medical Sciences (2021RU014), Beijing, China

14
Hartke J, Ceron-Noriega A, Stoldt M, Sistermans T, Kever M, Fuchs J, Butter F, Foitzik S. Long live the host! Proteomic analysis reveals possible strategies for parasitic manipulation of its social host. Mol Ecol 2023; 32:5877-5889. [PMID: 37795937 DOI: 10.1111/mec.17155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 09/15/2023] [Accepted: 09/19/2023] [Indexed: 10/06/2023]
Abstract
Parasites with complex life cycles often manipulate the phenotype of their intermediate hosts to increase the probability of transmission to their definitive hosts. Infection with Anomotaenia brevis, a cestode that uses Temnothorax nylanderi ants as intermediate hosts, leads to a multiple-fold extension of host lifespan and to changes in behaviour, morphology and colouration. The mechanisms behind these changes are unknown, as is whether the increased longevity is achieved through parasite manipulation. Here, we demonstrate that the parasite releases proteins into its host with functions that might explain the observed changes. These parasitic proteins make up a substantial portion of the proteome of the hosts' haemolymph, and thioredoxin peroxidase and superoxide dismutase, two antioxidants, exhibited the highest abundances among them. The largest part of the secreted proteins could not be annotated, indicating they are either novel or severely altered during recent coevolution to function in host manipulation. We also detected shifts in the hosts' proteome with infection, in particular an overabundance of vitellogenin-like A in infected ants, a protein that regulates division of labour in Temnothorax ants, which could explain the observed behavioural changes. Our results thus suggest two different strategies that might be employed by this parasite to manipulate its host: secreting proteins with immediate influence on the host's phenotype and altering the host's translational activity. Our findings highlight the intricate molecular interplay required to influence the phenotype of a host and point to potential signalling pathways and genes involved in parasite-host communication.
Affiliation(s)
- Juliane Hartke
  - Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, Mainz, Germany
- Marah Stoldt
  - Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, Mainz, Germany
- Tom Sistermans
  - Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, Mainz, Germany
- Marion Kever
  - Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, Mainz, Germany
- Jenny Fuchs
  - Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, Mainz, Germany
- Falk Butter
  - Institute of Molecular Biology, Mainz, Germany
- Susanne Foitzik
  - Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, Mainz, Germany

15
Li W, Yu Y, Zhou G, Hu G, Li B, Ma H, Yan W, Pei H. Large-scale ORF screening based on LC-MS to discover novel lncRNA-encoded peptides responding to ionizing radiation and microgravity. Comput Struct Biotechnol J 2023; 21:5201-5211. [PMID: 37928948 PMCID: PMC10624585 DOI: 10.1016/j.csbj.2023.10.040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2023] [Revised: 10/12/2023] [Accepted: 10/18/2023] [Indexed: 11/07/2023]
Abstract
In the human genome, 98% of genes can be transcribed into non-coding RNAs (ncRNAs), among which lncRNAs and their encoded peptides play important roles in regulating various aspects of cellular processes and may serve as crucial factors in modulating the biological effects induced by ionizing radiation and microgravity. Unfortunately, there are few reports in space radiation biology on lncRNA-encoded peptides below 10kD due to limitations in detection techniques. To fill this gap, we integrated a variety of methods based on genomics and peptidomics, and discovered 22 lncRNA-encoded small peptides that are sensitive to space radiation and microgravity, which have never been reported before. We concurrently validated the transmembrane helix, subcellular localization, and biological function of these small peptides using bioinformatics and molecular biology techniques. More importantly, we found that these small peptides function independently of the lncRNAs that encode them. Our findings have uncovered a previously unknown human proteome encoded by 'non-coding' genes in response to space conditions and elucidated their involvement in biological processes, providing valuable strategies for individual protection mechanisms for astronauts who carry out deep space exploration missions in space radiation environments.
Affiliation(s)
- Wanshi Li
  - State Key Laboratory of Radiation Medicine and Protection, School of Radiation Medicine and Protection, Collaborative Innovation Center of Radiological Medicine of Jiangsu Higher Education Institutions, Soochow University, Suzhou 215123, China
- Yongduo Yu
  - State Key Laboratory of Radiation Medicine and Protection, School of Radiation Medicine and Protection, Collaborative Innovation Center of Radiological Medicine of Jiangsu Higher Education Institutions, Soochow University, Suzhou 215123, China
- Guangming Zhou
  - State Key Laboratory of Radiation Medicine and Protection, School of Radiation Medicine and Protection, Collaborative Innovation Center of Radiological Medicine of Jiangsu Higher Education Institutions, Soochow University, Suzhou 215123, China
- Guang Hu
  - Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
  - Center for Systems Biology, Soochow University, Suzhou 215123, China
  - Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Suzhou 215123, China
- Bingyan Li
  - State Key Laboratory of Radiation Medicine and Protection, School of Radiation Medicine and Protection, Collaborative Innovation Center of Radiological Medicine of Jiangsu Higher Education Institutions, Soochow University, Suzhou 215123, China
- Hong Ma
  - Beijing Key Laboratory for Separation and Analysis in Biomedicine and Pharmaceuticals, School of Life Science, Beijing Institute of Technology, Beijing 100081, China
- Wenying Yan
  - Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
  - Center for Systems Biology, Soochow University, Suzhou 215123, China
  - Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Suzhou 215123, China
- Hailong Pei
  - State Key Laboratory of Radiation Medicine and Protection, School of Radiation Medicine and Protection, Collaborative Innovation Center of Radiological Medicine of Jiangsu Higher Education Institutions, Soochow University, Suzhou 215123, China

16
Boadu F, Cao H, Cheng J. Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. Bioinformatics 2023; 39:i318-i325. [PMID: 37387145 DOI: 10.1093/bioinformatics/btad208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023]
Abstract
MOTIVATION Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time-consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, far fewer methods leverage protein structures in protein function prediction because there was a lack of accurate protein structures for most proteins until recently. RESULTS We developed TransFun, a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy. AVAILABILITY AND IMPLEMENTATION The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.
Affiliation(s)
- Frimpong Boadu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, United States
- Hongyuan Cao
  - Department of Statistics, Florida State University, Tallahassee, FL 32306, United States
- Jianlin Cheng
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, United States

17
Wang Z, Deng Z, Zhang W, Lou Q, Choi KS, Wei Z, Wang L, Wu J. MMSMAPlus: a multi-view multi-scale multi-attention embedding model for protein function prediction. Brief Bioinform 2023:7187109. [PMID: 37258453 DOI: 10.1093/bib/bbad201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2022] [Revised: 04/16/2023] [Accepted: 05/08/2023] [Indexed: 06/02/2023]
Abstract
Protein is the most important component in organisms and plays an indispensable role in life activities. In recent years, a large number of intelligent methods have been proposed to predict protein function. These methods obtain different types of protein information, including sequence, structure and interaction network. Among them, protein sequences have gained significant attention where methods are investigated to extract the information from different views of features. However, how to fully exploit the views for effective protein sequence analysis remains a challenge. In this regard, we propose a multi-view, multi-scale and multi-attention deep neural model (MMSMA) for protein function prediction. First, MMSMA extracts multi-view features from protein sequences, including one-hot encoding features, evolutionary information features, deep semantic features and overlapping property features based on physicochemistry. Second, a specific multi-scale multi-attention deep network model (MSMA) is built for each view to realize the deep feature learning and preliminary classification. In MSMA, both multi-scale local patterns and long-range dependence from protein sequences can be captured. Third, a multi-view adaptive decision mechanism is developed to make a comprehensive decision based on the classification results of all the views. To further improve the prediction performance, an extended version of MMSMA, MMSMAPlus, is proposed to integrate homology-based protein prediction under the framework of the multi-view deep neural model. Experimental results show that MMSMAPlus has promising performance and is significantly superior to the state-of-the-art methods. The source code can be found at https://github.com/wzy-2020/MMSMAPlus.
Affiliation(s)
- Zhongyu Wang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Zhaohong Deng
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Wei Zhang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Qiongdan Lou
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Zhisheng Wei
  - National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China
- Lei Wang
  - National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China
- Jing Wu
  - National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China

18
Boadu F, Cao H, Cheng J. Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.17.524477. [PMID: 36711471 PMCID: PMC9882282 DOI: 10.1101/2023.01.17.524477] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/21/2023]
Abstract
Motivation Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time-consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, far fewer methods leverage protein structures in protein function prediction because there was a lack of accurate protein structures for most proteins until recently. Results We developed TransFun, a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy. Availability The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun. Contact chengji@missouri.edu.
Affiliation(s)
- Frimpong Boadu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
- Hongyuan Cao
  - Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
- Jianlin Cheng
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA

19
Systems biology's role in leveraging microalgal biomass potential: Current status and future perspectives. ALGAL RES 2022. [DOI: 10.1016/j.algal.2022.102963] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/31/2022]

20
Zhu YH, Zhang C, Yu DJ, Zhang Y. Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction. PLoS Comput Biol 2022; 18:e1010793. [PMID: 36548439 PMCID: PMC9822105 DOI: 10.1371/journal.pcbi.1010793] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 07/07/2022] [Revised: 01/06/2023] [Accepted: 12/05/2022] [Indexed: 12/24/2022]
Abstract
Accurate identification of protein function is critical to elucidate life mechanisms and design new drugs. We proposed a novel deep-learning method, ATGO, to predict Gene Ontology (GO) attributes of proteins through a triplet neural-network architecture embedded with pre-trained language models from protein sequences. The method was systematically tested on 1068 non-redundant benchmarking proteins and 3328 targets from the third Critical Assessment of Protein Function Annotation (CAFA) challenge. Experimental results showed that ATGO achieved a significant increase of the GO prediction accuracy compared to the state-of-the-art approaches in all aspects of molecular function, biological process, and cellular component. Detailed data analyses showed that the major advantage of ATGO lies in the utilization of pre-trained transformer language models which can extract discriminative functional pattern from the feature embeddings. Meanwhile, the proposed triplet network helps enhance the association of functional similarity with feature similarity in the sequence embedding space. In addition, it was found that the combination of the network scores with the complementary homology-based inferences could further improve the accuracy of the predicted models. These results demonstrated a new avenue for high-accuracy deep-learning function prediction that is applicable to large-scale protein function annotations from sequence alone.
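The triplet idea mentioned in this abstract, pulling embeddings of functionally similar proteins together and pushing dissimilar ones apart, can be illustrated with a generic margin-based triplet loss. The sketch below is not ATGO's implementation: the function names, the Euclidean distance, and the margin value are assumptions chosen only for illustration.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # hypothetical default, not taken from the paper
) -> float:
    """Generic triplet loss: zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

In this toy form, a triplet contributes no loss when the protein with a different function already lies far enough from the anchor in embedding space, so training effort concentrates on triplets that still violate the margin.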
Affiliation(s)
- Yi-Heng Zhu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, People’s Republic of China
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, People’s Republic of China
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan, United States of America

21
Gupta N, Reddy K, Gnanasekaran P, Zhai Y, Chakraborty S, Pappu HR. Functional characterization of a new ORF βV1 encoded by radish leaf curl betasatellite. FRONTIERS IN PLANT SCIENCE 2022; 13:972386. [PMID: 36212370 PMCID: PMC9546537 DOI: 10.3389/fpls.2022.972386] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/18/2022] [Accepted: 08/10/2022] [Indexed: 05/26/2023]
Abstract
Whitefly-transmitted begomoviruses infect and damage a wide range of food, feed, and fiber crops worldwide. Some of these viruses are associated with betasatellite molecules that are known to enhance viral pathogenesis. In this study, we investigated the function of a novel βV1 protein encoded by radish leaf curl betasatellite (RaLCB) by overexpressing the protein using potato virus X (PVX)-based virus vector in Nicotiana benthamiana. βV1 protein induced lesions on leaves, suggestive of hypersensitive response (HR), indicating cell death. The HR reaction induced by βV1 protein was accompanied by an increased accumulation of reactive oxygen species (ROS), free radicals, and HR-related transcripts. Subcellular localization through confocal microscopy revealed that βV1 protein localizes to the cellular periphery. βV1 was also found to interact with replication enhancer protein (AC3) of helper virus in the nucleus. The current findings suggest that βV1 functions as a protein elicitor and a pathogenicity determinant.
Affiliation(s)
- Neha Gupta
  - Molecular Virology Laboratory, School of Life Sciences, Jawaharlal Nehru University, New Delhi, India
  - Department of Plant Pathology, Washington State University, Pullman, WA, United States
- Kishorekumar Reddy
  - Molecular Virology Laboratory, School of Life Sciences, Jawaharlal Nehru University, New Delhi, India
- Prabu Gnanasekaran
  - Department of Plant Pathology, Washington State University, Pullman, WA, United States
- Ying Zhai
  - Department of Plant Pathology, Washington State University, Pullman, WA, United States
- Supriya Chakraborty
  - Molecular Virology Laboratory, School of Life Sciences, Jawaharlal Nehru University, New Delhi, India
- Hanu R. Pappu
  - Department of Plant Pathology, Washington State University, Pullman, WA, United States

22
Ramola R, Friedberg I, Radivojac P. The field of protein function prediction as viewed by different domain scientists. BIOINFORMATICS ADVANCES 2022; 2:vbac057. [PMID: 36699361 PMCID: PMC9710704 DOI: 10.1093/bioadv/vbac057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Accepted: 08/14/2022] [Indexed: 01/28/2023]
Abstract
Motivation Experimental biologists, biocurators, and computational biologists all play a role in characterizing a protein's function. The discovery of protein function in the laboratory by experimental scientists is the foundation of our knowledge about proteins. Experimental findings are compiled in knowledgebases by biocurators to provide standardized, readily accessible, and computationally amenable information. Computational biologists train their methods using these data to predict protein function and guide subsequent experiments. To understand the state of affairs in this ecosystem, centered here around protein function prediction, we surveyed scientists from these three constituent communities. Results We show that the three communities have common but also idiosyncratic perspectives on the field. Most strikingly, experimentalists rarely use state-of-the-art prediction software, but when presented with predictions, report many to be surprising and useful. Ontologies appear to be highly valued by biocurators, less so by experimentalists and computational biologists, yet controlled vocabularies bridge the communities and simplify the prediction task. Additionally, many software tools are not readily accessible and the predictions presented to the users can be broad and uninformative. We conclude that to meet both the social and technical challenges in the field, a more productive and meaningful interaction between members of the core communities is necessary. Availability and implementation Data cannot be shared for ethical/privacy reasons. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Rashika Ramola
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA

23
Almutairi ZM. In Silico Identification and Characterization of B12D Family Proteins in Viridiplantae. Evol Bioinform Online 2022; 18:11769343221106795. [PMID: 35721582 PMCID: PMC9201304 DOI: 10.1177/11769343221106795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2022] [Accepted: 05/12/2022] [Indexed: 11/16/2022]
Abstract
B12D family proteins are transmembrane proteins that contain the B12D domain involved in membrane trafficking. Plants comprise several members of the B12D family, but the number of members and their specific functions have not been determined. This study aims to identify and characterize the members of the B12D protein family in plants. The Phytozome database was searched for B12D proteins from 14 species. In total, 66 B12D proteins were analyzed in silico for gene structure, motifs, gene expression, duplication events, and phylogenetics. In general, B12D proteins are between 86 and 98 aa in length, have 2 or 3 exons, and comprise a single transmembrane helix. Motif prediction and multiple sequence alignment show strong conservation among B12D proteins of 11 flowering plant species. Despite that, the phylogenetic tree revealed a distinct cluster of 16 B12D proteins that have high conservation across flowering plants. Motif prediction revealed a 41 aa motif, conserved in 58 of the analyzed B12D proteins and similar to the bZIP motif, confirming the predicted biological process and molecular function of B12D proteins as DNA-binding proteins. Cis-regulatory element screening in putative B12D promoters found various responsive elements for light, abscisic acid, methyl jasmonate, cytokinin, drought, and heat. There are also specific elements for cold stress, cell cycle, circadian control, auxin, salicylic acid, and gibberellic acid in the promoters of a few B12D genes, indicating functional diversification among B12D family members. The digital expression shows that B12D genes of Glycine max have similar expression patterns consistent with their clustering in the phylogenetic tree. However, the expression of B12D genes of Hordeum vulgare appears inconsistent with their clustering in the tree. Despite the strong conservation of the B12D proteins of Viridiplantae, gene association analysis, promoter analysis, and digital expression indicate different roles for the members of the B12D family during plant developmental stages.
Affiliation(s)
- Zainab M Almutairi
  - Department of Biology, College of Science and Humanities in Al-Kharj, Prince Sattam bin Abdulaziz University, Al-kharj, Saudi Arabia

24
Lai B, Xu J. Accurate protein function prediction via graph attention networks with predicted structure information. Brief Bioinform 2022; 23:bbab502. [PMID: 34882195 PMCID: PMC8898000 DOI: 10.1093/bib/bbab502] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Received: 08/14/2021] [Revised: 10/13/2021] [Accepted: 11/02/2021] [Indexed: 12/27/2022]
Abstract
Experimental protein function annotation does not scale with the fast-growing sequence databases. Only a tiny fraction (<0.1%) of protein sequences has experimentally determined functional annotations. Computational methods may predict protein function very quickly, but their accuracy is not very satisfactory. Based upon recent breakthroughs in protein structure prediction and protein language models, we develop GAT-GO, a graph attention network (GAT) method that may substantially improve protein function prediction by leveraging predicted structure information and protein sequence embedding. Our experimental results show that GAT-GO greatly outperforms the latest sequence- and structure-based deep learning methods. On the PDB-mmseqs testset where the train and test proteins share <15% sequence identity, our GAT-GO yields Fmax (maximum F-score) 0.508, 0.416, 0.501, and area under the precision-recall curve (AUPRC) 0.427, 0.253, 0.411 for the MFO, BPO, CCO ontology domains, respectively, much better than the homology-based method BLAST (Fmax 0.117, 0.121, 0.207 and AUPRC 0.120, 0.120, 0.163) that does not use any structure information. On the PDB-cdhit testset where the training and test proteins are more similar, although using predicted structure information, our GAT-GO obtains Fmax 0.637, 0.501, 0.542 for the MFO, BPO, CCO ontology domains, respectively, and AUPRC 0.662, 0.384, 0.481, significantly exceeding the just-published method DeepFRI that uses experimental structures, which has Fmax 0.542, 0.425, 0.424 and AUPRC only 0.313, 0.159, 0.193.
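For readers unfamiliar with the Fmax metric quoted in this abstract, the following is a minimal sketch of the protein-centric maximum F-score as it is commonly defined in CAFA-style evaluations. It is not the GAT-GO evaluation code: the names `fmax`, `pred` and `truth` are hypothetical, the 100-step threshold grid is an assumption, and propagation of GO terms up the ontology is deliberately omitted.

```python
from typing import Dict, Set


def fmax(
    pred: Dict[str, Dict[str, float]],  # protein -> {GO term: score in [0, 1]}
    truth: Dict[str, Set[str]],         # protein -> set of true GO terms
    steps: int = 100,                   # assumed threshold grid: 0.01 .. 1.00
) -> float:
    """Best F-score over thresholds. Precision is averaged over proteins with
    at least one prediction at the threshold; recall over all proteins."""
    best = 0.0
    for i in range(1, steps + 1):
        t = i / steps
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = pred.get(protein, {})
            called = {term for term, s in scores.items() if s >= t}
            tp = len(called & true_terms)
            if called:
                precisions.append(tp / len(called))
            if true_terms:
                recall_sum += tp / len(true_terms)
        if not precisions or not truth:
            continue
        p = sum(precisions) / len(precisions)
        r = recall_sum / len(truth)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```

Because the threshold is chosen after the fact to maximize the score, Fmax is an optimistic summary of a predictor, which is one reason abstracts such as this one report AUPRC alongside it.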
Affiliation(s)
- Boqiao Lai
  - Toyota Technological Institute at Chicago, Chicago, IL 60637, USA
- Jinbo Xu
  - Toyota Technological Institute at Chicago, Chicago, IL 60637, USA

25
Matkovic R, Morel M, Lanciano S, Larrous P, Martin B, Bejjani F, Vauthier V, Hansen MMK, Emiliani S, Cristofari G, Gallois-Montbrun S, Margottin-Goguet F. TASOR epigenetic repressor cooperates with a CNOT1 RNA degradation pathway to repress HIV. Nat Commun 2022; 13:66. [PMID: 35013187 PMCID: PMC8748822 DOI: 10.1038/s41467-021-27650-5] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 11/13/2020] [Accepted: 11/30/2021] [Indexed: 12/17/2022]
Abstract
The Human Silencing Hub (HUSH) complex constituted of TASOR, MPP8 and Periphilin recruits the histone methyl-transferase SETDB1 to spread H3K9me3 repressive marks across genes and transgenes in an integration site-dependent manner. The deposition of these repressive marks leads to heterochromatin formation and inhibits gene expression, but the underlying mechanism is not fully understood. Here, we show that TASOR silencing or HIV-2 Vpx expression, which induces TASOR degradation, increases the accumulation of transcripts derived from the HIV-1 LTR promoter at a post-transcriptional level. Furthermore, using a yeast 2-hybrid screen, we identify new TASOR partners involved in RNA metabolism including the RNA deadenylase CCR4-NOT complex scaffold CNOT1. TASOR and CNOT1 synergistically repress HIV expression from its LTR. Similar to the RNA-induced transcriptional silencing complex found in fission yeast, we show that TASOR interacts with the RNA exosome and RNA Polymerase II, predominantly under its elongating state. Finally, we show that TASOR facilitates the association of RNA degradation proteins with RNA polymerase II and is detected at transcriptional centers. Altogether, we propose that HUSH operates at the transcriptional and post-transcriptional levels to repress HIV proviral expression.
Affiliation(s)
- Roy Matkovic
- Université de Paris, Institut Cochin, INSERM, CNRS, 75014, Paris, France
- Marina Morel
- Université de Paris, Institut Cochin, INSERM, CNRS, 75014, Paris, France
- Pauline Larrous
- Université de Paris, Institut Cochin, INSERM, CNRS, 75014, Paris, France
- Benjamin Martin
- Université de Paris, Institut Cochin, INSERM, CNRS, 75014, Paris, France
- Fabienne Bejjani
- Université de Paris, Institut Cochin, INSERM, CNRS, 75014, Paris, France
- Virginie Vauthier
- Université de Paris, Institut Cochin, INSERM, CNRS, 75014, Paris, France
- Maike M K Hansen
- Institute for Molecules and Materials, Radboud University, 6525 AM, Nijmegen, The Netherlands
- Stéphane Emiliani
- Université de Paris, Institut Cochin, INSERM, CNRS, 75014, Paris, France
|
26
|
Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022; 23:40-55. [PMID: 34518686 DOI: 10.1038/s41580-021-00407-0] [Citation(s) in RCA: 782] [Impact Index Per Article: 260.7] [Accepted: 07/23/2021] [Indexed: 02/08/2023]
Abstract
The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.
Affiliation(s)
- Joe G Greener
- Department of Computer Science, University College London, London, UK
- Shaun M Kandathil
- Department of Computer Science, University College London, London, UK
- Lewis Moffat
- Department of Computer Science, University College London, London, UK
- David T Jones
- Department of Computer Science, University College London, London, UK
|
27
|
Törönen P, Holm L. PANNZER-A practical tool for protein function prediction. Protein Sci 2022; 31:118-128. [PMID: 34562305 PMCID: PMC8740830 DOI: 10.1002/pro.4193] [Citation(s) in RCA: 74] [Impact Index Per Article: 24.7] [Received: 08/03/2021] [Revised: 09/22/2021] [Accepted: 09/22/2021] [Indexed: 01/03/2023]
Abstract
The facility of next-generation sequencing has led to an explosion of gene catalogs for novel genomes, transcriptomes and metagenomes, which are functionally uncharacterized. Computational inference has emerged as a necessary substitute for first-hand experimental evidence. PANNZER (Protein ANNotation with Z-scoRE) is a high-throughput functional annotation web server that stands out among similar publicly accessible web servers in supporting submission of up to 100,000 protein sequences at once and providing both Gene Ontology (GO) annotations and free text description predictions. Here, we demonstrate the use of PANNZER and discuss future plans and challenges. We present two case studies to illustrate problems related to data quality and method evaluation. Some commonly used evaluation metrics and evaluation datasets promote methods that favor unspecific and broad functional classes over more informative and specific classes. We argue that this can bias the development of automated function prediction methods. The PANNZER web server and source code are available at http://ekhidna2.biocenter.helsinki.fi/sanspanz/.
Affiliation(s)
- Petri Törönen
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland
- Liisa Holm
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland
- Organismal and Evolutionary Biology Research Program, Faculty of Biosciences, University of Helsinki, Helsinki, Finland
|
28
|
Protein function prediction using functional inter-relationship. Comput Biol Chem 2021; 95:107593. [PMID: 34736126 DOI: 10.1016/j.compbiolchem.2021.107593] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2021] [Revised: 08/25/2021] [Accepted: 10/03/2021] [Indexed: 11/23/2022]
Abstract
With the growth of high-throughput sequencing techniques, the generation of protein sequences has become fast and cheap, leading to a huge increase in the number of known proteins. However, it is challenging to identify the functions performed by these newly discovered proteins. Machine learning techniques have improved the efficiency of traditional methods by suggesting relevant functions, but they fail to perform well when the number of functions to be predicted becomes large. In this work, we propose a machine learning-based approach to predict a huge set of protein functions that uses the inter-relationships between functions to improve the model's predictability. These inter-relationships are used to reduce the redundancy caused by highly correlated functions. The proposed model is trained on the reduced set of non-redundant functions, which limits the ambiguity caused by inter-related functions. Here, we use two statistical measures of correlation to remove redundant functions: (1) Pearson's correlation coefficient and (2) the Jaccard similarity coefficient. To have a fair evaluation of the proposed model, we recreate our original function set by inverse transforming the reduced set using the two proposed approaches: direct mapping and an ensemble approach. The model is tested using different feature sets and function sets of biological processes and molecular functions and gives promising results on the DeepGO and CAFA3 datasets. The proposed model is able to predict specific functions for the test data that were not predicted by the other compared methods. The experimental models, code and other relevant data are available at https://github.com/richadhanuka/PFP-using-Functional-interrelationship.
|
29
|
Zhang F, Song H, Zeng M, Wu FX, Li Y, Pan Y, Li M. A Deep Learning Framework for Gene Ontology Annotations With Sequence- and Network-Based Information. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2208-2217. [PMID: 31985440 DOI: 10.1109/tcbb.2020.2968882] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Indexed: 06/10/2023]
Abstract
Knowledge of protein functions plays an important role in biology and medicine. With the rapid development of high-throughput technologies, a huge number of proteins have been discovered. However, a great number of proteins lack functional annotations. A protein usually has multiple functions, and some functions or biological processes require interactions among multiple proteins. Additionally, Gene Ontology provides a useful classification for protein functions and contains more than 40,000 terms. We propose a deep learning framework called DeepGOA to predict protein functions from protein sequences and protein-protein interaction (PPI) networks. For protein sequences, we extract two types of information: sequence semantic information and subsequence-based features. We use the word2vec technique to numerically represent protein sequences, and utilize a bidirectional long short-term memory network (Bi-LSTM) and a multi-scale convolutional neural network (multi-scale CNN) to obtain the global and local semantic features of protein sequences, respectively. Additionally, we use the InterPro tool to scan protein sequences and extract subsequence-based information, such as domains and motifs. This information is then fed into a neural network to generate high-quality features. For the PPI network, the DeepWalk algorithm is applied to generate embeddings of the PPI network. The two types of features are then concatenated to predict protein functions. To evaluate the performance of DeepGOA, several different evaluation methods and metrics are utilized. The experimental results show that DeepGOA outperforms DeepGO and BLAST.
|
30
|
Elhaj-Abdou MEM, El-Dib H, El-Helw A, El-Habrouk M. Deep_CNN_LSTM_GO: Protein function prediction from amino-acid sequences. Comput Biol Chem 2021; 95:107584. [PMID: 34601431 DOI: 10.1016/j.compbiolchem.2021.107584] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/05/2021] [Revised: 09/08/2021] [Accepted: 09/21/2021] [Indexed: 11/15/2022]
Abstract
Protein amino acid sequences can be used to determine the functions of a protein. However, determining the function of a single protein requires many resources and a tremendous amount of time. Computational intelligence methods such as deep learning have been shown to predict protein functions. This paper proposes a hybrid deep neural network model, named Deep_CNN_LSTM_GO, to predict an unknown protein's functions from its sequence. Deep_CNN_LSTM_GO integrates a convolutional neural network (CNN) and a long short-term memory (LSTM) network to learn features from amino acid sequences and outputs predictions for the three Gene Ontology (GO) sub-ontologies. The Gene Ontology represents protein functions in three sub-ontologies: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). The proposed model has been trained and tested using the UniProt-SwissProt dataset. Another test has been done using the Critical Assessment of Functional Annotation (CAFA) data on the three sub-ontologies. The proposed model outperforms other methods proposed in the field on three evaluation metrics (Fmax, Smin, and AUPR) in the three sub-ontologies (MF, BP, CC).
Affiliation(s)
- Mohamed E M Elhaj-Abdou
- Faculty of Engineering, Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt
- Hassan El-Dib
- Faculty of Engineering, Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt
- Amr El-Helw
- Faculty of Engineering, Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt
|
31
|
Vu TTD, Jung J. Protein function prediction with gene ontology: from traditional to deep learning models. PeerJ 2021; 9:e12019. [PMID: 34513334 PMCID: PMC8395570 DOI: 10.7717/peerj.12019] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/13/2021] [Accepted: 07/29/2021] [Indexed: 11/25/2022] Open
Abstract
Protein function prediction is a crucial part of genome annotation. Prediction methods have recently witnessed rapid development, owing to the emergence of high-throughput sequencing technologies. Among the available databases for identifying protein function terms, Gene Ontology (GO) is an important resource that describes the functional properties of proteins. Researchers are employing various approaches to efficiently predict GO terms. Meanwhile, deep learning, a fast-evolving data-driven discipline, exhibits impressive potential for assigning GO terms to amino acid sequences. Herein, we review the currently available computational GO annotation methods for proteins, ranging from conventional to deep learning approaches. Further, we select some suitable predictors from among the reviewed tools and conduct a mini comparison of their performance using a worldwide challenge dataset. Finally, we discuss the remaining major challenges in the field and emphasize future directions for protein function prediction with GO.
Affiliation(s)
- Thi Thuy Duong Vu
- Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
- Jaehee Jung
- Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
|
32
|
Unveiling the structure of GPI-anchored protein of Malassezia globosa and its pathogenic role in pityriasis versicolor. J Mol Model 2021; 27:246. [PMID: 34379190 DOI: 10.1007/s00894-021-04853-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/22/2021] [Accepted: 07/07/2021] [Indexed: 10/20/2022]
Abstract
Glycosylphosphatidylinositol (GPI)-anchored proteins (GpiPs) are involved in cell wall biogenesis, adhesion, interactions, protease activity, mating, etc. These proteins have been identified in many organisms, including fungi such as Neurospora crassa, Candida albicans, Saccharomyces cerevisiae, and Fusarium graminearum. The MGL_3153 gene of Malassezia globosa (M. globosa) encodes a protein that is homologous to the GpiPs of M. restricta, M. sympodialis, M. pachydermatis, and U. maydis. A real-time PCR assay showed that the expression of the MGL_3153 gene was significantly up-regulated in M. globosa isolated from patients with pityriasis versicolor (PV) compared to a healthy individual, suggesting that this gene contributes to the virulence of M. globosa. Accordingly, the sequence of this protein was analyzed with bioinformatics tools to evaluate its structure. Conservation analysis of the MGL_3153 protein showed that the C-terminal region, which is responsible for GPI-anchor ligation, was highly conserved during evolution, while the N-terminal region was conserved only in Malassezia species. Moreover, the tertiary structure predicted by homology modeling showed that this protein is predominantly alpha-helical and remained stable during 150 ns of molecular dynamics simulation. Our results reveal that this protein potentially belongs to the GPI-anchored proteins and may contribute to the virulence of M. globosa, which warrants further investigation in this area.
|
33
|
Kulmanov M, Smaili FZ, Gao X, Hoehndorf R. Semantic similarity and machine learning with ontologies. Brief Bioinform 2021; 22:bbaa199. [PMID: 33049044 PMCID: PMC8293838 DOI: 10.1093/bib/bbaa199] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 05/07/2020] [Revised: 08/03/2020] [Accepted: 08/04/2020] [Indexed: 12/13/2022] Open
Abstract
Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge, and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview of the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
Affiliation(s)
- Xin Gao
- Computational Bioscience Research Center and lead of the Structural and Functional Bioinformatics Group at King Abdullah University of Science and Technology
|
34
|
Kulmanov M, Zhapa-Camacho F, Hoehndorf R. DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web. Nucleic Acids Res 2021; 49:W140-W146. [PMID: 34019664 PMCID: PMC8262746 DOI: 10.1093/nar/gkab373] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 02/22/2021] [Revised: 04/18/2021] [Accepted: 04/26/2021] [Indexed: 11/24/2022] Open
Abstract
Understanding the functions of proteins is crucial to understanding biological processes on a molecular level. Many more protein sequences are available than can be investigated experimentally. DeepGOPlus is a protein function prediction method based on deep learning and sequence similarity. DeepGOWeb makes the prediction model available through a website, an API, and through the SPARQL query language for interoperability with databases that rely on Semantic Web technologies. DeepGOWeb provides accurate and fast predictions and ensures that predicted functions are consistent with the Gene Ontology; it can provide predictions for any protein and any function in Gene Ontology. DeepGOWeb is freely available at https://deepgo.cbrc.kaust.edu.sa/.
Affiliation(s)
- Maxat Kulmanov
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
- Fernando Zhapa-Camacho
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
- Robert Hoehndorf
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
|
35
|
Structure-based protein function prediction using graph convolutional networks. Nat Commun 2021; 12:3168. [PMID: 34039967 PMCID: PMC8155034 DOI: 10.1038/s41467-021-23303-9] [Citation(s) in RCA: 337] [Impact Index Per Article: 84.3] [Received: 09/21/2020] [Accepted: 04/22/2021] [Indexed: 02/04/2023] Open
Abstract
The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. Here, we introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Augmenting the training set of experimental structures with homology models allows us to significantly expand the number of predictable functions. DeepFRI has significant de-noising capability, with only a minor drop in performance when experimental structures are replaced by protein models. Class activation mapping allows function predictions at an unprecedented resolution, allowing site-specific annotations at the residue-level in an automated manner. We show the utility and high performance of our method by annotating structures from the PDB and SWISS-MODEL, making several new confident function predictions. DeepFRI is available as a webserver at https://beta.deepfri.flatironinstitute.org/.
|
36
|
Villegas-Morcillo A, Makrodimitris S, van Ham RCHJ, Gomez AM, Sanchez V, Reinders MJT. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 2021; 37:162-170. [PMID: 32797179 PMCID: PMC8055213 DOI: 10.1093/bioinformatics/btaa701] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 04/08/2020] [Revised: 07/10/2020] [Accepted: 08/12/2020] [Indexed: 12/19/2022] Open
Abstract
MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. RESULTS: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Amelia Villegas-Morcillo
- Department of Signal Theory, Telematics and Communications, University of Granada, 18071 Granada, Spain
- Stavros Makrodimitris
- Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
- Keygene N.V., 6708PW Wageningen, The Netherlands
- Roeland C H J van Ham
- Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
- Keygene N.V., 6708PW Wageningen, The Netherlands
- Angel M Gomez
- Department of Signal Theory, Telematics and Communications, University of Granada, 18071 Granada, Spain
- Victoria Sanchez
- Department of Signal Theory, Telematics and Communications, University of Granada, 18071 Granada, Spain
- Marcel J T Reinders
- Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, 2333ZC Leiden, The Netherlands
|
37
|
Ofer D, Brandes N, Linial M. The language of proteins: NLP, machine learning & protein sequences. Comput Struct Biotechnol J 2021; 19:1750-1758. [PMID: 33897979 PMCID: PMC8050421 DOI: 10.1016/j.csbj.2021.03.022] [Citation(s) in RCA: 145] [Impact Index Per Article: 36.3] [Received: 01/28/2021] [Revised: 03/19/2021] [Accepted: 03/19/2021] [Indexed: 12/12/2022] Open
Abstract
Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
Affiliation(s)
- Nadav Brandes
- The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
- Michal Linial
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
|
38
|
Seyyedsalehi SF, Soleymani M, Rabiee HR, Mofrad MRK. PFP-WGAN: Protein function prediction by discovering Gene Ontology term correlations with generative adversarial networks. PLoS One 2021; 16:e0244430. [PMID: 33630862 PMCID: PMC7906332 DOI: 10.1371/journal.pone.0244430] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/25/2020] [Accepted: 12/09/2020] [Indexed: 12/12/2022] Open
Abstract
Understanding the functionality of proteins has emerged as a critical problem in recent years due to the significant roles of these macromolecules in biological mechanisms. However, in-laboratory techniques for protein function prediction are not as efficient as methods developed and processed for protein sequencing. While more than 70 million protein sequences are available today, the functionality of only around one percent of them is known. These facts have encouraged researchers to develop computational methods to infer protein functionalities from their sequences. Gene Ontology is the most well-known database for protein functions; it has a hierarchical structure, where deeper terms are more determinative and specific. However, the lack of experimentally approved annotations for these specific terms limits the performance of computational methods applied to them. In this work, we propose a method to improve protein function prediction from sequences by deeply extracting relationships between Gene Ontology terms. To this end, we construct a conditional generative adversarial network which helps to effectively discover and incorporate term correlations in the annotation process. In addition to the baseline algorithms, we compare our method with two recently proposed deep techniques that attempt to utilize Gene Ontology term correlations. Our results confirm the superiority of the proposed method compared to the previous works. Moreover, we demonstrate how our model can effectively help to assign more specific terms to sequences.
Affiliation(s)
- Seyyede Fatemeh Seyyedsalehi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Department of Mechanical Engineering, University of California Berkeley, Berkeley, California, United States of America
- Mahdieh Soleymani
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Hamid R. Rabiee
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Mohammad R. K. Mofrad
- Department of Mechanical Engineering, University of California Berkeley, Berkeley, California, United States of America
|
39
|
Barot M, Gligorijević V, Cho K, Bonneau R. NetQuilt: Deep Multispecies Network-based Protein Function Prediction using Homology-informed Network Similarity. Bioinformatics 2021; 37:2414-2422. [PMID: 33576802 PMCID: PMC8388039 DOI: 10.1093/bioinformatics/btab098] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/18/2020] [Revised: 02/04/2021] [Accepted: 02/09/2021] [Indexed: 02/02/2023] Open
Abstract
Motivation: Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to protein functional annotation use sequence similarity to transfer knowledge between species. These approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular context for meaningful prediction. To supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, most of these methods are tied to a network for a single species, and many species lack biological networks. Results: In this work, we integrate sequence and network information across multiple species by computing IsoRank similarity scores to create a meta-network profile of the proteins of multiple species. We use this integrated multispecies meta-network as input to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and consequently leads to significant improvements in function prediction performance compared to two network-based methods, a deep learning sequence-based method and the BLAST annotation method used in the Critical Assessment of Functional Annotation. We are able to demonstrate that our approach performs well even in cases where a species has no network information available: when an organism's PPI network is left out, we can use our multispecies method to make predictions for the left-out organism with good performance. Availability and implementation: The code is freely available at https://github.com/nowittynamesleft/NetQuilt. The data, including sequences, PPI networks and GO annotations, are available at https://string-db.org/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Meet Barot
- Center for Data Science, New York University, New York, 10011, USA
- Kyunghyun Cho
- Center for Data Science, New York University, New York, 10011, USA
- Richard Bonneau
- Center for Data Science, New York University, New York, 10011, USA
- Center for Computational Biology, Flatiron Institute, New York, 10010, USA
|
40
|
Venko K, Novič M, Stoka V, Žerovnik E. Prediction of Transmembrane Regions, Cholesterol, and Ganglioside Binding Sites in Amyloid-Forming Proteins Indicate Potential for Amyloid Pore Formation. Front Mol Neurosci 2021; 14:619496. [PMID: 33642992 PMCID: PMC7902868 DOI: 10.3389/fnmol.2021.619496] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/20/2020] [Accepted: 01/12/2021] [Indexed: 12/26/2022] Open
Abstract
Besides amyloid fibrils, amyloid pores (APs) represent another mechanism of amyloid-induced toxicity. Since the hypothesis put forward by Arispe and colleagues in 1993 that amyloid-beta makes ion-conducting channels and that Alzheimer's disease may be due to the toxic effect of these channels, many studies have confirmed that APs are formed by prefibrillar oligomers of amyloidogenic proteins and are a common source of cytotoxicity. The mechanism of pore formation is still not well understood, and the structure and imaging of APs in living cells remain an open issue. To get closer to understanding AP formation, we used predictive methods to assess the propensity of a set of 30 amyloid-forming proteins (AFPs) to form transmembrane channels. A range of amino-acid sequence tools were applied to predict AP domains of AFPs, and provided context on future experiments that are needed to contribute toward a deeper understanding of amyloid toxicity. In a set of 30 AFPs we predicted their amyloidogenic propensity, presence of transmembrane (TM) regions, and cholesterol (CBM) and ganglioside binding motifs (GBM), to which the oligomers likely bind. Notably, all pathological AFPs share the presence of TM, CBM, and GBM regions, whereas the functional amyloids seem to show just one of these regions. For comparative purposes, we also analyzed a few examples of amyloid proteins that behave as biologically non-relevant AFPs. Based on the known experimental data on β-amyloid and α-synuclein pore formation, we suggest that many AFPs have the potential for pore formation. Oligomerization and the transition from α-TM helix to β-TM strands on lipid rafts seem to be the common key events.
Affiliation(s)
- Katja Venko
  - Theory Department, National Institute of Chemistry, Ljubljana, Slovenia
- Marjana Novič
  - Theory Department, National Institute of Chemistry, Ljubljana, Slovenia
- Veronika Stoka
  - Department of Biochemistry and Molecular and Structural Biology, Jožef Stefan Institute, Ljubljana, Slovenia
- Eva Žerovnik
  - Department of Biochemistry and Molecular and Structural Biology, Jožef Stefan Institute, Ljubljana, Slovenia
|
41
|
Littmann M, Heinzinger M, Dallago C, Olenyi T, Rost B. Embeddings from deep learning transfer GO annotations beyond homology. Sci Rep 2021; 11:1160. [PMID: 33441905 PMCID: PMC7806674 DOI: 10.1038/s41598-020-80786-0] [Citation(s) in RCA: 82] [Impact Index Per Article: 20.5] [Received: 09/07/2020] [Accepted: 12/24/2020] [Indexed: 11/09/2022] Open
Abstract
Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer by identifying sequence-similar proteins with known function or through prediction methods using evolutionary information. Here, we propose predicting GO terms through annotation transfer based on proximity of proteins in the SeqVec embedding rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec) transferring the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.
Affiliation(s)
- Maria Littmann
  - Department of Informatics, Bioinformatics and Computational Biology, i12, TUM (Technical University of Munich), Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Michael Heinzinger
  - Department of Informatics, Bioinformatics and Computational Biology, i12, TUM (Technical University of Munich), Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Christian Dallago
  - Department of Informatics, Bioinformatics and Computational Biology, i12, TUM (Technical University of Munich), Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Tobias Olenyi
  - Department of Informatics, Bioinformatics and Computational Biology, i12, TUM (Technical University of Munich), Boltzmannstr. 3, Garching, 85748, Munich, Germany
- Burkhard Rost
  - Department of Informatics, Bioinformatics and Computational Biology, i12, TUM (Technical University of Munich), Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, Garching, 85748, Munich, Germany
  - School of Life Sciences Weihenstephan (TUM-WZW), TUM (Technical University of Munich), Alte Akademie 8, Freising, Germany
  - Department of Biochemistry and Molecular Biophysics, Columbia University, 701 West, 168th Street, New York, NY, 10032, USA
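The annotation-transfer idea in the abstract above (copy GO terms from the annotated proteins that lie closest to the query in embedding space) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the 3-dimensional toy vectors stand in for real SeqVec embeddings, and the function name `transfer_go_terms`, the lookup layout, and the `1 / (1 + distance)` score are all assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_go_terms(query_vec, lookup, k=1):
    """Copy GO terms from the k annotated proteins nearest to the query.

    `lookup` maps a protein id to {"embedding": [...], "go_terms": {...}}.
    The score 1 / (1 + distance) is a hypothetical choice for this sketch;
    each term keeps the best score among the neighbours that carry it.
    """
    ranked = sorted(
        lookup.values(),
        key=lambda entry: euclidean(query_vec, entry["embedding"]),
    )
    scores = {}
    for entry in ranked[:k]:
        score = 1.0 / (1.0 + euclidean(query_vec, entry["embedding"]))
        for term in entry["go_terms"]:
            scores[term] = max(scores.get(term, 0.0), score)
    return scores


# Toy lookup set: invented proteins with made-up 3-d "embeddings".
lookup = {
    "protA": {"embedding": [0.0, 0.0, 1.0], "go_terms": {"GO:0003677"}},
    "protB": {"embedding": [1.0, 0.0, 0.0], "go_terms": {"GO:0016787"}},
    "protC": {"embedding": [0.1, 0.0, 0.9],
              "go_terms": {"GO:0003677", "GO:0006355"}},
}

query = [0.0, 0.1, 1.0]
predictions = transfer_go_terms(query, lookup, k=2)
for term, score in sorted(predictions.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 3))
```

With `k=2` the two nearest toy proteins are `protA` and `protC`, so only their terms are transferred; the distant `protB` contributes nothing. In practice the lookup set would hold embeddings for all experimentally annotated proteins.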
|
42
|
Wu F, Ma J, Cha Y, Lu D, Li Z, Zhuo M, Luo X, Li S, Zhu M. Using inexpensive substrate to achieve high-level lipase A secretion by Bacillus subtilis through signal peptide and promoter screening. Process Biochem 2020. [DOI: 10.1016/j.procbio.2020.08.010] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/14/2022]
|
43
|
Semwal R, Varadwaj PK. HumDLoc: Human Protein Subcellular Localization Prediction Using Deep Neural Network. Curr Genomics 2020; 21:546-557. [PMID: 33214771 PMCID: PMC7604748 DOI: 10.2174/1389202921999200528160534] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/10/2020] [Revised: 03/27/2020] [Accepted: 03/30/2020] [Indexed: 11/24/2022] Open
Abstract
Aims To develop a tool that can annotate subcellular localization of human proteins. Background With the progression of high throughput human proteomics projects, an enormous amount of protein sequence data has been discovered in the recent past. All these raw sequence data require precise mapping and annotation for their respective biological role and functional attributes. The functional characteristics of protein molecules are highly dependent on the subcellular localization/compartment. Therefore, a fully automated and reliable protein subcellular localization prediction system would be very useful for current proteomic research. Objective To develop a machine learning-based predictive model that can annotate the subcellular localization of human proteins with high accuracy and precision. Methods In this study, we used the PSI-CD-HIT homology criterion and utilized the sequence-based features of protein sequences to develop a powerful subcellular localization predictive model. The dataset used to train the HumDLoc model was extracted from a reliable data source, the UniProt knowledge base, which helps the model to generalize to unseen data. Results The proposed model, HumDLoc, was compared with two of the most widely used techniques, CELLO and DeepLoc, and with other machine learning-based tools. The results demonstrated promising predictive performance of the HumDLoc model based on various machine learning parameters such as accuracy (≥97.00%), precision (≥0.86), recall (≥0.89), MCC score (≥0.86), ROC curve (0.98 square unit), and precision-recall curve (0.93 square unit). Conclusion In conclusion, HumDLoc was able to outperform several alternative tools for correctly predicting subcellular localization of human proteins. HumDLoc is hosted as a web-based tool at https://bioserver.iiita.ac.in/HumDLoc/.
Affiliation(s)
- Rahul Semwal
  - Department of Information Technology (Bioinformatics), Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India
  - Department of Bioinformatics and Applied Science, Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India
- Pritish Kumar Varadwaj
  - Department of Information Technology (Bioinformatics), Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India
  - Department of Bioinformatics and Applied Science, Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India
|
44
|
Du Z, He Y, Li J, Uversky VN. DeepAdd: Protein function prediction from k-mer embedding and additional features. Comput Biol Chem 2020; 89:107379. [PMID: 33011616 DOI: 10.1016/j.compbiolchem.2020.107379] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/13/2019] [Revised: 09/15/2020] [Accepted: 09/17/2020] [Indexed: 10/23/2022]
Abstract
With the application of new high-throughput sequencing technology, a large number of protein sequences is becoming available. Determination of the functional characteristics of these proteins by experiments is an expensive endeavor that requires a lot of time. Furthermore, at the organismal level, such experimental functional analyses can be conducted only for a very few selected model organisms. Computational function prediction methods can be used to fill this gap. The functions of proteins are classified by Gene Ontology (GO), which contains more than 40,000 classifications in three domains: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). Additionally, since proteins have many functions, function prediction represents a multi-label and multi-class problem. We developed a new method to predict protein function from sequence. To this end, a natural language model was used to generate word embeddings of the sequence, from which features were learned by deep learning, together with additional features to locate every protein. Our method uses the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Critical Assessment of Functional Annotation (CAFA) and observe a noticeable improvement over several algorithms, such as FFPred, DeepGO, GoFDR and other methods compared on the CAFA3 datasets.
Affiliation(s)
- Zhihua Du
  - Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China
- Yufeng He
  - Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China
- Jianqiang Li
  - Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China
- Vladimir N Uversky
  - Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd. MDC07, Tampa, FL, USA
  - USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd. MDC07, Tampa, FL, USA
  - Laboratory of New Methods in Biology, Institute for Biological Instrumentation, Russian Academy of Sciences, Institutskaya Str., 7, Pushchino, Moscow Region, 142290, Russia
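The "word embedding of sequence" step mentioned in the abstract above starts from splitting a protein sequence into overlapping k-mers, which play the role of words. The following is a small sketch of that tokenization step only, under the assumption of a stride of one and k = 3; the function names and the integer vocabulary are illustrative and are not taken from DeepAdd itself.

```python
def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers with a stride of one."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(sequences, k=3):
    """Assign each distinct k-mer an integer id in order of first appearance."""
    vocab = {}
    for seq in sequences:
        for kmer in kmers(seq, k):
            vocab.setdefault(kmer, len(vocab))
    return vocab


def encode(sequence, vocab, k=3, unknown=-1):
    """Map a sequence to k-mer ids; unseen k-mers get the `unknown` id."""
    return [vocab.get(kmer, unknown) for kmer in kmers(sequence, k)]


# Toy peptide fragments, invented for this example.
training = ["MKTAY", "KTAYI"]
vocab = build_vocab(training)
encoded = encode("MKTAYI", vocab)
print(vocab)
print(encoded)
```

The resulting id sequence is what an embedding layer (or a word2vec-style model) would consume; how the ids are embedded and combined with the additional features is outside the scope of this sketch.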
|
45
|
Zhang C, Zheng W, Mortuza SM, Li Y, Zhang Y. DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins. Bioinformatics 2020; 36:2105-2112. [PMID: 31738385 DOI: 10.1093/bioinformatics/btz863] [Citation(s) in RCA: 110] [Impact Index Per Article: 22.0] [Received: 07/19/2019] [Revised: 10/17/2019] [Accepted: 11/15/2019] [Indexed: 12/23/2022] Open
Abstract
MOTIVATION The success of genome sequencing techniques has resulted in a rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information for modeling the structure and function of unknown proteins. There is, however, no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. RESULTS We developed DeepMSA, a new open-source method for sensitive MSA construction, which has homologous sequences and alignments created from multi-sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was utilized to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs were run for homologous structure identification, where the average TM-score of the template alignments increased by over 7.5% with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. It is noted that all these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. AVAILABILITY AND IMPLEMENTATION https://zhanglab.ccmb.med.umich.edu/DeepMSA/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Wei Zheng
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- S M Mortuza
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Yang Li
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
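The Q3 accuracy reported in the abstract above is the fraction of residues whose three-state secondary structure (helix H, strand E, coil C) is predicted correctly. A short sketch of that calculation follows; the two state strings are invented for illustration and do not come from the DeepMSA benchmark.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose 3-state label (H, E, or C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Invented 15-residue example; two positions disagree.
observed = "CCHHHHHHCCEEEEC"
predicted = "CCHHHHHCCCEEEEE"
q3 = q3_accuracy(predicted, observed)
print(f"Q3 = {q3:.3f}")
```

Benchmarks usually average this per-protein value over all targets, or pool residues across targets; which convention a given paper uses should be checked in its methods section.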
|
46
|
Dahal S, Yurkovich JT, Xu H, Palsson BO, Yang L. Synthesizing Systems Biology Knowledge from Omics Using Genome-Scale Models. Proteomics 2020; 20:e1900282. [PMID: 32579720 PMCID: PMC7501203 DOI: 10.1002/pmic.201900282] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 03/08/2020] [Revised: 06/13/2020] [Indexed: 12/18/2022]
Abstract
Omic technologies have enabled the complete readout of the molecular state of a cell at different biological scales. In principle, the combination of multiple omic data types can provide an integrated view of the entire biological system. This integration requires appropriate models in a systems biology approach. Here, genome-scale models (GEMs) are focused upon as one computational systems biology approach for interpreting and integrating multi-omic data. GEMs convert the reactions (related to metabolism, transcription, and translation) that occur in an organism to a mathematical formulation that can be modeled using optimization principles. A variety of genome-scale modeling methods used to interpret multiple omic data types, including genomics, transcriptomics, proteomics, metabolomics, and meta-omics, are reviewed. The ability to interpret omics in the context of biological systems has yielded important findings for human health, environmental biotechnology, bioenergy, and metabolic engineering. The authors find that, concurrent with advancements in omic technologies, genome-scale modeling methods are also expanding to enable better interpretation of omic data. Therefore, continued synthesis of valuable knowledge through the integration of omic data with GEMs is expected.
Affiliation(s)
- Sanjeev Dahal
  - Department of Chemical Engineering, Queen’s University, Kingston, Canada
- Hao Xu
  - Department of Chemical Engineering, Queen’s University, Kingston, Canada
- Bernhard O. Palsson
  - Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
  - Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Laurence Yang
  - Department of Chemical Engineering, Queen’s University, Kingston, Canada
|
47
|
Stamboulian M, Guerrero RF, Hahn MW, Radivojac P. The ortholog conjecture revisited: the value of orthologs and paralogs in function prediction. Bioinformatics 2020; 36:i219-i226. [PMID: 32657391 PMCID: PMC7355290 DOI: 10.1093/bioinformatics/btaa468] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Indexed: 11/18/2022] Open
Abstract
MOTIVATION The computational prediction of gene function is a key step in making full use of newly sequenced genomes. Function is generally predicted by transferring annotations from homologous genes or proteins for which experimental evidence exists. The 'ortholog conjecture' proposes that orthologous genes should be preferred when making such predictions, as they evolve functions more slowly than paralogous genes. Previous research has provided little support for the ortholog conjecture, though the incomplete nature of the data cast doubt on the conclusions. RESULTS We use experimental annotations from over 40 000 proteins, drawn from over 80 000 publications, to revisit the ortholog conjecture in two pairs of species: (i) Homo sapiens and Mus musculus and (ii) Saccharomyces cerevisiae and Schizosaccharomyces pombe. By making a distinction between questions about the evolution of function versus questions about the prediction of function, we find strong evidence against the ortholog conjecture in the context of function prediction, though questions about the evolution of function remain difficult to address. In both pairs of species, we quantify the amount of information that would be ignored if paralogs are discarded, as well as the resulting loss in prediction accuracy. Taken as a whole, our results support the view that the types of homologs used for function transfer are largely irrelevant to the task of function prediction. Maximizing the amount of data used for this task, regardless of whether it comes from orthologs or paralogs, is most likely to lead to higher prediction accuracy. AVAILABILITY AND IMPLEMENTATION https://github.com/predragradivojac/oc. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Moses Stamboulian
  - Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
- Rafael F Guerrero
  - Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA
- Matthew W Hahn
  - Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
  - Department of Biology, Indiana University, Bloomington, IN 47405, USA
- Predrag Radivojac
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
|
48
|
Koo DCE, Bonneau R. Towards region-specific propagation of protein functions. Bioinformatics 2020; 35:1737-1744. [PMID: 30304483 PMCID: PMC6513163 DOI: 10.1093/bioinformatics/bty834] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 02/23/2018] [Revised: 08/23/2018] [Accepted: 10/08/2018] [Indexed: 01/06/2023] Open
Abstract
MOTIVATION Due to the nature of experimental annotation, most protein function prediction methods operate at the protein-level, where functions are assigned to full-length proteins based on overall similarities. However, most proteins function by interacting with other proteins or molecules, and many functional associations should be limited to specific regions rather than the entire protein length. Most domain-centric function prediction methods depend on accurate domain family assignments to infer relationships between domains and functions, with regions that are unassigned to a known domain-family left out of functional evaluation. Given the abundance of residue-level annotations currently available, we present a function prediction methodology that automatically infers function labels of specific protein regions using protein-level annotations and multiple types of region-specific features. RESULTS We apply this method to local features obtained from InterPro, UniProtKB and amino acid sequences and show that this method improves both the accuracy and region-specificity of protein function transfer and prediction. We compare region-level predictive performance of our method against that of a whole-protein baseline method using proteins with structurally verified binding sites and also compare protein-level temporal holdout predictive performances to expand the variety and specificity of GO terms we could evaluate. Our results can also serve as a starting point to categorize GO terms into region-specific and whole-protein terms and select prediction methods for different classes of GO terms. AVAILABILITY AND IMPLEMENTATION The code and features are freely available at: https://github.com/ek1203/rsfp. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Da Chen Emily Koo
  - Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA
- Richard Bonneau
  - Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
  - Center for Data Science, New York University, New York, NY, USA
|
49
|
Buchan DWA, Jones DT. The PSIPRED Protein Analysis Workbench: 20 years on. Nucleic Acids Res 2020; 47:W402-W407. [PMID: 31251384 PMCID: PMC6602445 DOI: 10.1093/nar/gkz297] [Citation(s) in RCA: 917] [Impact Index Per Article: 183.4] [Received: 02/01/2019] [Revised: 04/02/2019] [Accepted: 04/15/2019] [Indexed: 02/07/2023] Open
Abstract
The PSIPRED Workbench is a web server that has offered a range of predictive methods to the bioscience community for 20 years. Here, we present the work we have completed to update the PSIPRED Protein Analysis Workbench and make it ready for the next 20 years. The main focus of our recent website upgrade work has been the acceleration of analyses in the face of increasing protein sequence database size. We additionally discuss any new software, the new hardware infrastructure, our web services and website. Lastly, we survey updates to some of the key predictive algorithms available through our website.
Affiliation(s)
- Daniel W A Buchan
  - UCL Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK
- David T Jones
  - UCL Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK
|
50
|
Strodthoff N, Wagner P, Wenzel M, Samek W. UDSMProt: universal deep sequence models for protein classification. Bioinformatics 2020; 36:2401-2409. [PMID: 31913448 PMCID: PMC7178389 DOI: 10.1093/bioinformatics/btaa003] [Citation(s) in RCA: 82] [Impact Index Per Article: 16.4] [Received: 09/02/2019] [Revised: 12/13/2019] [Accepted: 01/02/2020] [Indexed: 01/03/2023] Open
Abstract
MOTIVATION Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches for protein classification are tailored to single classification tasks and rely on handcrafted features, such as position-specific-scoring matrices from expensive database searches. We argue that this level of performance can be reached or even be surpassed by learning a task-agnostic representation once, using self-supervised language modeling, and transferring it to specific tasks by a simple fine-tuning step. RESULTS We put forward a universal deep sequence model that is pre-trained on unlabeled protein sequences from Swiss-Prot and fine-tuned on protein classification tasks. We apply it to three prototypical tasks, namely enzyme class prediction, gene ontology prediction and remote homology and fold detection. The proposed method performs on par with state-of-the-art algorithms that were tailored to these specific tasks or, for two out of three tasks, even outperforms them. These results stress the possibility of inferring protein properties from the sequence alone and, on more general grounds, the prospects of modern natural language processing methods in omics. Moreover, we illustrate the prospects for explainable machine learning methods in this field by selected case studies. AVAILABILITY AND IMPLEMENTATION Source code is available under https://github.com/nstrodt/UDSMProt. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nils Strodthoff
  - Department of Video Coding & Analytics, Fraunhofer Heinrich Hertz Institute, Berlin 10587, Germany
- Patrick Wagner
  - Department of Video Coding & Analytics, Fraunhofer Heinrich Hertz Institute, Berlin 10587, Germany
- Markus Wenzel
  - Department of Video Coding & Analytics, Fraunhofer Heinrich Hertz Institute, Berlin 10587, Germany
- Wojciech Samek
  - Department of Video Coding & Analytics, Fraunhofer Heinrich Hertz Institute, Berlin 10587, Germany
|