1
Tan Y, Zhou B, Zheng L, Fan G, Hong L. Semantical and geometrical protein encoding toward enhanced bioactivity and thermostability. eLife 2025; 13:RP98033. [PMID: 40314227; PMCID: PMC12048155; DOI: 10.7554/elife.98033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2025] Open
Abstract
Protein engineering is a pivotal aspect of synthetic biology, involving the modification of amino acids within existing protein sequences to achieve novel or enhanced functionalities and physical properties. Accurate prediction of protein variant effects requires a thorough understanding of protein sequence, structure, and function. Deep learning methods have demonstrated remarkable performance in guiding protein modification for improved functionality. However, existing approaches predominantly rely on protein sequences, which face challenges in efficiently encoding the geometric aspects of amino acids' local environment and often fall short in capturing crucial details related to protein folding stability, internal molecular interactions, and bio-functions. Furthermore, there lacks a fundamental evaluation for developed methods in predicting protein thermostability, although it is a key physical property that is frequently investigated in practice. To address these challenges, this article introduces a novel pre-training framework that integrates sequential and geometric encoders for protein primary and tertiary structures. This framework guides mutation directions toward desired traits by simulating natural selection on wild-type proteins and evaluates variant effects based on their fitness to perform specific functions. We assess the proposed approach using three benchmarks comprising over 300 deep mutational scanning assays. The prediction results showcase exceptional performance across extensive experiments compared to other zero-shot learning methods, all while maintaining a minimal cost in terms of trainable parameters. This study not only proposes an effective framework for more accurate and comprehensive predictions to facilitate efficient protein engineering, but also enhances the in silico assessment system for future deep learning models to better align with empirical requirements. 
The PyTorch implementation is available at https://github.com/ai4protein/ProtSSN.
Affiliation(s)
- Yang Tan
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, China
- Bingxin Zhou
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai Jiao Tong University, Shanghai, China
- Lirong Zheng
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
- Guisheng Fan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
- Liang Hong
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, China
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai Jiao Tong University, Shanghai, China
2
Ciuchcinski K, Kaczorowska AK, Biernacka D, Dorawa S, Kaczorowski T, Park Y, Piekarski K, Stanowski M, Ishikawa T, Stokke R, Steen IH, Dziewit L. Computational pipeline for sustainable enzyme discovery through (re)use of metagenomic data. Journal of Environmental Management 2025; 382:125381. [PMID: 40252419; DOI: 10.1016/j.jenvman.2025.125381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2024] [Revised: 04/03/2025] [Accepted: 04/13/2025] [Indexed: 04/21/2025]
Abstract
Enzymes derived from extremophilic organisms, also known as extremozymes, offer sustainable and efficient solutions for industrial applications. Valued for their resilience and low environmental impact, extremozymes have found use as catalysts in various processes, ranging from dairy production to pharmaceutical manufacturing. However, discovery of novel extremozymes is often hindered by challenges such as culturing difficulties, underrepresentation of extreme environments in reference databases, and limitations of traditional sequence-based screening methods. In this work, we present a computational pipeline designed to discover novel enzymes from metagenomic data derived from extreme environments. This pipeline represents a versatile and sustainable approach that promotes reuse and recycling of existing datasets and minimises the need for additional environmental sampling. In its core, the algorithm integrates both traditional bioinformatic techniques and recent advances in structural prediction, enabling rapid and accurate identification of enzymes. However, due to its design, the algorithm relies heavily on existing databases, which can limit its effectiveness in situations where reference data is scarce or when encountering novel protein families. As a proof-of-concept, we applied the pipeline to metagenomic data from deep-sea hydrothermal vents, with a focus on β-galactosidases. The pipeline identified 11 potential candidate proteins, out of which 10 showed in vitro activity. One of the selected enzymes, βGal_UW07, showed strong potential for industrial applications. The enzyme exhibited optimal activity at 70 °C and was exceptionally resistant to high pH and the presence of metal ions and reducing agents. Overall, our results indicate that the pipeline is highly accurate and can play a key role in sustainable bioprospecting, leveraging existing metagenomic datasets and minimising in situ interventions in pristine regions.
Affiliation(s)
- Karol Ciuchcinski
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
- Anna-Karina Kaczorowska
  - Collection of Plasmids and Microorganisms | KPD, Department of Microbiology, Faculty of Biology, University of Gdańsk, Wita Stwosza 59, 80-308, Gdańsk, Poland.
- Daria Biernacka
  - Collection of Plasmids and Microorganisms | KPD, Department of Microbiology, Faculty of Biology, University of Gdańsk, Wita Stwosza 59, 80-308, Gdańsk, Poland; Structural Biology Laboratory, Intercollegiate Faculty of Biotechnology of University of Gdansk and Medical University of Gdańsk, Abrahama 58, 80-307, Gdańsk, Poland.
- Sebastian Dorawa
  - Laboratory of Extremophiles Biology, Department of Microbiology, Faculty of Biology, University of Gdańsk, Wita Stwosza 59, 80-308, Gdańsk, Poland.
- Tadeusz Kaczorowski
  - Laboratory of Extremophiles Biology, Department of Microbiology, Faculty of Biology, University of Gdańsk, Wita Stwosza 59, 80-308, Gdańsk, Poland.
- Younginn Park
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
- Karol Piekarski
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
- Michal Stanowski
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
- Takao Ishikawa
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
- Runar Stokke
  - Department of Biological Sciences, Center for Deep Sea Research, University of Bergen, Postboks 7803, N-5020, Bergen, Norway.
- Ida Helene Steen
  - Department of Biological Sciences, Center for Deep Sea Research, University of Bergen, Postboks 7803, N-5020, Bergen, Norway.
- Lukasz Dziewit
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
3
Weissenow K, Rost B. Are protein language models the new universal key? Curr Opin Struct Biol 2025; 91:102997. [PMID: 39921962; DOI: 10.1016/j.sbi.2025.102997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2024] [Revised: 12/20/2024] [Accepted: 01/16/2025] [Indexed: 02/10/2025]
Abstract
Protein language models (pLMs) capture some aspects of the grammar of the language of life as written in protein sequences. The so-called pLM embeddings implicitly contain this information. Therefore, embeddings can serve as the exclusive input into downstream supervised methods for protein prediction. Over the last 33 years, evolutionary information extracted through simple averaging for specific protein families from multiple sequence alignments (MSAs) has been the most successful universal key to the success of protein prediction. For many applications, MSA-free pLM-based predictions now have become significantly more accurate. The reason for this is often a combination of two aspects. Firstly, embeddings condense the grammar so efficiently that downstream prediction methods succeed with small models, i.e., they need few free parameters in particular in the era of exploding deep neural networks. Secondly, pLM-based methods provide protein-specific solutions. As additional benefit, once the pLM pre-training is complete, pLM-based solutions tend to consume much fewer resources than MSA-based solutions. In fact, we appeal to the community to rather optimize foundation models than to retrain new ones and to evolve incentives for solutions that require fewer resources even at some loss in accuracy. Although pLMs have not, yet, succeeded to entirely replace the body of solutions developed over three decades, they clearly are rapidly advancing as the universal key for protein prediction.
Affiliation(s)
- Konstantin Weissenow
  - TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany.
- Burkhard Rost
  - TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany; TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
4
Røgen P. Sequence-Similar Protein Domain Pairs With Structural or Topological Dissimilarity. Proteins 2025; 93:588-597. [PMID: 39392124; PMCID: PMC11809131; DOI: 10.1002/prot.26753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2024] [Revised: 08/14/2024] [Accepted: 09/26/2024] [Indexed: 10/12/2024]
Abstract
For a variety of applications, protein structures are clustered by sequence similarity, and sequence-redundant structures are disregarded. Sequence-similar chains are likely to have similar structures, but significant structural variation, as measured with RMSD, has been documented for sequence-similar chains and found usually to have a functional explanation. Moving two neighboring stretches of backbone through each other may change the chain topology and alter possible folding paths. The size of this motion is compatible to a variation in a flexible loop. We search and find domains with alternate chain topology in CATH4.2 sequence families relatively independent of sequence identity and of structural similarity as measured by RMSD. Structural, topological, and functional representative sets should therefore keep sequence-similar domains not just with structural variation but also with topological variation. We present BCAlign that finds Alignment and superposition of protein Backbone Curves by optimizing a user chosen convex combination of structural derivation and derivation between the structure-based sequence alignment and an input sequence alignment. Steric and topological obstructions from deforming a curve into an aligned curve are then found by a previously developed algorithm. For highly sequence-similar domains, sequence-based structural alignment better represents the chains motion and generally reveals larger structural and topological variation than structure-based does. Fold-switching protein pairs have been reported to be most frequent between X-ray and NMR structures and estimated to be underrepresented in the PDB as the alternate configuration is harder to resolve. Here we similarly find chain topology most frequently altered between X-ray and NMR structures.
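The "user chosen convex combination" mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the functional form, so everything here is an assumption: the two terms are read as non-negative deviation scores, `weight` is the user-chosen mixing parameter in [0, 1], and the function and variable names are hypothetical rather than taken from BCAlign.

```python
def combined_objective(structural_dev: float, alignment_dev: float, weight: float) -> float:
    """Convex combination of two deviation terms (illustrative only).

    weight = 0 scores a candidate purely on structural deviation;
    weight = 1 scores it purely on deviation from the input sequence alignment.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1] for a convex combination")
    return (1.0 - weight) * structural_dev + weight * alignment_dev


# Toy candidates: (structural deviation, deviation from the input alignment).
candidates = {"structure_driven": (1.0, 6.0), "sequence_driven": (3.0, 0.5)}


def best_candidate(weight: float) -> str:
    """Return the name of the candidate with the lowest combined score."""
    return min(candidates, key=lambda name: combined_objective(*candidates[name], weight))
```

With these toy numbers, a small weight favours the structure-driven candidate and a large weight favours the sequence-driven one, which is the trade-off the mixing parameter is meant to expose.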
Affiliation(s)
- Peter Røgen
  - Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
5
Li R, He X, Wu C, Li M, Zhang J. Advances in structure-based allosteric drug design. Curr Opin Struct Biol 2025; 90:102974. [PMID: 39736214; DOI: 10.1016/j.sbi.2024.102974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2024] [Revised: 11/27/2024] [Accepted: 12/06/2024] [Indexed: 01/01/2025]
Abstract
The identification of allosteric binding sites forms a critical connection between structural and computational biology, substantially advancing the discovery of allosteric drugs. However, the prevailing strategies for allosteric drug development predominantly rely on high-throughput screening, which suffers from high failure rates due to a limited understanding of allosteric mechanisms. This review collects insights from case studies on allosteric mechanisms, protein structure databases and computation algorithm developments, aiming to enhance our comprehension of allostery and guide more effective allosteric drug development. A crucial element in this area is the integration of structural biology with computational biology, which is vital for translating three-dimensional structural datasets into available drug discovery knowledge. These datasets and AI algorithms underpin the establishment of the allosteric binding site identification leading to structure-activity relationships (SARs) and are fueling the development of computational algorithms tailored for allosteric proteins, thereby driving forward the field of allosteric drug discovery.
Affiliation(s)
- Rui Li
  - State Key Laboratory of Medical Genomics, National Research Center for Translational Medicine at Shanghai, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Medicinal Chemistry and Bioinformatics Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- Xinheng He
  - Medicinal Chemistry and Bioinformatics Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- Chengwei Wu
  - Medicinal Chemistry and Bioinformatics Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- Mingyu Li
  - Medicinal Chemistry and Bioinformatics Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
- Jian Zhang
  - State Key Laboratory of Medical Genomics, National Research Center for Translational Medicine at Shanghai, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Medicinal Chemistry and Bioinformatics Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China; Key Laboratory of Protection, Development, and Utilization of Medicinal Resources in Liupanshan Area, Ministry of Education, Peptides & Protein Drug Research Center, School of Pharmacy, Ningxia Medical University, Yinchuan 750004, China.
6
Pan T, Bi Y, Wang X, Zhang Y, Webb GI, Gasser RB, Kurgan L, Song J. SCREEN: A Graph-based Contrastive Learning Tool to Infer Catalytic Residues and Assess Enzyme Mutations. Genomics, Proteomics & Bioinformatics 2025; 22:qzae094. [PMID: 39724324; PMCID: PMC11961199; DOI: 10.1093/gpbjnl/qzae094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2024] [Revised: 12/05/2024] [Accepted: 12/06/2024] [Indexed: 12/28/2024]
Abstract
The accurate identification of catalytic residues contributes to our understanding of enzyme functions in biological processes and pathways. The increasing number of protein sequences necessitates computational tools for the automated prediction of catalytic residues in enzymes. Here, we introduce SCREEN, a graph neural network for the high-throughput prediction of catalytic residues via the integration of enzyme functional and structural information. SCREEN constructs residue representations based on spatial arrangements and incorporates enzyme function priors into such representations through contrastive learning. We demonstrate that SCREEN (1) consistently outperforms currently-available predictors; (2) provides accurate results when applied to inferred enzyme structures; and (3) generalizes well to enzymes dissimilar from those in the training set. We also show that the putative catalytic residues predicted by SCREEN mimic key structural and biophysical characteristics of native catalytic residues. Moreover, using experimental datasets, we show that SCREEN's predictions can be used to distinguish residues with a high mutation tolerance from those likely to cause functional loss when mutated, indicating that this tool might be used to infer disease-associated mutations. SCREEN is publicly available at https://github.com/BioColLab/SCREEN and https://ngdc.cncb.ac.cn/biocode/tool/7580.
Affiliation(s)
- Tong Pan
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
  - Monash Biomedicine Discovery Institute-Wenzhou Medical University Alliance in Clinical and Experimental Biomedicine, Monash University, Clayton, VIC 3800, Australia
- Yue Bi
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
  - Monash Biomedicine Discovery Institute-Wenzhou Medical University Alliance in Clinical and Experimental Biomedicine, Monash University, Clayton, VIC 3800, Australia
- Xiaoyu Wang
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
  - Monash Biomedicine Discovery Institute-Wenzhou Medical University Alliance in Clinical and Experimental Biomedicine, Monash University, Clayton, VIC 3800, Australia
- Ying Zhang
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Geoffrey I Webb
  - Department of Data Science and Artificial Intelligence, Monash University, Clayton, VIC 3800, Australia
- Robin B Gasser
  - Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, VIC 3010, Australia
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
  - Monash Biomedicine Discovery Institute-Wenzhou Medical University Alliance in Clinical and Experimental Biomedicine, Monash University, Clayton, VIC 3800, Australia
  - Key Laboratory of Clinical Laboratory Diagnosis and Translational Research of Zhejiang Province, Department of Clinical Laboratory, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325015, China
7
Imam IA, Bailey S, Wang D, Zeng S, Xu D, Shao Q. Integrating Protein Language Model and Molecular Dynamics Simulations to Discover Antibiofouling Peptides. Langmuir: the ACS Journal of Surfaces and Colloids 2025; 41:811-821. [PMID: 39810350; PMCID: PMC11969446; DOI: 10.1021/acs.langmuir.4c04140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2025]
Abstract
Antibiofouling peptide materials prevent the nonspecific adsorption of proteins on devices, enabling them to perform their designed functions as desired in complex biological environments. Due to their importance, research on antibiofouling peptide materials has been one of the central subjects of interfacial engineering. However, only a few antibiofouling peptide sequences have been developed. This narrow scope of antibiofouling peptide materials limits their capacity to adapt to the broad spectrum of application scenarios. To address this issue, we searched for antibiofouling peptides in the vast sequence pool of the microbiome library using a combination of deep learning-based high-throughput search and molecular dynamics (MD) simulations. A random forest-based model with an ensemble of ten independent classifiers was developed. Each classifier was trained by prompt-tuning the foundational protein language model Evolution Scaling Modeling version 2 (ESM2) on a distinct training data set. We constructed the databases containing the same amount of antibiofouling and biofouling peptide sequences to attenuate the bias of the existing databases. MD simulations were conducted to investigate the interfacial properties of six selected peptide candidates and their interactions with a lysozyme protein. Two known antibiofouling peptides, (glutamic acid (E)-lysine (K))15 and (EK-proline (P))10, and one known fouling peptide, (glycine)30, were used as the reference. The MD simulation results indicate that five of the six peptides present the potential to resist biofouling. Our research implies that deep learning and molecular simulations can be integrated to discover functional peptide materials for interfacial applications.
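The abstract describes ten classifiers, each trained on a distinct data set containing equal numbers of antibiofouling and biofouling sequences. How those sets were drawn is not stated; the sketch below assumes one plausible scheme, random subsampling of each class to the size of the smaller class with a different seed per set. The function name, the labels (1 for antibiofouling, 0 for fouling), and the toy peptides are all hypothetical.

```python
import random


def balanced_training_sets(positives, negatives, n_sets=10, seed=0):
    """Build n_sets class-balanced training sets by random subsampling.

    Each set holds the same number of positive (label 1) and negative
    (label 0) sequences, namely the size of the smaller class.
    """
    size = min(len(positives), len(negatives))
    training_sets = []
    for i in range(n_sets):
        rng = random.Random(seed + i)  # a distinct, reproducible draw per set
        chosen_pos = rng.sample(positives, size)
        chosen_neg = rng.sample(negatives, size)
        training_sets.append(
            [(seq, 1) for seq in chosen_pos] + [(seq, 0) for seq in chosen_neg]
        )
    return training_sets


# Toy data only; these are not sequences from the study's databases.
toy_positives = ["EKEKEKEK", "EKPEKPEKP", "KEKEKEKE"]
toy_negatives = ["GGGGGGGG", "LLLLVVVV", "FFWWFFWW", "AILVAILV", "WWWWLLLL"]
toy_sets = balanced_training_sets(toy_positives, toy_negatives)
```

Each of the ten sets could then be passed to a separate classifier, so that no single draw from the larger class dominates the ensemble.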
Affiliation(s)
- Ibrahim A Imam
  - Department of Chemical and Materials Engineering, University of Kentucky, Lexington, Kentucky 40506, United States
- Shea Bailey
  - Department of Chemical and Materials Engineering, University of Kentucky, Lexington, Kentucky 40506, United States
  - Department of Chemistry and Biochemistry, Butler University, Indianapolis, Indiana 46208, United States
- Duolin Wang
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, United States
- Shuai Zeng
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, United States
- Dong Xu
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, United States
- Qing Shao
  - Department of Chemical and Materials Engineering, University of Kentucky, Lexington, Kentucky 40506, United States
8
Xie X, Deng X, Chen L, Yuan J, Chen H, Wei C, Feng C, Liu X, Qiu G. From Gene to Structure: Unraveling Genomic Dark Matter in Ca. Accumulibacter. Environmental Science & Technology 2025; 59:628-639. [PMID: 39699575; DOI: 10.1021/acs.est.4c09948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/20/2024]
Abstract
"Candidatus Accumulibacter" is a unique and pivotal genus of polyphosphate-accumulating organisms prevalent in wastewater treatment plants and plays mainstay roles in the global phosphorus cycle. However, the efforts to fully understand their genetic and metabolic characteristics are largely hindered by major limitations in existing sequence-based annotation methods. Here, we reported an integrated approach combining pangenome analysis, protein structure prediction and clustering, and meta-omic characterization, to uncover genetic and metabolic traits previously unexplored for Ca. Accumulibacter. The identification of a previously overlooked pyrophosphate-fructose 6-phosphate 1-phosphotransferase gene (pfp) suggested that all Ca. Accumulibacter encoded a complete Embden-Meyerhof-Parnas pathway. A homologue of the phosphate-specific transport system accessory protein (PhoU) was suggested to be an inorganic phosphate transport (Pit) accessory protein (Pap) conferring effective and efficient phosphate transport. Additional lineage members were found to encode complete denitrification pathways. A pipeline was built, generating a pan-Ca. Accumulibacter annotation reference database, covering >200,000 proteins and their encoding genes. Benchmarking on 27 Ca. Accumulibacter genomes showed major improvement in the average annotation coverage from 51% to 82%. This pipeline is readily applicable to diverse cultured and uncultured bacteria to establish high-coverage annotation reference databases, facilitating the exploration of genomic dark matter in the bacterial domain.
Affiliation(s)
- Xiaojing Xie
  - School of Environment and Energy, South China University of Technology, Guangzhou 510006, China
- Xuhan Deng
  - School of Environment and Energy, South China University of Technology, Guangzhou 510006, China
- Liping Chen
  - School of Environment and Energy, South China University of Technology, Guangzhou 510006, China
- Jing Yuan
  - School of Environment and Energy, South China University of Technology, Guangzhou 510006, China
- Hang Chen
  - School of Environment and Energy, South China University of Technology, Guangzhou 510006, China
- Chaohai Wei
  - School of Environment and Energy, South China University of Technology, Guangzhou 510006, China
  - Guangdong Provincial Key Laboratory of Solid Wastes Pollution Control and Recycling, Guangzhou 510006, China
- Chunhua Feng
  - School of Environment and Energy, South China University of Technology, Guangzhou 510006, China
  - Guangdong Provincial Key Laboratory of Solid Wastes Pollution Control and Recycling, Guangzhou 510006, China
- Xianghui Liu
  - Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore 637551, Singapore
- Guanglei Qiu
  - School of Environment and Energy, South China University of Technology, Guangzhou 510006, China
  - Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore 637551, Singapore
  - Guangdong Provincial Key Laboratory of Solid Wastes Pollution Control and Recycling, Guangzhou 510006, China
  - The Key Lab of Pollution Control and Ecosystem Restoration in Industry Clusters, Ministry of Education, Guangzhou 510006, China
9
Xiao B, Zhang S, Ainiwaer M, Liu H, Ning L, Hong Y, Sun Y, Ji Y. Deep learning-based assessment of missense variants in the COG4 gene presented with bilateral congenital cataract. BMJ Open Ophthalmol 2025; 10:e001906. [PMID: 39809522; PMCID: PMC11751923; DOI: 10.1136/bmjophth-2024-001906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2024] [Accepted: 12/11/2024] [Indexed: 01/16/2025] Open
Abstract
OBJECTIVE: We compared the protein structure and pathogenicity of clinically relevant variants of the COG4 gene with AlphaFold2 (AF2), Alpha Missense (AM), and ThermoMPNN for the first time. METHODS AND ANALYSIS: The sequences of clinically relevant Cog4 missense variants (one novel identified p.Y714F and three pre-existing p.G512R, p.R729W and p.L769R from Uniprot Q9H9E3) were imported into AF2 for protein structural prediction, and the pathogenicity was estimated using AM and ThermoMPNN. Different pathogenicity metrics were aggregated with principal component analysis (PCA) and further analysed at three levels (amino acid position, substitution and post-translation) based on all possible Cog4 missense variants (n=14 915). RESULTS: Localised protein structural impact including change of conformation and amino acid polarity, breakage of hydrogen bond and salt-bridge, and formation of alpha-helix were identified among clinically relevant Cog4 variants. The global structural comparison with multidimensional scaling demonstrated variants with similar protein structures (AF2) tended to exhibit similar clinical and biological phenotypes. The Cog4 p.Y714F variant exhibited greater protein structural similarity to mutated Cog4 found in Saul‒Wilson syndrome (p.G512R) and shared similar clinical phenotype (congenital cataract and psychomotor retardation). PCA of included pathogenic metrics demonstrated p.Y714F occurred at a critical position in Cog4 amino acid sequence with disrupted post-translational phosphorylation. CONCLUSION: Deep learning algorithms, including AF2, AM and ThermoMPNN, can be useful for evaluating variant of uncertain significance (VUS) by structural and pathogenicity prediction. Despite classified as VUS (American College of Medical Genetics and Genomics criteria: PM1, PP4), the pathogenicity in this Cog4 variant cannot be ruled out and warrants further investigation.
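The count of all possible missense variants (n=14 915) can be checked with a short sketch. It assumes each position may be substituted by any of the 19 other standard amino acids; on that assumption the figure corresponds to a 785-residue sequence (19 × 785 = 14 915). The length is inferred from this arithmetic and is not stated in the abstract. The function name is hypothetical, and the placeholder sequence is not the Cog4 sequence.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def all_missense_variants(sequence):
    """Yield every single-residue substitution in the p.<wt><pos><alt> style
    used in the abstract (for example p.Y714F), with 1-based positions."""
    for position, wild_type in enumerate(sequence, start=1):
        for alternative in AMINO_ACIDS:
            if alternative != wild_type:
                yield f"p.{wild_type}{position}{alternative}"


# Placeholder sequence of the inferred length; only the count matters here.
placeholder = "A" * 785
variant_count = sum(1 for _ in all_missense_variants(placeholder))
```

Enumerating the variants this way gives one row per substitution, which is the table shape a per-variant metric aggregation such as PCA would operate on.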
Affiliation(s)
- Binghe Xiao
  - Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, NHC, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shaohua Zhang
  - Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, NHC, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Maierdanjiang Ainiwaer
  - Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, NHC, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Houyi Liu
  - Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, NHC, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Li Ning
  - Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, NHC, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Yingying Hong
  - Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, NHC, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Yang Sun
  - Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, NHC, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Yinghong Ji
  - Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, NHC, Shanghai, China
  - Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
|
10
|
Zielińska K, Udekwu KI, Rudnicki W, Frolova A, Łabaj PP. Healthy microbiome - moving towards functional interpretation. Gigascience 2025; 14:giaf015. [PMID: 40117176 PMCID: PMC11927397 DOI: 10.1093/gigascience/giaf015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Revised: 12/04/2024] [Accepted: 02/05/2025] [Indexed: 03/23/2025] Open
Abstract
BACKGROUND Microbiome-based disease prediction has significant potential as an early, noninvasive marker of multiple health conditions linked to dysbiosis of the human gut microbiota, thanks in part to decreasing sequencing and analysis costs. Microbiome health indices and other computational tools currently proposed in the field are often based on a microbiome's species richness and are completely reliant on taxonomic classification. A resurgent interest in a metabolism-centric, ecological approach has led to an increased understanding of microbiome metabolic and phenotypic complexity, revealing substantial limitations of taxonomy-reliant approaches. FINDINGS In this study, we introduce a new metagenomic health index developed in response to recent developments in microbiome definitions, in an effort to distinguish between healthy and unhealthy microbiomes, with a focus here on inflammatory bowel disease (IBD). The novelty of our approach is a shift from a traditional Linnaean phylogenetic classification toward a more holistic consideration of the metabolic functional potential underlying ecological interactions between species. Based on well-explored data cohorts, we compare our method and its performance with the most comprehensive indices to date, the taxonomy-based Gut Microbiome Health Index (GMHI) and the high-dimensional principal component analysis (hiPCA) methods, as well as with the standard taxon- and function-based Shannon entropy scoring. After demonstrating better performance than the other methods on the initially targeted IBD cohorts, we retrain our index on an additional 27 datasets obtained from different clinical conditions and validate our index's ability to distinguish between healthy and disease states using a variety of complementary benchmarking approaches. Finally, we demonstrate its superiority over the GMHI and the hiPCA on a longitudinal COVID-19 cohort and highlight the distinct robustness of our method to sequencing depth.
CONCLUSIONS Overall, we emphasize the potential of this metagenomic approach and advocate a shift toward functional approaches to better understand and assess microbiome health as well as provide directions for future index enhancements. Our method, q2-predict-dysbiosis (Q2PD), is freely available (https://github.com/Kizielins/q2-predict-dysbiosis).
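As a point of reference for the "standard taxon-based Shannon entropy scoring" that the abstract uses as a baseline, the following is a minimal sketch of how such a score can be computed from a vector of taxon abundances. It is not the Q2PD index and is not taken from the q2-predict-dysbiosis repository; the function name and the two sample vectors are hypothetical and chosen only for illustration.

```python
import math


def shannon_entropy(abundances):
    """Shannon entropy (natural log) of a vector of non-negative abundances."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    entropy = 0.0
    for count in abundances:
        if count > 0:  # a taxon with zero abundance contributes nothing
            p = count / total
            entropy -= p * math.log(p)
    return entropy


# Hypothetical abundance profiles (made-up numbers, not data from the study)
even_sample = [25, 25, 25, 25]   # four equally abundant taxa
skewed_sample = [97, 1, 1, 1]    # one dominant taxon

print(shannon_entropy(even_sample))
print(shannon_entropy(skewed_sample))
```

An evenly distributed community scores higher than one dominated by a single taxon, which is why entropy-style scores track richness and evenness but say nothing about metabolic function.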
Affiliation(s)
- Kinga Zielińska
  - Małopolska Centre of Biotechnology, Jagiellonian University, 30-387 Krakow, Poland
- Klas I Udekwu
  - Department of Biological Sciences, Bioinformatics and Computational Biology Program, University of Idaho, Moscow, ID 83843, USA
  - Swedish Environmental Epidemiology Centre, Department of Aquatic Sciences and Assessment, Swedish University of Agricultural Sciences, Uppsala SE75007, Sweden
- Witold Rudnicki
  - Faculty of Computer Science, University of Białystok, 15-351 Białystok, Poland
  - Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, 03-046 Warsaw, Poland
- Alina Frolova
  - Institute of Molecular Biology and Genetics of National Academy of Sciences of Ukraine, 03143 Kyiv, Ukraine
  - Kyiv Academic University, 03142 Kyiv, Ukraine
- Paweł P Łabaj
  - Małopolska Centre of Biotechnology, Jagiellonian University, 30-387 Krakow, Poland
|
11
|
Pinto Y, Bhatt AS. Sequencing-based analysis of microbiomes. Nat Rev Genet 2024; 25:829-845. [PMID: 38918544 DOI: 10.1038/s41576-024-00746-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/15/2024] [Indexed: 06/27/2024]
Abstract
Microbiomes occupy a range of niches and, in addition to having diverse compositions, they have varied functional roles that have an impact on agriculture, environmental sciences, and human health and disease. The study of microbiomes has been facilitated by recent technological and analytical advances, such as cheaper and higher-throughput DNA and RNA sequencing, improved long-read sequencing and innovative computational analysis methods. These advances are providing a deeper understanding of microbiomes at the genomic, transcriptional and translational level, generating insights into their function and composition at resolutions beyond the species level.
Affiliation(s)
- Yishay Pinto
  - Department of Genetics, Stanford University, Stanford, CA, USA
  - Department of Medicine, Divisions of Hematology and Blood & Marrow Transplantation, Stanford University, Stanford, CA, USA
- Ami S Bhatt
  - Department of Genetics, Stanford University, Stanford, CA, USA
  - Department of Medicine, Divisions of Hematology and Blood & Marrow Transplantation, Stanford University, Stanford, CA, USA
|
12
|
Roth K, Rana YS, Worobo R, Snyder AB. Alicyclobacillus suci produces more guaiacol in media and has duplicate copies of vdcC compared to closely related Alicyclobacillus acidoterrestris. Appl Environ Microbiol 2024; 90:e0042224. [PMID: 39382294 PMCID: PMC11577841 DOI: 10.1128/aem.00422-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Accepted: 09/15/2024] [Indexed: 10/10/2024] Open
Abstract
Some species of the genus Alicyclobacillus cause spoilage in juices and other beverages due to the production of guaiacol, a phenolic compound, and off-aroma. However, little is known about the genomic determinants of guaiacol production across the genus. In this study, we found that several of the genes significantly enriched in guaiacol-producing Alicyclobacillus spp. are associated with oxidative stress response, including vdcC, a phenolic acid decarboxylase putatively responsible for guaiacol synthesis. The food industry recognizes Alicyclobacillus acidoterrestris as the primary guaiacol-producing species found in beverages, though that species was recently split into two closely related yet genetically distinct species, Alicyclobacillus suci and A. acidoterrestris. We found that strains of A. suci (63.0 ± 14.2 ppm) produced significantly (P < 0.01) more guaiacol on average in media than did strains of A. acidoterrestris (25.2 ± 7.0 ppm). Additionally, A. suci and Alicyclobacillus fastidiosus genomes each had duplicate copies of vdcC, while only a single copy of vdcC was found in the genomes of A. acidoterrestris, Alicyclobacillus acidiphilus, and Alicyclobacillus herbarius. Although the food industry has not historically differentiated between A. suci and A. acidoterrestris, it may be increasingly important to target the species with greater spoilage potential. Therefore, we also demonstrated that sequencing a single locus, such as the full-length 16S region or rpoB, is sufficient to differentiate between A. acidoterrestris and A. suci. IMPORTANCE Microbial spoilage increases food waste. To address that challenge, it is critical to recognize and control those microbial groups with the greatest spoilage potential. Non-specific targeting of broad microbial groups (e.g., the genus of Alicyclobacillus) in which only some members cause food spoilage results in untenable, overly broad interventions. 
Much of the food industry does not differentiate between guaiacol-producing and non-guaiacol-producing Alicyclobacillus species. This is overly broad because Alicyclobacillus spp. which cannot produce guaiacol can be present in beverages without causing spoilage. Furthermore, no distinction is made between Alicyclobacillus suci and Alicyclobacillus acidoterrestris because A. suci is newly split from A. acidoterrestris and most of the food industry still considers them to be the same. However, these findings indicate that A. suci may have greater spoilage potential than A. acidoterrestris due to differences in their genomic determinants for guaiacol production.
Affiliation(s)
- Katerina Roth
  - Department of Food Science, Cornell University, Ithaca, New York, USA
- Randy Worobo
  - Department of Food Science, Cornell University, Ithaca, New York, USA
- Abigail B. Snyder
  - Department of Food Science, Cornell University, Ithaca, New York, USA
|
13
|
Szydlowski LM, Bulbul AA, Simpson AC, Kaya DE, Singh NK, Sezerman UO, Łabaj PP, Kosciolek T, Venkateswaran K. Adaptation to space conditions of novel bacterial species isolated from the International Space Station revealed by functional gene annotations and comparative genome analysis. Microbiome 2024; 12:190. [PMID: 39363369 PMCID: PMC11451251 DOI: 10.1186/s40168-024-01916-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Accepted: 08/21/2024] [Indexed: 10/05/2024]
Abstract
BACKGROUND The extreme environment of the International Space Station (ISS) puts selective pressure on microorganisms unintentionally introduced during its 20+ years of service as a low-orbit science platform and human habitat. Such pressure leads to the development of new features not found in their Earth-bound relatives, which enable them to adapt to unfavorable conditions. RESULTS In this study, we generated the functional annotation of the genomes of five newly identified species of Gram-positive bacteria, four of which are non-spore-forming and one spore-forming, all isolated from the ISS. Using a deep-learning-based tool, deepFRI, we were able to functionally annotate close to 100% of protein-coding genes in all studied species, outperforming other annotation tools. Our comparative genomic analysis highlights common characteristics across all five species and specific genetic traits that appear unique to these ISS microorganisms. Proteome analysis mirrored these genomic patterns, revealing similar traits. The collective annotations suggest adaptations to life in space, including the management of hypoosmotic stress related to microgravity via mechanosensitive channel proteins, increased DNA repair activity to counteract heightened radiation exposure, and the presence of mobile genetic elements enhancing metabolism. In addition, our findings suggest the evolution of certain genetic traits indicative of potential pathogenic capabilities, such as small molecule and peptide synthesis and ATP-dependent transporters. These traits, exclusive to the ISS microorganisms, further substantiate previous reports explaining why microbes exposed to space conditions demonstrate enhanced antibiotic resistance and pathogenicity. CONCLUSION Our findings indicate that the ISS-isolated microorganisms we studied have adapted to life in space.
Evidence such as mechanosensitive channel proteins, increased DNA repair activity, as well as metallopeptidases and novel S-layer oxidoreductases suggests a convergent adaptation among these diverse microorganisms, potentially complementing one another within the context of the microbiome. The common genes that facilitate adaptation to the ISS environment may enable bioproduction of essential biomolecules needed during future space missions, or serve as potential drug targets if these microorganisms pose health risks.
Affiliation(s)
- Lukasz M Szydlowski
  - Malopolska Centre of Biotechnology, Jagiellonian University, Gronostajowa 7A, Krakow, 30-387, Malopolska, Poland
  - Sano Centre for Computational Personalized Medicine, Czarnowiejska 36, Krakow, 30-054, Malopolskie, Poland
- Alper A Bulbul
  - Biostatistics and Medical Informatics Department, M. A. A. Acibadem University, İçerenköy, Kayıcdağı Cd.32, Istanbul, 34752, Turkey
- Anna C Simpson
  - Biotechnology and Planetary Protection Group, Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109, USA
- Deniz E Kaya
  - Biostatistics and Medical Informatics Department, M. A. A. Acibadem University, İçerenköy, Kayıcdağı Cd.32, Istanbul, 34752, Turkey
- Nitin K Singh
  - Biotechnology and Planetary Protection Group, Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109, USA
- Ugur O Sezerman
  - Biostatistics and Medical Informatics Department, M. A. A. Acibadem University, İçerenköy, Kayıcdağı Cd.32, Istanbul, 34752, Turkey
- Paweł P Łabaj
  - Malopolska Centre of Biotechnology, Jagiellonian University, Gronostajowa 7A, Krakow, 30-387, Malopolska, Poland
- Tomasz Kosciolek
  - Malopolska Centre of Biotechnology, Jagiellonian University, Gronostajowa 7A, Krakow, 30-387, Malopolska, Poland
  - Department of Data Science and Engineering, Silesian University of Technology, Akademicka 2A, Gliwice, 44-100, Slaskie, Poland
  - Sano Centre for Computational Personalized Medicine, Czarnowiejska 36, Krakow, 30-054, Malopolskie, Poland
- Kasthuri Venkateswaran
  - Biotechnology and Planetary Protection Group, Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109, USA
|
14
|
Duller S, Vrbancic S, Szydłowski Ł, Mahnert A, Blohs M, Predl M, Kumpitsch C, Zrim V, Högenauer C, Kosciolek T, Schmitz RA, Eberhard A, Dragovan M, Schmidberger L, Zurabischvili T, Weinberger V, Moser AM, Kolb D, Pernitsch D, Mohammadzadeh R, Kühnast T, Rattei T, Moissl-Eichinger C. Targeted isolation of Methanobrevibacter strains from fecal samples expands the cultivated human archaeome. Nat Commun 2024; 15:7593. [PMID: 39217206 PMCID: PMC11366006 DOI: 10.1038/s41467-024-52037-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2024] [Accepted: 08/21/2024] [Indexed: 09/04/2024] Open
Abstract
Archaea are vital components of the human microbiome, yet their study within the gastrointestinal tract (GIT) is limited by the scarcity of cultured representatives. Our study presents a method for the targeted enrichment and isolation of methanogenic archaea from human fecal samples. The procedure combines methane breath testing, in silico metabolic modeling, media optimization, FACS, dilution series, and genomic sequencing through Nanopore technology. Additional analyses include the co-cultured bacteriome, comparative genomics of archaeal genomes, functional comparisons, and structure-based protein function prediction of unknown differential traits. Successful establishment of stable archaeal cultures from 14 out of 16 fecal samples yielded nine previously uncultivated strains, eight of which are absent from a recent archaeome genome catalog. Comparative genomic and functional assessments of Methanobrevibacter smithii and Candidatus Methanobrevibacter intestini strains from individual donors revealed features potentially associated with gastrointestinal diseases. Our work broadens the available archaeal representatives for GIT studies and offers insights into the adaptability of Candidatus Methanobrevibacter intestini genomes in critical microbiome contexts.
Affiliation(s)
- Stefanie Duller
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Simone Vrbancic
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Łukasz Szydłowski
  - Malopolska Centre of Biotechnology, Jagiellonian University in Krakow, Krakow, Poland
  - Sano Centre for Computational Medicine, Krakow, Poland
- Alexander Mahnert
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
  - BioTechMed Graz, Graz, Austria
- Marcus Blohs
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Michael Predl
  - Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria
  - Doctoral School Microbiology and Environmental Science, University of Vienna, Vienna, Austria
- Christina Kumpitsch
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
  - BioTechMed Graz, Graz, Austria
- Verena Zrim
  - Center for Medical Research, Medical University of Graz, Graz, Austria
- Christoph Högenauer
  - Division of Gastroenterology and Hepatology, Department of Internal Medicine, Medical University of Graz, Graz, Austria
- Tomasz Kosciolek
  - Malopolska Centre of Biotechnology, Jagiellonian University in Krakow, Krakow, Poland
  - Sano Centre for Computational Medicine, Krakow, Poland
  - Department of Data Science and Engineering, Silesian University of Technology, Gliwice, Poland
- Ruth A Schmitz
  - Institute for General Microbiology, Christian Albrechts University, Kiel, Germany
- Anna Eberhard
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Melanie Dragovan
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Laura Schmidberger
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Tamara Zurabischvili
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Viktoria Weinberger
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Adrian Mathias Moser
  - Division of Gastroenterology and Hepatology, Department of Internal Medicine, Medical University of Graz, Graz, Austria
- Dagmar Kolb
  - Core Facility Ultrastructure Analysis, Medical University of Graz, Graz, Austria
  - Gottfried Schatz Research Center, Medical University of Graz, Graz, Austria
- Dominique Pernitsch
  - Core Facility Ultrastructure Analysis, Medical University of Graz, Graz, Austria
- Rokhsareh Mohammadzadeh
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Torben Kühnast
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
- Thomas Rattei
  - Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria
- Christine Moissl-Eichinger
  - D&R Institute of Hygiene, Microbiology and Environmental Medicine, Medical University of Graz, Graz, Austria
  - BioTechMed Graz, Graz, Austria
|
15
|
Zhou J, Huang M. Navigating the landscape of enzyme design: from molecular simulations to machine learning. Chem Soc Rev 2024; 53:8202-8239. [PMID: 38990263 DOI: 10.1039/d4cs00196f] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/12/2024]
Abstract
Global environmental issues and sustainable development call for new technologies for fine chemical synthesis and waste valorization. Biocatalysis has attracted great attention as the alternative to the traditional organic synthesis. However, it is challenging to navigate the vast sequence space to identify those proteins with admirable biocatalytic functions. The recent development of deep-learning based structure prediction methods such as AlphaFold2 reinforced by different computational simulations or multiscale calculations has largely expanded the 3D structure databases and enabled structure-based design. While structure-based approaches shed light on site-specific enzyme engineering, they are not suitable for large-scale screening of potential biocatalysts. Effective utilization of big data using machine learning techniques opens up a new era for accelerated predictions. Here, we review the approaches and applications of structure-based and machine-learning guided enzyme design. We also provide our view on the challenges and perspectives on effectively employing enzyme design approaches integrating traditional molecular simulations and machine learning, and the importance of database construction and algorithm development in attaining predictive ML models to explore the sequence fitness landscape for the design of admirable biocatalysts.
Affiliation(s)
- Jiahui Zhou
  - School of Chemistry and Chemical Engineering, Queen's University, David Keir Building, Stranmillis Road, Belfast BT9 5AG, Northern Ireland, UK
- Meilan Huang
  - School of Chemistry and Chemical Engineering, Queen's University, David Keir Building, Stranmillis Road, Belfast BT9 5AG, Northern Ireland, UK
|
16
|
Zhao F, Qiu J, Xiang D, Jiao P, Cao Y, Xu Q, Qiao D, Xu H, Cao Y. deepAMPNet: a novel antimicrobial peptide predictor employing AlphaFold2 predicted structures and a bi-directional long short-term memory protein language model. PeerJ 2024; 12:e17729. [PMID: 39040937 PMCID: PMC11262304 DOI: 10.7717/peerj.17729] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Accepted: 06/20/2024] [Indexed: 07/24/2024] Open
Abstract
Background Global public health is seriously threatened by the escalating issue of antimicrobial resistance (AMR). Antimicrobial peptides (AMPs), pivotal components of the innate immune system, have emerged as a potent solution to AMR due to their therapeutic potential. Computational methods for the prompt recognition of these peptides open fresh perspectives and could substantially change antimicrobial drug development. Methods In this study, we developed a model named deepAMPNet. This model, which leverages graph neural networks, excels at the swift identification of AMPs. It employs structures of antimicrobial peptides predicted by AlphaFold2, encodes residue-level features through a bi-directional long short-term memory (Bi-LSTM) protein language model, and constructs adjacency matrices based on amino acid contact maps. Results In a comparative study with other state-of-the-art AMP predictors on two external independent test datasets, deepAMPNet achieved higher accuracy. Furthermore, in terms of commonly accepted evaluation metrics such as AUC, MCC, sensitivity, and specificity, deepAMPNet achieved the highest or highly comparable performance relative to other predictors. Conclusion deepAMPNet combines structural and sequence information of AMPs and is a high-performance identification model that can advance the development and design of antimicrobial peptide pharmaceuticals. The data and code utilized in this study can be accessed at https://github.com/Iseeu233/deepAMPNet.
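To make the contact-map step in the Methods concrete, here is a minimal sketch of how a binary adjacency matrix can be derived from per-residue 3D coordinates. It is an illustration only and is not taken from the deepAMPNet repository: the function name, the 8 Å distance cutoff, and the toy coordinates are all assumptions, and the abstract does not state which atoms or threshold the authors use.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary adjacency matrix: 1 where two residues lie within `threshold` angstroms.

    `coords` is a list of (x, y, z) tuples, one representative atom per residue
    (for example the C-alpha atom; this choice is an assumption).
    """
    n = len(coords)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                adjacency[i][j] = 1
                adjacency[j][i] = 1  # contacts are symmetric
    return adjacency


# Toy coordinates for a hypothetical four-residue peptide (made-up values)
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(toy_coords):
    print(row)
```

In a graph neural network setting, each residue becomes a node carrying its residue-level feature vector, and this matrix defines the edges along which information is passed.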
Affiliation(s)
- Fei Zhao
  - Microbiology and Metabolic Engineering Laboratory of Sichuan Province, College of Life Science, Sichuan University, Chengdu, Sichuan, China
- Junhui Qiu
  - Microbiology and Metabolic Engineering Laboratory of Sichuan Province, College of Life Science, Sichuan University, Chengdu, Sichuan, China
- Dongyou Xiang
  - Microbiology and Metabolic Engineering Laboratory of Sichuan Province, College of Life Science, Sichuan University, Chengdu, Sichuan, China
- Pengrui Jiao
  - Microbiology and Metabolic Engineering Laboratory of Sichuan Province, College of Life Science, Sichuan University, Chengdu, Sichuan, China
- Yu Cao
  - Microbiology and Metabolic Engineering Laboratory of Sichuan Province, College of Life Science, Sichuan University, Chengdu, Sichuan, China
- Qingrui Xu
  - Microbiology and Metabolic Engineering Laboratory of Sichuan Province, College of Life Science, Sichuan University, Chengdu, Sichuan, China
- Dairong Qiao
  - Microbiology and Metabolic Engineering Laboratory of Sichuan Province, College of Life Science, Sichuan University, Chengdu, Sichuan, China
- Hui Xu
  - Microbiology and Metabolic Engineering Laboratory of Sichuan Province, College of Life Science, Sichuan University, Chengdu, Sichuan, China
- Yi Cao
  - Microbiology and Metabolic Engineering Laboratory of Sichuan Province, College of Life Science, Sichuan University, Chengdu, Sichuan, China
|
17
|
Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, Strauss CEM, Leman JK, Cho K, Bonneau R. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol 2024; 42:975-985. [PMID: 37679542 PMCID: PMC11180608 DOI: 10.1038/s41587-023-01917-2] [Citation(s) in RCA: 29] [Impact Index Per Article: 29.0] [Received: 09/07/2022] [Accepted: 07/26/2023] [Indexed: 09/09/2023]
Abstract
Exploiting sequence-structure-function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure-structure similarities in large sequence databases. It is trained to accurately predict TM-scores as a metric of structural similarity directly from sequence pairs without the need for intermediate computation or solution of structures. Once structurally similar proteins have been identified, DeepBLAST can structurally align proteins using only sequence information by identifying structurally homologous regions between proteins. It outperforms traditional sequence alignment methods and performs similarly to structure-based alignment methods. We show the merits of TM-Vec and DeepBLAST on a variety of datasets, including better identification of remotely homologous proteins compared with state-of-the-art sequence alignment and structure prediction methods.
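For readers unfamiliar with the metric, the TM-score that TM-Vec is trained to predict is conventionally defined as below. This is the standard definition given as background, not a formula quoted from the abstract. Here L_target is the length of the reference protein, L_aligned is the number of aligned residue pairs, d_i is the distance between the i-th aligned pair after superposition, and the maximum is taken over superpositions.

```latex
\[
\mathrm{TM\text{-}score}
  = \max\left[
      \frac{1}{L_{\mathrm{target}}}
      \sum_{i=1}^{L_{\mathrm{aligned}}}
      \frac{1}{1 + \left( d_i / d_0(L_{\mathrm{target}}) \right)^{2}}
    \right],
\qquad
d_0(L) = 1.24\,\sqrt[3]{L - 15} - 1.8
\]
```

Scores lie in (0, 1], and the length-dependent scale d_0 makes them comparable across proteins of different sizes. Computing the score directly requires solved or predicted structures, which is the step the abstract says TM-Vec avoids by predicting it from sequence pairs.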
Grants
- R35GM122515 National Science Foundation (NSF)
- IOS-1546218 National Science Foundation (NSF)
- R35 GM122515 NIGMS NIH HHS
- R01 DK103358 NIDDK NIH HHS
- CBET- 1728858 National Science Foundation (NSF)
- R01 AI130945 NIAID NIH HHS
- This research was supported by NIH R01DK103358, the Simons Foundation, NSF- IOS-1546218, R35GM122515, NSF CBET- 1728858, NIH R01AI130945, to T.H. This research was supported by the intramural research program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) to J.T.M. This research was supported by the Flatiron Institute as part of the Simons Foundation to Robert Blackwell, J.K.L., and N.C. This research was supported by Los Alamos National Lab to C.S. This research was supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI), Samsung Research (Improving Deep Learning using Latent Structure), and NSF Award 1922658 to K.C.
- Simons Foundation
- U.S. Department of Health & Human Services | NIH | Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD)
Affiliation(s)
- Tymor Hamamsy
  - Center for Data Science, New York University, New York, NY, USA
- James T Morton
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
  - Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
- Robert Blackwell
  - Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Daniel Berenberg
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
  - Prescient Design, New York, NY, USA
- Nicholas Carriero
  - Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Julia Koehler Leman
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Kyunghyun Cho
  - Center for Data Science, New York University, New York, NY, USA
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
  - Prescient Design, New York, NY, USA
  - CIFAR, Toronto, Ontario, Canada
- Richard Bonneau
  - Center for Data Science, New York University, New York, NY, USA
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
  - Prescient Design, New York, NY, USA
  - Department of Biology, New York University, New York, NY, USA
|
18
|
Gaschignard G, Millet M, Bruley A, Benzerara K, Dezi M, Skouri-Panet F, Duprat E, Callebaut I. AlphaFold2-guided description of CoBaHMA, a novel family of bacterial domains within the heavy-metal-associated superfamily. Proteins 2024; 92:776-794. [PMID: 38258321 DOI: 10.1002/prot.26668] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2023] [Revised: 12/22/2023] [Accepted: 01/01/2024] [Indexed: 01/24/2024]
Abstract
Three-dimensional (3D) structure information, now available at the proteome scale, may facilitate the detection of remote evolutionary relationships in protein superfamilies. Here, we illustrate this with the identification of a novel family of protein domains related to the ferredoxin-like superfold, by combining (i) transitive sequence similarity searches, (ii) clustering approaches, and (iii) the use of AlphaFold2 3D structure models. Domains of this family were initially identified in relation with the intracellular biomineralization of calcium carbonates by Cyanobacteria. They are part of the large heavy-metal-associated (HMA) superfamily, departing from the latter by specific sequence and structural features. In particular, most of them share conserved basic amino acids (hence their name CoBaHMA for Conserved Basic residues HMA), forming a positively charged surface, which is likely to interact with anionic partners. CoBaHMA domains are found in diverse modular organizations in bacteria, existing in the form of monodomain proteins or as part of larger proteins, some of which are membrane proteins involved in transport or lipid metabolism. This suggests that the CoBaHMA domains may exert a regulatory function, involving interactions with anionic lipids. This hypothesis might have a particular resonance in the context of the compartmentalization observed for cyanobacterial intracellular calcium carbonates.
Affiliation(s)
- Geoffroy Gaschignard
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Maxime Millet
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Apolline Bruley
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Karim Benzerara
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Manuela Dezi
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Feriel Skouri-Panet
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Elodie Duprat
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Isabelle Callebaut
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
|
19
|
Bou Dagher L, Madern D, Malbos P, Brochier-Armanet C. Persistent homology reveals strong phylogenetic signal in 3D protein structures. PNAS Nexus 2024; 3:pgae158. [PMID: 38689707 PMCID: PMC11058471 DOI: 10.1093/pnasnexus/pgae158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Accepted: 04/01/2024] [Indexed: 05/02/2024]
Abstract
Changes that occur in proteins over time provide a phylogenetic signal that can be used to decipher their evolutionary history and the relationships between organisms. Sequence comparison is the most common way to access this phylogenetic signal, while those based on 3D structure comparisons are still in their infancy. In this study, we propose an effective approach based on Persistent Homology Theory (PH) to extract the phylogenetic information contained in protein structures. PH provides efficient and robust algorithms for extracting and comparing geometric features from noisy datasets at different spatial resolutions. PH has a growing number of applications in the life sciences, including the study of proteins (e.g. classification, folding). However, it has never been used to study the phylogenetic signal they may contain. Here, using 518 protein families, representing 22,940 protein sequences and structures, from 10 major taxonomic groups, we show that distances calculated with PH from protein structures correlate strongly with phylogenetic distances calculated from protein sequences, at both small and large evolutionary scales. We test several methods for calculating PH distances and propose some refinements to improve their relevance for addressing evolutionary questions. This work opens up new perspectives in evolutionary biology by proposing an efficient way to access the phylogenetic signal contained in protein structures, as well as future developments of topological analysis in the life sciences.
Affiliation(s)
- Léa Bou Dagher
  - Université Claude Bernard Lyon 1, CNRS, VetAgro Sup, Laboratoire de Biométrie et Biologie Évolutive, UMR5558, F-69622 Villeurbanne, France
  - Université Claude Bernard Lyon 1, CNRS, Institut Camille Jordan, UMR5208, F-69622 Villeurbanne, France
  - Université Libanaise, Laboratoire de Mathématiques, École Doctorale en Science et Technologie, PO BOX 5 Hadath, Liban
- Dominique Madern
  - University Grenoble Alpes, CEA, CNRS, IBS, 38000 Grenoble, France
- Philippe Malbos
  - Université Claude Bernard Lyon 1, CNRS, Institut Camille Jordan, UMR5208, F-69622 Villeurbanne, France
- Céline Brochier-Armanet
  - Université Claude Bernard Lyon 1, CNRS, VetAgro Sup, Laboratoire de Biométrie et Biologie Évolutive, UMR5558, F-69622 Villeurbanne, France
|
20
|
Greener JG, Jamali K. Fast protein structure searching using structure graph embeddings. Bioinformatics Advances 2024; 5:vbaf042. [PMID: 40196750 PMCID: PMC11974391 DOI: 10.1093/bioadv/vbaf042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 02/11/2025] [Accepted: 03/03/2025] [Indexed: 04/09/2025]
Abstract
Comparing and searching protein structures independent of primary sequence has proved useful for remote homology detection, function annotation, and protein classification. Fast and accurate methods to search with structures will be essential to make use of the vast databases that have recently become available, in the same way that fast protein sequence searching underpins much of bioinformatics. We train a simple graph neural network using supervised contrastive learning to learn a low-dimensional embedding of protein domains. Availability and implementation: The method, called Progres, is available as software at https://github.com/greener-group/progres and as a web server at https://progres.mrc-lmb.cam.ac.uk. It has accuracy comparable to the best current methods and can search the AlphaFold database TED domains in a 10th of a second per query on CPU.
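The idea of searching by embedding can be illustrated with a minimal sketch. It assumes each domain has already been reduced to a fixed-length vector and that hits are ranked by cosine similarity; the similarity measure, the function names, and the toy two-dimensional vectors are illustrative assumptions, not the Progres implementation or its API.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query: Sequence[float],
           database: Dict[str, Sequence[float]],
           k: int = 2) -> List[Tuple[str, float]]:
    """Return the k database entries most similar to the query embedding."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; a real model would produce higher-dimensional vectors.
toy_database = {
    "domain_a": [1.0, 0.0],
    "domain_b": [0.0, 1.0],
    "domain_c": [1.0, 1.0],
}
hits = search([1.0, 0.1], toy_database, k=2)
```

Here `hits` ranks `domain_a` first and `domain_c` second. Because every comparison is a dot product over short vectors, a search of this kind can stay fast on very large databases.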
Affiliation(s)
- Joe G Greener
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, CB2 0QH, United Kingdom
- Kiarash Jamali
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, CB2 0QH, United Kingdom
|
21
|
Wenzel M, Grüner E, Strodthoff N. Insights into the inner workings of transformer models for protein function prediction. Bioinformatics 2024; 40:btae031. [PMID: 38244570 PMCID: PMC10950482 DOI: 10.1093/bioinformatics/btae031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 12/14/2023] [Accepted: 01/16/2024] [Indexed: 01/22/2024]
Abstract
MOTIVATION: We explored how explainable artificial intelligence (XAI) can help to shed light into the inner workings of neural networks for protein function prediction, by extending the widely used XAI method of integrated gradients such that latent representations inside of transformer models, which were finetuned to Gene Ontology term and Enzyme Commission number prediction, can be inspected too. RESULTS: The approach enabled us to identify amino acids in the sequences that the transformers pay particular attention to, and to show that these relevant sequence parts reflect expectations from biology and chemistry, both in the embedding layer and inside of the model, where we identified transformer heads with a statistically significant correspondence of attribution maps with ground truth sequence annotations (e.g. transmembrane regions, active sites) across many proteins. AVAILABILITY AND IMPLEMENTATION: Source code can be accessed at https://github.com/markuswenzel/xai-proteins.
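For readers unfamiliar with integrated gradients, the standard form of the method (not the authors' extension to latent transformer representations) can be sketched on a toy function. The quadratic `toy_model`, its analytic gradient `toy_grad`, and the all-zero baseline are hypothetical choices made only to keep the example self-contained.

```python
from typing import Callable, List, Sequence


def integrated_gradients(grad_fn: Callable[[Sequence[float]], List[float]],
                         x: Sequence[float],
                         baseline: Sequence[float],
                         steps: int = 50) -> List[float]:
    """Approximate integrated gradients with a midpoint Riemann sum.

    The attribution for feature i is (x_i - baseline_i) times the average
    gradient along the straight path from the baseline to x.
    """
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Hypothetical toy model: f(x) = sum_i w_i * x_i^2, with its exact gradient.
WEIGHTS = [1.0, -2.0, 0.5]


def toy_model(x: Sequence[float]) -> float:
    return sum(w * v * v for w, v in zip(WEIGHTS, x))


def toy_grad(x: Sequence[float]) -> List[float]:
    return [2.0 * w * v for w, v in zip(WEIGHTS, x)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
```

The attributions sum to `toy_model(x) - toy_model(baseline)`, the completeness property that makes the scores interpretable as each input's share of the output.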
Affiliation(s)
- Markus Wenzel
  - Department of Artificial Intelligence, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, HHI, Einsteinufer 37, 10587 Berlin, Germany
- Erik Grüner
  - Department of Artificial Intelligence, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, HHI, Einsteinufer 37, 10587 Berlin, Germany
- Nils Strodthoff
  - School VI - Medicine and Health Services, Carl von Ossietzky University of Oldenburg, Ammerländer Heerstr. 114-118, 26129 Oldenburg, Germany
|
22
|
Cheon H, Kim JH, Kim JS, Park JB. Valorization of single-carbon chemicals by using carboligases as key enzymes. Curr Opin Biotechnol 2024; 85:103047. [PMID: 38128199 DOI: 10.1016/j.copbio.2023.103047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Revised: 11/23/2023] [Accepted: 11/24/2023] [Indexed: 12/23/2023]
Abstract
Single-carbon (C1) biorefinery plays a key role in the consumption of global greenhouse gases and a circular carbon economy. Thereby, we have focused on the valorization of C1 compounds (e.g. methanol, formaldehyde, and formate) into multicarbon products, including bioplastic monomers, glycolate, and ethylene glycol. For instance, methanol, derived from the oxidation of CH4, can be converted into glycolate, ethylene glycol, or erythrulose via formaldehyde and glycolaldehyde, employing C1 and/or C2 carboligases as essential enzymes. Escherichia coli was engineered to convert formate, produced from CO via CO2 or from CO2 directly, into glycolate. Recent progress in the design of biotransformation pathways, enzyme discovery, and engineering, as well as whole-cell biocatalyst engineering for C1 biorefinery, was addressed in this review.
Affiliation(s)
- Huijin Cheon
  - Department of Food Science and Biotechnology, Ewha Womans University, Seoul 03760, Republic of Korea
- Jun-Hong Kim
  - Department of Chemistry, Chonnam National University, Gwangju 61186, Republic of Korea
- Jeong-Sun Kim
  - Department of Chemistry, Chonnam National University, Gwangju 61186, Republic of Korea
- Jin-Byung Park
  - Department of Food Science and Biotechnology, Ewha Womans University, Seoul 03760, Republic of Korea
|
23
|
Wang W, Shuai Y, Yang Q, Zhang F, Zeng M, Li M. A comprehensive computational benchmark for evaluating deep learning-based protein function prediction approaches. Brief Bioinform 2024; 25:bbae050. [PMID: 38388682 PMCID: PMC10883809 DOI: 10.1093/bib/bbae050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 01/17/2024] [Accepted: 01/26/2024] [Indexed: 02/24/2024]
Abstract
Proteins play an important role in life activities and are the basic units for performing functions. Accurately annotating functions to proteins is crucial for understanding the intricate mechanisms of life and developing effective treatments for complex diseases. Traditional biological experiments struggle to keep pace with the growing number of known proteins. With the development of high-throughput sequencing technology, a wide variety of biological data provides the possibility to accurately predict protein functions by computational methods. Consequently, many computational methods have been proposed. Due to the diversity of application scenarios, it is necessary to conduct a comprehensive evaluation of these computational methods to determine the suitability of each algorithm for specific cases. In this study, we present a comprehensive benchmark, BeProf, to process data and evaluate representative computational methods. We first collect the latest datasets and analyze the data characteristics. Then, we investigate and summarize 17 state-of-the-art computational methods. Finally, we propose a novel comprehensive evaluation metric, design eight application scenarios and evaluate the performance of existing methods on these scenarios. Based on the evaluation, we provide practical recommendations for different scenarios, enabling users to select the most suitable method for their specific needs. All of these servers can be obtained from https://csuligroup.com/BEPROF and https://github.com/CSUBioGroup/BEPROF.
Affiliation(s)
- Wenkang Wang
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Yunyan Shuai
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Qiurong Yang
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Fuhao Zhang
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Min Zeng
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
|
24
|
Segura J, Rose Y, Bi C, Duarte J, Burley SK, Bittrich S. RCSB Protein Data Bank: visualizing groups of experimentally determined PDB structures alongside computed structure models of proteins. Frontiers in Bioinformatics 2023; 3:1311287. [PMID: 38111685 PMCID: PMC10726007 DOI: 10.3389/fbinf.2023.1311287] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Accepted: 11/17/2023] [Indexed: 12/20/2023]
Abstract
Recent advances in Artificial Intelligence and Machine Learning (e.g., AlphaFold, RosettaFold, and ESMFold) enable prediction of three-dimensional (3D) protein structures from amino acid sequences alone at accuracies comparable to lower-resolution experimental methods. These tools have been employed to predict structures across entire proteomes and the results of large-scale metagenomic sequence studies, yielding an exponential increase in available biomolecular 3D structural information. Given the enormous volume of this newly computed biostructure data, there is an urgent need for robust tools to manage, search, cluster, and visualize large collections of structures. Equally important is the capability to efficiently summarize and visualize metadata, biological/biochemical annotations, and structural features, particularly when working with vast numbers of protein structures of both experimental origin from the Protein Data Bank (PDB) and computationally-predicted models. Moreover, researchers require advanced visualization techniques that support interactive exploration of multiple sequences and structural alignments. This paper introduces a suite of tools provided on the RCSB PDB research-focused web portal RCSB.org, tailor-made for efficient management, search, organization, and visualization of this burgeoning corpus of 3D macromolecular structure data.
Affiliation(s)
- Joan Segura
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
- Yana Rose
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
- Chunxiao Bi
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
- Jose Duarte
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
- Stephen K. Burley
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
  - Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
  - Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, United States
  - Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
- Sebastian Bittrich
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
|
25
|
Zamani K, Mohsenpour M, Malboobi MA. Predicting the allergenic risk of Phosphite-NAD+-Oxidoreductase and purple acid phosphatase 17 proteins in genetically modified canola using bioinformatic approaches. Food Chem Toxicol 2023; 182:114094. [PMID: 37925014 DOI: 10.1016/j.fct.2023.114094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2023] [Revised: 09/25/2023] [Accepted: 10/08/2023] [Indexed: 11/06/2023]
Abstract
Recent advancements in the generation of high-throughput multi-omics data have provided a vast array of candidate genes for the genetic engineering of plants. However, as part of their safety assessment, newly expressed proteins in genetically modified crops must be evaluated for potential cross-reactivity with known allergens. In this study, we developed transgenic canola plants expressing the Arabidopsis thaliana PAP17 gene and a novel selectable marker composed of the ptxD gene from Pseudomonas stutzeri. To evaluate the potential allergenic cross-reactivity of the AtPAP17 and PTXD proteins expressed in transgenic canola, we applied a comprehensive approach utilizing sequence-based, motif-based, and 3D structure-based analyses. Our results demonstrate that the risk of conferring cross-reactivity with known allergens is negligible, indicating that the expression of these proteins in transgenic canola poses a low allergenic risk.
Affiliation(s)
- Katayoun Zamani
  - Department of Genetic Engineering and Biosafety, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research, Education and Extension Organization (AREEO), P.O. Box 31359-33151, Karaj, Iran
- Motahhareh Mohsenpour
  - Department of Genetic Engineering and Biosafety, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research, Education and Extension Organization (AREEO), P.O. Box 31359-33151, Karaj, Iran
- Mohammad Ali Malboobi
  - Department of Plant Biotechnology, National Institute of Genetic Engineering and Biotechnology, P.O. Box 14965-161, Tehran, Iran
|
26
|
Mahlich Y, Zhu C, Chung H, Velaga PK, De Paolis Kaluza M, Radivojac P, Friedberg I, Bromberg Y. Learning from the unknown: exploring the range of bacterial functionality. Nucleic Acids Res 2023; 51:10162-10175. [PMID: 37739408 PMCID: PMC10602916 DOI: 10.1093/nar/gkad757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2023] [Accepted: 09/11/2023] [Indexed: 09/24/2023]
Abstract
Determining the repertoire of a microbe's molecular functions is a central question in microbial biology. Modern techniques achieve this goal by comparing microbial genetic material against reference databases of functionally annotated genes/proteins or known taxonomic markers such as 16S rRNA. Here, we describe a novel approach to exploring bacterial functional repertoires without reference databases. Our Fusion scheme establishes functional relationships between bacteria and assigns organisms to Fusion-taxa that differ from otherwise defined taxonomic clades. Three key findings of our work stand out. First, bacterial functional comparisons outperform marker genes in assigning taxonomic clades. Fusion profiles are also better for this task than other functional annotation schemes. Second, Fusion-taxa are robust to addition of novel organisms and are, arguably, able to capture the environment-driven bacterial diversity. Finally, our alignment-free nucleic acid-based Siamese Neural Network model, created using Fusion functions, enables finding shared functionality of very distant, possibly structurally different, microbial homologs. Our work can thus help annotate functional repertoires of bacterial organisms and further guide our understanding of microbial communities.
Affiliation(s)
- Yannick Mahlich
  - Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- Chengsheng Zhu
  - Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
  - Xbiome Inc., 1 Broadway, 14th fl, Cambridge, MA 02142, USA
- Henri Chung
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
  - Interdepartmental program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA 50011, USA
- Pavan K Velaga
  - Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- M Clara De Paolis Kaluza
  - Khoury College of Computer Sciences, Northeastern University, 177 Huntington Avenue, Boston, MA 02115, USA
- Predrag Radivojac
  - Khoury College of Computer Sciences, Northeastern University, 177 Huntington Avenue, Boston, MA 02115, USA
- Iddo Friedberg
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
  - Interdepartmental program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA 50011, USA
- Yana Bromberg
  - Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
  - Department of Biology, Emory University, 1510 Clifton Road NE, Atlanta, GA 30322, USA
  - Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA 30322, USA
|
27
|
Thurimella K, Mohamed AMT, Graham DB, Owens RM, La Rosa SL, Plichta DR, Bacallado S, Xavier RJ. Protein Language Models Uncover Carbohydrate-Active Enzyme Function in Metagenomics. bioRxiv: The Preprint Server for Biology 2023:2023.10.23.563620. [PMID: 37961379 PMCID: PMC10634757 DOI: 10.1101/2023.10.23.563620] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/15/2023]
Abstract
In metagenomics, the pool of uncharacterized microbial enzymes presents a challenge for functional annotation. Among these, carbohydrate-active enzymes (CAZymes) stand out due to their pivotal roles in various biological processes related to host health and nutrition. Here, we present CAZyLingua, the first tool that harnesses protein language model embeddings to build a deep learning framework that facilitates the annotation of CAZymes in metagenomic datasets. Our benchmarking results showed on average a higher F1 score (reflecting an average of precision and recall) on the annotated genomes of Bacteroides thetaiotaomicron, Eggerthella lenta and Ruminococcus gnavus compared to the traditional sequence homology-based method in dbCAN2. We applied our tool to a paired mother/infant longitudinal dataset and revealed unannotated CAZymes linked to microbial development during infancy. When applied to metagenomic datasets derived from patients affected by fibrosis-prone diseases such as Crohn's disease and IgG4-related disease, CAZyLingua uncovered CAZymes associated with disease and healthy states. In each of these metagenomic catalogs, CAZyLingua discovered new annotations that were previously overlooked by traditional sequence homology tools. Overall, the deep learning model CAZyLingua can be applied in combination with existing tools to unravel intricate CAZyme evolutionary profiles and patterns, contributing to a more comprehensive understanding of microbial metabolic dynamics.
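As a reminder of how the F1 score combines precision and recall, the conventional definition is their harmonic mean. The sketch below assumes that definition and works from raw true-positive, false-positive and false-negative counts; the function name and the example counts are illustrative and are not taken from the CAZyLingua benchmark.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall computed from raw counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)  # fraction of predicted annotations that are correct
    recall = tp / (tp + fn)     # fraction of true annotations that were recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 5 correct annotations, 0 spurious, 5 missed.
example = f1_score(tp=5, fp=0, fn=5)
```

With a precision of 1.0 and a recall of 0.5, the harmonic mean is about 0.67 rather than the arithmetic mean of 0.75, so the score penalises a method that is strong on only one of the two measures.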
Affiliation(s)
- Kumar Thurimella
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK
  - School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Ahmed M. T. Mohamed
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Daniel B. Graham
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Róisín M. Owens
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK
- Sabina Leanti La Rosa
  - Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway
- Damian R. Plichta
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Sergio Bacallado
  - Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, UK
- Ramnik J. Xavier
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
|
28
|
Tang ST, Song XW, Chen J, Shen J, Ma B, Rosen BP, Zhang J, Zhao FJ. Widespread Distribution of the arsO Gene Confers Bacterial Resistance to Environmental Antimony. Environmental Science & Technology 2023; 57:14579-14588. [PMID: 37737118 PMCID: PMC10699511 DOI: 10.1021/acs.est.3c03458] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 09/23/2023]
Abstract
Microbial oxidation of environmental antimonite (Sb(III)) to antimonate (Sb(V)) is an antimony (Sb) detoxification mechanism. Ensifer adhaerens ST2, a bacterial isolate from a Sb-contaminated paddy soil, oxidizes Sb(III) to Sb(V) under oxic conditions by an unknown mechanism. Genomic analysis of ST2 reveals a gene of unknown function in an arsenic resistance (ars) operon that we term arsO. The transcription level of arsO was significantly upregulated by the addition of Sb(III). ArsO is predicted to be a flavoprotein monooxygenase but shows low sequence similarity to other flavoprotein monooxygenases. Expression of arsO in the arsenic-hypersensitive Escherichia coli strain AW3110Δars conferred increased resistance to Sb(III) but not arsenite (As(III)) or methylarsenite (MAs(III)). Purified ArsO catalyzes Sb(III) oxidation to Sb(V) with NADPH or NADH as the electron donor but does not oxidize As(III) or MAs(III). The purified enzyme contains flavin adenine dinucleotide (FAD) at a ratio of 0.62 mol of FAD/mol protein, and enzymatic activity was increased by addition of FAD. Bioinformatic analyses show that arsO genes are widely distributed in metagenomes from different environments and are particularly abundant in environments affected by human activities. This study demonstrates that ArsO is an environmental Sb(III) oxidase that plays a significant role in the detoxification of Sb(III).
Affiliation(s)
- Shi-Tong Tang
  - Jiangsu Key Laboratory for Organic Waste Utilization, Jiangsu Collaborative Innovation Center for Solid Organic Waste Resource Utilization, College of Resources and Environmental Sciences, Nanjing Agricultural University, Nanjing 210095, China
- Xin-Wei Song
  - Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou 310000, China
  - Hangzhou Innovation Center, Zhejiang University, Hangzhou 311200, China
- Jian Chen
  - Institute of Environmental Remediation and Human Health, College of Ecology and Environment, Southwest Forestry University, Kunming 650224, China
- Jie Shen
  - Jiangsu Key Laboratory for Organic Waste Utilization, Jiangsu Collaborative Innovation Center for Solid Organic Waste Resource Utilization, College of Resources and Environmental Sciences, Nanjing Agricultural University, Nanjing 210095, China
- Bin Ma
  - Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou 310000, China
  - Hangzhou Innovation Center, Zhejiang University, Hangzhou 311200, China
- Barry P Rosen
  - Department of Cellular Biology and Pharmacology, Herbert Wertheim College of Medicine, Florida International University, Miami, Florida 33199, United States
- Jun Zhang
  - Jiangsu Key Laboratory for Organic Waste Utilization, Jiangsu Collaborative Innovation Center for Solid Organic Waste Resource Utilization, College of Resources and Environmental Sciences, Nanjing Agricultural University, Nanjing 210095, China
- Fang-Jie Zhao
  - Jiangsu Key Laboratory for Organic Waste Utilization, Jiangsu Collaborative Innovation Center for Solid Organic Waste Resource Utilization, College of Resources and Environmental Sciences, Nanjing Agricultural University, Nanjing 210095, China
|