1
Jamshidi MB, Hoang DT, Nguyen DN, Niyato D, Warkiani ME. Revolutionizing biological digital twins: Integrating internet of bio-nano things, convolutional neural networks, and federated learning. Comput Biol Med 2025; 189:109970. [PMID: 40101583 DOI: 10.1016/j.compbiomed.2025.109970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2024] [Revised: 02/28/2025] [Accepted: 03/01/2025] [Indexed: 03/20/2025]
Abstract
Digital twins (DTs) are advancing biotechnology by providing digital models for drug discovery, digital health applications, and biological assets, including microorganisms. However, the hypothesis posits that implementing micro- and nanoscale DTs, especially for biological entities like bacteria, presents substantial challenges. These challenges stem from the complexities of data extraction, transmission, and computation, along with the necessity for a specialized Internet of Things (IoT) infrastructure. To address these challenges, this article proposes a novel framework that leverages bio-network technologies, including the Internet of Bio-Nano Things (IoBNT), and decentralized deep learning algorithms such as federated learning (FL) and convolutional neural networks (CNN). The methodology involves using CNNs for robust pattern recognition and FL to reduce bandwidth consumption while enhancing security. IoBNT devices are utilized for precise microscopic data acquisition and transmission, which ensures minimal error rates. The results demonstrate a multi-class classification accuracy of 98.7% across 33 bacteria categories, achieving over 99% bandwidth savings. Additionally, IoBNT integration reduces biological data transfer errors by up to 98%, even under worst-case conditions. This framework is further supported by an adaptable, user-friendly dashboard, expanding its applicability across pharmaceutical and biotechnology industries.
Affiliation(s)
- Mohammad Behdad Jamshidi
  - School of Electrical and Data Engineering, University of Technology Sydney, 15 Broadway, Sydney, 2007, NSW, Australia.
- Dinh Thai Hoang
  - School of Electrical and Data Engineering, University of Technology Sydney, 15 Broadway, Sydney, 2007, NSW, Australia
- Diep N Nguyen
  - School of Electrical and Data Engineering, University of Technology Sydney, 15 Broadway, Sydney, 2007, NSW, Australia
- Dusit Niyato
  - College of Computing and Data Science, Nanyang Technological University, 50 Nanyang Ave, Block N 4, Singapore, 639798, Singapore
- Majid Ebrahimi Warkiani
  - School of Biomedical Engineering, University of Technology Sydney, 15 Broadway, Sydney, 2007, NSW, Australia
2
Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic language models: opportunities and challenges. Trends Genet 2025; 41:286-302. [PMID: 39753409 DOI: 10.1016/j.tig.2024.11.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/16/2024] [Revised: 11/21/2024] [Accepted: 11/21/2024] [Indexed: 04/10/2025]
Abstract
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of natural language processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic language models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
Affiliation(s)
- Gonzalo Benegas
  - Computer Science Division, University of California, Berkeley, CA, USA
- Chengzhong Ye
  - Department of Statistics, University of California, Berkeley, CA, USA
- Carlos Albors
  - Computer Science Division, University of California, Berkeley, CA, USA
- Jianan Canal Li
  - Computer Science Division, University of California, Berkeley, CA, USA
- Yun S Song
  - Computer Science Division, University of California, Berkeley, CA, USA; Department of Statistics, University of California, Berkeley, CA, USA; Center for Computational Biology, University of California, Berkeley, CA, USA.
3
Duan C, Zang Z, Xu Y, He H, Li S, Liu Z, Lei Z, Zheng JS, Li SZ. FGeneBERT: function-driven pre-trained gene language model for metagenomics. Brief Bioinform 2025; 26:bbaf149. [PMID: 40211978 PMCID: PMC11986344 DOI: 10.1093/bib/bbaf149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2024] [Revised: 02/22/2025] [Accepted: 03/14/2025] [Indexed: 04/14/2025] Open
Abstract
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer, which limits the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle with encoding biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGeneBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGeneBERT incorporates masked gene modeling to enhance the understanding of inter-gene contextual relationships and triplet enhanced metagenomic contrastive learning to elucidate gene sequence-function relationships. Pre-trained on over 100 million metagenomic sequences, FGeneBERT demonstrates superior performance on metagenomic datasets at four levels, spanning gene, functional, bacterial, and environmental levels and ranging from 1 to 213 k input sequences. Case studies of ATP synthase and gene operons highlight FGeneBERT's capability for functional recognition and its biological relevance in metagenomic research.
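For readers unfamiliar with the k-mer representation that the abstract contrasts FGeneBERT against, the following is a minimal sketch of overlapping k-mer tokenization. The function name `kmer_tokens`, the stride of 1, and the default `k = 3` are illustrative assumptions; they are not taken from the FGeneBERT paper or its code.

```python
def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1).

    Hypothetical helper for illustration only; real tokenizers may use
    a different stride, alphabet handling, or vocabulary mapping.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


tokens = kmer_tokens("ATGCGT", k=3)
```

Each token here covers a fixed-width window of bases regardless of gene boundaries, which is the limitation the abstract attributes to k-mer approaches.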
Affiliation(s)
- Chenrui Duan
  - College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
  - School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Zelin Zang
  - Centre for Artificial Intelligence and Robotics (CAIR), HKISI-CAS Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong 310000, China
- Yongjie Xu
  - College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
  - School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Hang He
  - School of Medicine and School of Life Sciences, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Siyuan Li
  - College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
  - School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Zihan Liu
  - College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
  - School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Zhen Lei
  - Centre for Artificial Intelligence and Robotics (CAIR), HKISI-CAS Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong 310000, China
  - State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China
  - School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China
- Ju-Sheng Zheng
  - School of Medicine and School of Life Sciences, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Stan Z Li
  - School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
4
Prabakaran R, Bromberg Y. Functional profiling of the sequence stockpile: a protein pair-based assessment of in silico prediction tools. Bioinformatics 2025; 41:btaf035. [PMID: 39854283 PMCID: PMC11821270 DOI: 10.1093/bioinformatics/btaf035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2024] [Revised: 11/04/2024] [Accepted: 01/22/2025] [Indexed: 01/26/2025] Open
Abstract
MOTIVATION: In silico functional annotation of proteins is crucial to narrowing the sequencing-accelerated gap in our understanding of protein activities. Numerous function annotation methods exist, and their ranks have been growing, particularly so with the recent deep learning-based developments. However, it is unclear if these tools are truly predictive. As we are not aware of any methods that can identify new terms in functional ontologies, we ask if they can, at least, identify molecular functions of proteins that are non-homologous to or far-removed from known protein families.
RESULTS: Here, we explore the potential and limitations of the existing methods in predicting the molecular functions of thousands of such proteins. Lacking the "ground truth" functional annotations, we transformed the assessment of function prediction into evaluation of functional similarity of protein pairs that likely share function but are unlike any of the currently functionally annotated sequences. Notably, our approach transcends the limitations of functional annotation vocabularies, providing a means to assess different-ontology annotation methods. We find that most existing methods are limited to identifying functional similarity of homologous sequences and fail to predict the function of proteins lacking reference. Curiously, despite their seemingly unlimited by-homology scope, deep learning methods also have trouble capturing the functional signal encoded in protein sequence. We believe that our work will inspire the development of a new generation of methods that push boundaries and promote exploration and discovery in the molecular function domain.
AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available at https://doi.org/10.6084/m9.figshare.c.6737127.v3. The code used to compute siblings is available openly at https://bitbucket.org/bromberglab/siblings-detector/.
Affiliation(s)
- R Prabakaran
  - Department of Biology, Emory University, Atlanta, GA 30322, United States
  - Department of Computer Science, Emory University, Atlanta, GA 30322, United States
- Yana Bromberg
  - Department of Biology, Emory University, Atlanta, GA 30322, United States
  - Department of Computer Science, Emory University, Atlanta, GA 30322, United States
5
Mi K, Xu R, Liu X. RFW captures species-level metagenomic functions by integrating genome annotation information. CELL REPORTS METHODS 2024; 4:100932. [PMID: 39662474 PMCID: PMC11704624 DOI: 10.1016/j.crmeth.2024.100932] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2024] [Revised: 09/01/2024] [Accepted: 11/14/2024] [Indexed: 12/13/2024]
Abstract
Functional profiling of whole-metagenome shotgun sequencing (WMS) enables our understanding of microbe-host interactions. We demonstrate microbial functional information loss by current annotation methods at both the taxon and community levels, particularly at lower read depths. To address information loss, we develop a framework, RFW (reference-based functional profile inference on WMS), that utilizes information from genome functional annotations and taxonomic profiles to infer microbial function abundances from WMS. Furthermore, we provide an algorithm for absolute abundance change quantification between groups as part of the RFW framework. By applying RFW to several datasets related to autism spectrum disorder and colorectal cancer, we show that RFW augments downstream analyses, such as differential microbial function identification and association analysis between microbial function and host phenotype. RFW is open source and freely available at https://github.com/Xingyinliu-Lab/RFW.
Affiliation(s)
- Kai Mi
  - Department of Pathogen Biology-Microbiology Division, State Key Laboratory of Reproductive Medicine and Offspring Health, Key Laboratory of Pathogen of Jiangsu Province, Center of Global Health, Nanjing Medical University, Nanjing 211166, China
- Rui Xu
  - Department of Pathogen Biology-Microbiology Division, State Key Laboratory of Reproductive Medicine and Offspring Health, Key Laboratory of Pathogen of Jiangsu Province, Center of Global Health, Nanjing Medical University, Nanjing 211166, China
- Xingyin Liu
  - Department of Pathogen Biology-Microbiology Division, State Key Laboratory of Reproductive Medicine and Offspring Health, Key Laboratory of Pathogen of Jiangsu Province, Center of Global Health, Nanjing Medical University, Nanjing 211166, China; The Second Affiliated Hospital of Nanjing Medical University, Nanjing 211166, China.
6
Dou Z, He J, Han C, Wu X, Wan L, Yang J, Zheng Y, Gong B, Wang L. qProtein: Exploring Physical Features of Protein Thermostability Based on Structural Proteomics. J Chem Inf Model 2024; 64:7885-7894. [PMID: 39375829 DOI: 10.1021/acs.jcim.4c01303] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/09/2024]
Abstract
Thermostability, which is essential for the functional performance of enzymes, is largely determined by intramolecular physical interactions. Although many tools have been developed, existing computational methods have struggled to find the universal principles of protein thermostability. Recent advancements in structural proteomics have been driven by the introduction of deep neural networks such as AlphaFold2 and ESMFold. These innovations have enabled the characterization of protein structures with unprecedented speed and accuracy. Here, we introduce qProtein, a Python-implemented workflow designed for the quantitative analysis of physical interactions on the scale of structural proteomics. This platform accepts protein sequences as input and produces four structural features, including hydrophobic clusters, hydrogen bonds, electrostatic interactions, and disulfide bonds. To demonstrate the use of qProtein, we investigate the structural features related to protein thermostability in six glycoside hydrolase (GH) families, comprising a total of 3,811 protein structures. Our results indicate that in five enzyme families (GH11, GH12, GH5_2, GH10, and GH48), the thermophilic enzymes have a larger average area of hydrophobic clusters compared to the nonthermophilic enzymes within each family. Furthermore, our analysis of the local-structure regions reveals that the hydrophobic clusters are predominantly distributed in the distal regions of the GH11 enzymes. In addition, the average hydrophobic cluster area of the thermophilic enzymes is significantly higher than that of the nonthermophilic enzymes in the distal regions of the GH11 enzymes. Therefore, qProtein is a well-suited platform for analyzing the structural features of thermal stability at the level of structural proteomics. We provide the source code for qProtein at https://github.com/bj600800/qProtein, and the web server is available at http://qProtein.sdu.edu.cn:8888.
Affiliation(s)
- Zhixin Dou
  - State Key Laboratory of Microbial Technology, Shandong University, No. 72 Binhai Road, Qingdao 266237, P.R. China
- Jiaxin He
  - School of Computer Science and Technology, Shandong University, No. 72 Binhai Road, Qingdao 266237, P.R. China
- Chao Han
  - Shandong Key Laboratory of Agricultural Microbiology, Shandong Agricultural University, Tai'an 271018, China
- Xiuyun Wu
  - State Key Laboratory of Microbial Technology, Shandong University, No. 72 Binhai Road, Qingdao 266237, P.R. China
- Lin Wan
  - School of Software, Shandong University, Shunhua Road, Jinan 250101, P.R. China
- Jian Yang
  - School of Computer Science and Technology, Shandong University, No. 72 Binhai Road, Qingdao 266237, P.R. China
- Yanwei Zheng
  - School of Computer Science and Technology, Shandong University, No. 72 Binhai Road, Qingdao 266237, P.R. China
- Bin Gong
  - School of Software, Shandong University, Shunhua Road, Jinan 250101, P.R. China
- Lushan Wang
  - State Key Laboratory of Microbial Technology, Shandong University, No. 72 Binhai Road, Qingdao 266237, P.R. China
7
Bobbo T, Biscarini F, Yaddehige SK, Alberghini L, Rigoni D, Bianchi N, Taccioli C. Machine learning classification of archaea and bacteria identifies novel predictive genomic features. BMC Genomics 2024; 25:955. [PMID: 39402493 PMCID: PMC11472548 DOI: 10.1186/s12864-024-10832-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Accepted: 09/24/2024] [Indexed: 10/19/2024] Open
Abstract
BACKGROUND: Archaea and Bacteria are distinct domains of life that are adapted to a variety of ecological niches. Several genome-based methods have been developed for their accurate classification, yet many aspects of the specific genomic features that determine these differences are not fully understood. In this study, we used publicly available whole-genome sequences from bacteria (N = 2546) and archaea (N = 109). From these, a set of genomic features (nucleotide frequencies and proportions, coding sequences (CDS), non-coding, ribosomal and transfer RNA genes (ncRNA, rRNA, tRNA), Chargaff's, topological entropy and Shannon's entropy scores) was extracted and used as input data to develop machine learning models for the classification of archaea and bacteria.
RESULTS: The classification accuracy ranged from 0.993 (Random Forest) to 0.998 (Neural Networks). Over the four models, only 11 examples were misclassified, especially those belonging to the minority class (Archaea). From variable importance, tRNA topological and Shannon's entropy, nucleotide frequencies in tRNA, rRNA and ncRNA, CDS, tRNA and rRNA Chargaff's scores have emerged as the top discriminating factors. In particular, tRNA entropy (both topological and Shannon's) was the most important genomic feature for classification, pointing at the complex interactions between the genetic code, tRNAs and the translational machinery.
CONCLUSIONS: tRNA, rRNA and ncRNA genes emerged as the key genomic elements that underpin the classification of archaea and bacteria. In particular, higher nucleotide diversity was found in tRNA from bacteria compared to archaea. The analysis of the few classification errors reflects the complex phylogenetic relationships between bacteria, archaea and eukaryotes.
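The abstract lists Shannon's entropy among the genomic features but does not give the formula used. As a hedged illustration only, the sketch below assumes the standard definition over single-nucleotide frequencies, H = -Σ p_i log2 p_i, measured in bits; the function name `shannon_entropy` is hypothetical, and the paper's actual feature extraction may differ (for example in window size or alphabet).

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (bits) of the symbol frequencies in a sequence.

    Assumes the textbook definition; not taken from the paper's code.
    """
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0  # Convention for an empty sequence.
    # abs() avoids returning -0.0 for single-symbol sequences.
    return abs(-sum((n / total) * math.log2(n / total) for n in counts.values()))


uniform = shannon_entropy("ACGT")   # all four bases equally frequent
constant = shannon_entropy("AAAA")  # a single repeated base
```

Under this definition a sequence with four equally frequent bases reaches the maximum of 2 bits, and a homopolymer has zero entropy, so higher values correspond to greater nucleotide diversity.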
Affiliation(s)
- Tania Bobbo
  - Institute for Biomedical Technologies, National Research Council (CNR), Via Fratelli Cervi 93, Segrate (MI), 20054, Italy
- Filippo Biscarini
  - Institute of Agricultural Biology and Biotechnology, National Research Council (CNR), Via Edoardo Bassini 15, Milano, 20133, Italy.
- Sachithra K Yaddehige
  - Department of Animal Medicine, Health and Production, University of Padova, Viale dell'Universitá 16, Legnaro, 35020, Italy
- Leonardo Alberghini
  - Department of Animal Medicine, Health and Production, University of Padova, Viale dell'Universitá 16, Legnaro, 35020, Italy
- Davide Rigoni
  - Department of Pharmaceutical and Pharmacological Sciences, University of Padova, Via Francesco Marzolo 5, Padova, 35131, Italy
- Nicoletta Bianchi
  - Department of Translational Medicine, University of Ferrara, Via Luigi Borsari 46, Ferrara, 44121, Italy.
- Cristian Taccioli
  - Department of Animal Medicine, Health and Production, University of Padova, Viale dell'Universitá 16, Legnaro, 35020, Italy.
8
Karavaeva V, Sousa FL. Navigating the archaeal frontier: insights and projections from bioinformatic pipelines. Front Microbiol 2024; 15:1433224. [PMID: 39380680 PMCID: PMC11459464 DOI: 10.3389/fmicb.2024.1433224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Accepted: 08/28/2024] [Indexed: 10/10/2024] Open
Abstract
Archaea continues to be one of the least investigated domains of life, and in recent years, the advent of metagenomics has led to the discovery of many new lineages at the phylum level. For the majority, only automatic genomic annotations can provide information regarding their metabolic potential and role in the environment. Here, genomic data from 2,978 archaeal genomes was used to perform automatic annotations using bioinformatics tools, alongside synteny analysis. These automatic classifications were done to assess how good these different tools perform in relation to archaeal data. Our study revealed that even with lowered cutoffs, several functional models do not capture the recently discovered archaeal diversity. Moreover, our investigation revealed that a significant portion of archaeal genomes, approximately 42%, remain uncharacterized. In comparison, within 3,235 bacterial genomes, a diverse range of unclassified proteins is obtained, with well-studied organisms like Escherichia coli having a substantially lower proportion of uncharacterized regions, ranging from <5 to 25%, and less studied lineages being comparable to archaea with the range of 35-40% of unclassified regions. Leveraging this analysis, we were able to identify metabolic protein markers, thereby providing insights into the metabolism of the archaea in our dataset. Our findings underscore a substantial gap between automatic classification tools and the comprehensive mapping of archaeal metabolism. Despite advances in computational approaches, a significant portion of archaeal genomes remains unexplored, highlighting the need for extensive experimental validation in this domain, as well as more refined annotation methods. This study contributes to a better understanding of archaeal metabolism and underscores the importance of further research in elucidating the functional potential of archaeal genomes.
Affiliation(s)
- Val Karavaeva
  - Genome Evolution and Ecology Group, Department of Functional and Evolutionary Ecology, University of Vienna, Vienna, Austria
  - Vienna Doctoral School of Ecology and Evolution, University of Vienna, Vienna, Austria
- Filipa L. Sousa
  - Genome Evolution and Ecology Group, Department of Functional and Evolutionary Ecology, University of Vienna, Vienna, Austria
9
Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic Language Models: Opportunities and Challenges. ARXIV 2024:arXiv:2407.11435v2. [PMID: 39070037 PMCID: PMC11275703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2024]
Abstract
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of Natural Language Processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic Language Models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
Affiliation(s)
- Gonzalo Benegas
  - Computer Science Division, University of California, Berkeley
- Chengzhong Ye
  - Department of Statistics, University of California, Berkeley
- Carlos Albors
  - Computer Science Division, University of California, Berkeley
- Jianan Canal Li
  - Computer Science Division, University of California, Berkeley
- Yun S. Song
  - Computer Science Division, University of California, Berkeley
  - Department of Statistics, University of California, Berkeley
  - Center for Computational Biology, University of California, Berkeley
10
Fu Y, Yu S, Li J, Lao Z, Yang X, Lin Z. DeepMineLys: Deep mining of phage lysins from human microbiome. Cell Rep 2024; 43:114583. [PMID: 39110597 DOI: 10.1016/j.celrep.2024.114583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Revised: 06/21/2024] [Accepted: 07/19/2024] [Indexed: 09/01/2024] Open
Abstract
Vast shotgun metagenomics data remain an underutilized resource for novel enzymes. Artificial intelligence (AI) has increasingly been applied to protein mining, but its conventional performance evaluation is interpolative in nature, and these trained models often struggle to extrapolate effectively when challenged with unknown data. In this study, we present a framework (DeepMineLys [deep mining of phage lysins from human microbiome]) based on the convolutional neural network (CNN) to identify phage lysins from three human microbiome datasets. When validated with an independent dataset, our method achieved an F1-score of 84.00%, surpassing existing methods by 20.84%. We expressed 16 lysin candidates from the top 100 sequences in E. coli, confirming 11 as active. The best one displayed an activity 6.2-fold that of lysozyme derived from hen egg white, establishing it as the most potent lysin from the human microbiome. Our study also underscores several important issues when applying AI to biology questions. This framework should be applicable for mining other proteins.
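The abstract reports performance as an F1-score. As a reminder of what that metric measures, here is a minimal sketch that computes F1 from confusion-matrix counts using the standard definition, F1 = 2·TP / (2·TP + FP + FN). The function name and the example counts are hypothetical and chosen only to illustrate the arithmetic; they are not the study's data.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from raw counts."""
    denominator = 2 * tp + fp + fn
    # With no positives predicted or present, F1 is undefined; return 0.0.
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts, not taken from the paper.
example = f1_score(tp=84, fp=16, fn=16)
```

Because F1 ignores true negatives, it suits mining tasks like this one, where the target class is rare relative to the background.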
Affiliation(s)
- Yiran Fu
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Shuting Yu
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Jianfeng Li
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Zisha Lao
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Xiaofeng Yang
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China.
- Zhanglin Lin
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China.
11
Mendoza-Revilla J, Trop E, Gonzalez L, Roller M, Dalla-Torre H, de Almeida BP, Richard G, Caton J, Lopez Carranza N, Skwark M, Laterre A, Beguir K, Pierrot T, Lopez M. A foundational large language model for edible plant genomes. Commun Biol 2024; 7:835. [PMID: 38982288 PMCID: PMC11233511 DOI: 10.1038/s42003-024-06465-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 06/17/2024] [Indexed: 07/11/2024] Open
Abstract
Significant progress has been made in the field of plant genomics, as demonstrated by the increased use of high-throughput methodologies that enable the characterization of multiple genome-wide molecular phenotypes. These findings have provided valuable insights into plant traits and their underlying genetic mechanisms, particularly in model plant species. Nonetheless, effectively leveraging them to make accurate predictions represents a critical step in crop genomic improvement. We present AgroNT, a foundational large language model trained on genomes from 48 plant species with a predominant focus on crop species. We show that AgroNT can obtain state-of-the-art predictions for regulatory annotations, promoter/terminator strength, tissue-specific gene expression, and prioritize functional variants. We conduct a large-scale in silico saturation mutagenesis analysis on cassava to evaluate the regulatory impact of over 10 million mutations and provide their predicted effects as a resource for variant characterization. Finally, we propose the use of the diverse datasets compiled here as the Plants Genomic Benchmark (PGB), providing a comprehensive benchmark for deep learning-based methods in plant genomic research. The pre-trained AgroNT model is publicly available on HuggingFace at https://huggingface.co/InstaDeepAI/agro-nucleotide-transformer-1b for future research purposes.
12
Qiu Z, Zhu Y, Zhang Q, Qiao X, Mu R, Xu Z, Yan Y, Wang F, Zhang T, Zhuang WQ, Yu K. Unravelling biosynthesis and biodegradation potentials of microbial dark matters in hypersaline lakes. ENVIRONMENTAL SCIENCE AND ECOTECHNOLOGY 2024; 20:100359. [PMID: 39221074 PMCID: PMC11361885 DOI: 10.1016/j.ese.2023.100359] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/12/2023] [Revised: 11/26/2023] [Accepted: 11/26/2023] [Indexed: 09/04/2024]
Abstract
Biosynthesis and biodegradation of microorganisms critically underpin the development of biotechnology, new drugs and therapies, and environmental remediation. However, most uncultured microbial species along with their metabolic capacities in extreme environments, remain obscured. Here we unravel the metabolic potential of microbial dark matters (MDMs) in four deep-inland hypersaline lakes in Xinjiang, China. Utilizing metagenomic binning, we uncovered a rich diversity of 3030 metagenome-assembled genomes (MAGs) across 82 phyla, revealing a substantial portion, 2363 MAGs, as previously unclassified at the genus level. These unknown MAGs displayed unique distribution patterns across different lakes, indicating a strong correlation with varied physicochemical conditions. Our analysis revealed an extensive array of 9635 biosynthesis gene clusters (BGCs), with a remarkable 9403 being novel, suggesting untapped biotechnological potential. Notably, some MAGs from potentially new phyla exhibited a high density of these BGCs. Beyond biosynthesis, our study also identified novel biodegradation pathways, including dehalogenation, anaerobic ammonium oxidation (Anammox), and degradation of polycyclic aromatic hydrocarbons (PAHs) and plastics, in previously unknown microbial clades. These findings significantly enrich our understanding of biosynthesis and biodegradation processes and open new avenues for biotechnological innovation, emphasizing the untapped potential of microbial diversity in hypersaline environments.
Affiliation(s)
- Zhiguang Qiu
- School of Environment and Energy, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, 518055, China
- Yuanyuan Zhu
- School of Environment and Energy, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
- Qing Zhang
- School of Environment and Energy, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
- Xuejiao Qiao
- School of Environment and Energy, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
- Rong Mu
- School of Environment and Energy, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
- Zheng Xu
- Southern University of Sciences and Technology Yantian Hospital, Shenzhen, 518081, China
- Institute of Biomedicine and Biotechnology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Yan Yan
- State Key Laboratory of Isotope Geochemistry, CAS Center for Excellence in Deep Earth Science, Guangzhou Institute of Geochemistry, Chinese Academy of Sciences, Guangzhou, 510640, China
- Fan Wang
- School of Atmospheric Sciences, Sun Yat-sen University, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai, 519082, China
- Tong Zhang
- Department of Civil Engineering, University of Hong Kong, 999077, Hong Kong, China
- Wei-Qin Zhuang
- Department of Civil and Environmental Engineering, Faculty of Engineering, University of Auckland, New Zealand
- Ke Yu
- School of Environment and Energy, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, 518055, China
13
|
Urhan A, Cosma BM, Earl AM, Manson AL, Abeel T. SAFPred: synteny-aware gene function prediction for bacteria using protein embeddings. Bioinformatics 2024; 40:btae328. [PMID: 38775729 PMCID: PMC11147799 DOI: 10.1093/bioinformatics/btae328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Revised: 04/08/2024] [Accepted: 05/21/2024] [Indexed: 06/04/2024]
Abstract
MOTIVATION: Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. Recently, transformer-based language models, adopted from the natural language processing field, have been used to obtain new representations of proteins to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications to bacterial genomes. RESULTS: To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFPred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAFPred to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health. AVAILABILITY AND IMPLEMENTATION: https://github.com/AbeelLab/safpred.
Affiliation(s)
- Aysun Urhan
- Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Delft XE 2628, The Netherlands
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
- Bianca-Maria Cosma
- Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Delft XE 2628, The Netherlands
- Ashlee M Earl
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
- Abigail L Manson
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
- Thomas Abeel
- Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Delft XE 2628, The Netherlands
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
14
|
Quan P, Li X, Si Y, Sun L, Ding FF, Fan Y, Liu H, Wei C, Li R, Zhao X, Yang F, Yao L. Single cell analysis reveals the roles and regulatory mechanisms of type-I interferons in Parkinson's disease. Cell Commun Signal 2024; 22:212. [PMID: 38566100 PMCID: PMC10985960 DOI: 10.1186/s12964-024-01590-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Accepted: 03/23/2024] [Indexed: 04/04/2024]
Abstract
The pathogenesis of Parkinson's disease (PD) is strongly associated with neuroinflammation, and type I interferons (IFN-I) play a crucial role in regulating immune and inflammatory responses. However, the specific features of IFN in different cell types and the underlying mechanisms of PD have yet to be fully described. In this study, we analyzed the GSE157783 dataset, which includes 39,024 single-cell RNA sequencing results for five PD patients and six healthy controls from the Gene Expression Omnibus database. After cell type annotation, we intersected differentially expressed genes in each cell subcluster with genes collected in The Interferome database to generate an IFN-I-stimulated gene set (ISGs). Based on this gene set, we used the R package AUCell to score each cell, representing the IFN-I activity. Additionally, we performed monocle trajectory analysis and single-cell regulatory network inference and clustering (SCENIC) to uncover the underlying mechanisms. In silico gene perturbation and subsequent experiments confirmed NFATc2 regulation of the type I interferon response and neuroinflammation. Our analysis revealed that microglia, endothelial cells, and pericytes exhibited the highest activity of IFN-I. Furthermore, single-cell trajectory detection demonstrated that microglia in the midbrain of PD patients were in a pro-inflammatory activation state, which was also validated in the 1-Methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP)-induced PD mouse model. We identified the transcription factor NFATc2, which was significantly up-regulated and involved in the expression of ISGs and activation of microglia in PD. In the 1-Methyl-4-phenylpyridinium (MPP+)-induced BV2 cell model, the suppression of NFATc2 resulted in a reduction in IFN-β levels, impeding the phosphorylation of STAT1 and attenuating the activation of the NF-κB pathway. Furthermore, the downregulation of NFATc2 mitigated the detrimental effects on SH-SY5Y cells co-cultured in conditioned medium. Our study highlights the critical role of microglia in type I interferon responses in PD. Additionally, we identified the transcription factor NFATc2 as a key regulator of aberrant type I interferon responses and microglial pro-inflammatory activation in PD. These findings provide new insights into the pathogenesis of PD and may have implications for the development of novel therapeutic strategies.
Affiliation(s)
- Pusheng Quan
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Department of Neurology, The Affiliated Hospital of Inner Mongolia Medical University, Hohhot, China
- Xueying Li
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Yao Si
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Linlin Sun
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Fei Fan Ding
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Yuwei Fan
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Han Liu
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Chengqun Wei
- Department of General Practice, Heilongjiang Provincial Hospital, Harbin, China
- Ruihua Li
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Xue Zhao
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Fan Yang
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Lifen Yao
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
15
|
Chen K, Zhou Y, Ding M, Wang Y, Ren Z, Yang Y. Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction. Brief Bioinform 2024; 25:bbae163. [PMID: 38605640 PMCID: PMC11009468 DOI: 10.1093/bib/bbae163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 02/22/2024] [Accepted: 03/19/2024] [Indexed: 04/13/2024]
Abstract
Language models pretrained by self-supervised learning (SSL) have been widely utilized to study protein sequences, while few models were developed for genomic sequences and were limited to single species. Due to the lack of genomes from different species, these models cannot effectively leverage evolutionary information. In this study, we have developed SpliceBERT, a language model pretrained on primary ribonucleic acid (RNA) sequences from 72 vertebrates by masked language modeling, and applied it to sequence-based modeling of RNA splicing. Pretraining SpliceBERT on diverse species enables effective identification of evolutionarily conserved elements. Meanwhile, the learned hidden states and attention weights can characterize the biological properties of splice sites. As a result, SpliceBERT was shown to be effective on several downstream tasks: zero-shot prediction of variant effects on splicing, prediction of branchpoints in humans, and cross-species prediction of splice sites. Our study highlighted the importance of pretraining genomic language models on a diverse range of species and suggested that SSL is a promising approach to enhance our understanding of the regulatory logic underlying genomic sequences.
Affiliation(s)
- Ken Chen
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yue Zhou
- Peng Cheng Laboratory, Shenzhen, China
- Maolin Ding
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yu Wang
- Peng Cheng Laboratory, Shenzhen, China
- Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education, China
16
|
Ligeti B, Szepesi-Nagy I, Bodnár B, Ligeti-Nagy N, Juhász J. ProkBERT family: genomic language models for microbiome applications. Front Microbiol 2024; 14:1331233. [PMID: 38282738 PMCID: PMC10810988 DOI: 10.3389/fmicb.2023.1331233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 12/11/2023] [Indexed: 01/30/2024]
Abstract
Background: In the evolving landscape of microbiology and microbiome analysis, the integration of machine learning is crucial for understanding complex microbial interactions, and predicting and recognizing novel functionalities within extensive datasets. However, the effectiveness of these methods in microbiology faces challenges due to the complex and heterogeneous nature of microbial data, further complicated by low signal-to-noise ratios, context-dependency, and a significant shortage of appropriately labeled datasets. This study introduces the ProkBERT model family, a collection of large language models designed for genomic tasks. It provides a generalizable sequence representation for nucleotide sequences, learned from unlabeled genome data. This approach helps overcome the above-mentioned limitations in the field, thereby improving our understanding of microbial ecosystems and their impact on health and disease. Methods: ProkBERT models are based on transfer learning and self-supervised methodologies, enabling them to use the abundant yet complex microbial data effectively. The introduction of the novel Local Context-Aware (LCA) tokenization technique marks a significant advancement, allowing ProkBERT to overcome the contextual limitations of traditional transformer models. This methodology not only retains rich local context but also demonstrates remarkable adaptability across various bioinformatics tasks. Results: In practical applications such as promoter prediction and phage identification, the ProkBERT models show superior performance. For promoter prediction tasks, the top-performing model achieved a Matthews Correlation Coefficient (MCC) of 0.74 for E. coli and 0.62 in mixed-species contexts. In phage identification, ProkBERT models consistently outperformed established tools like VirSorter2 and DeepVirFinder, achieving an MCC of 0.85. These results underscore the models' exceptional accuracy and generalizability in both supervised and unsupervised tasks. Conclusions: The ProkBERT model family is a compact yet powerful tool in the field of microbiology and bioinformatics. Its capacity for rapid, accurate analyses and its adaptability across a spectrum of tasks mark a significant advancement in machine learning applications in microbiology. The models are available on GitHub (https://github.com/nbrg-ppcu/prokbert) and HuggingFace (https://huggingface.co/nerualbioinfo), providing an accessible tool for the community.
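For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Matthews Correlation Coefficient is computed from a binary confusion matrix. The function name `mcc` and the counts in the usage line are hypothetical illustrations, not values or code from the ProkBERT study.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix.

    Returns a value in [-1, 1]; by convention 0.0 is returned when any
    marginal total is zero and the coefficient is undefined.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical counts, chosen only to illustrate the call.
print(round(mcc(tp=90, tn=85, fp=15, fn=10), 3))
```

Unlike plain accuracy, the coefficient uses all four cells of the confusion matrix, which is why it is often preferred for imbalanced tasks such as promoter or phage detection.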
Affiliation(s)
- Balázs Ligeti
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- István Szepesi-Nagy
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- Babett Bodnár
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- Noémi Ligeti-Nagy
- Language Technology Research Group, HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary
- János Juhász
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- Institute of Medical Microbiology, Semmelweis University, Budapest, Hungary
17
|
McGuinness KN, Fehon N, Feehan R, Miller M, Mutter AC, Rybak LA, Nam J, AbuSalim JE, Atkinson JT, Heidari H, Losada N, Kim JD, Koder RL, Lu Y, Silberg JJ, Slusky JSG, Falkowski PG, Nanda V. The energetics and evolution of oxidoreductases in deep time. Proteins 2024; 92:52-59. [PMID: 37596815 DOI: 10.1002/prot.26563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 07/06/2023] [Indexed: 08/20/2023]
Abstract
The core metabolic reactions of life drive electrons through a class of redox protein enzymes, the oxidoreductases. The energetics of electron flow is determined by the redox potentials of organic and inorganic cofactors as tuned by the protein environment. Understanding how protein structure affects oxidation-reduction energetics is crucial for studying metabolism, creating bioelectronic systems, and tracing the history of biological energy utilization on Earth. We constructed ProtReDox (https://protein-redox-potential.web.app), a manually curated database of experimentally determined redox potentials. With over 500 measurements, we can begin to identify how proteins modulate oxidation-reduction energetics across the tree of life. By mapping redox potentials onto networks of oxidoreductase fold evolution, we can infer the evolution of electron transfer energetics over deep time. ProtReDox is designed to include user-contributed submissions with the intention of making it a valuable resource for researchers in this field.
Affiliation(s)
- Kenneth N McGuinness
- Department of Natural Sciences, Caldwell University, Caldwell, New Jersey, USA
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Nolan Fehon
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Ryan Feehan
- Computational Biology Program, The University of Kansas, Lawrence, Kansas, USA
- Michelle Miller
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Andrew C Mutter
- Department of Physics, The City College of New York, New York, New York, USA
- Laryssa A Rybak
- Department of Physics, The City College of New York, New York, New York, USA
- Justin Nam
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Jenna E AbuSalim
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Joshua T Atkinson
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, USA
- Hirbod Heidari
- Department of Chemistry, University of Texas at Austin, Austin, Texas, USA
- Natalie Losada
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- J Dongun Kim
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Ronald L Koder
- Department of Physics, The City College of New York, New York, New York, USA
- Yi Lu
- Department of Chemistry, University of Texas at Austin, Austin, Texas, USA
- Jonathan J Silberg
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, USA
- Joanna S G Slusky
- Computational Biology Program, The University of Kansas, Lawrence, Kansas, USA
- Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas, USA
- Paul G Falkowski
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Department of Earth and Planetary Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Vikas Nanda
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers University, Piscataway, New Jersey, USA
18
|
Aliperti L, Aptekmann AA, Farfañuk G, Couso LL, Soler-Bistué A, Sánchez IE. r/K selection of GC content in prokaryotes. Environ Microbiol 2023; 25:3255-3268. [PMID: 37813828 DOI: 10.1111/1462-2920.16511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Accepted: 09/16/2023] [Indexed: 10/11/2023]
Abstract
The guanine/cytosine (GC) content of prokaryotic genomes is species-specific, taking values from 16% to 77%. This diversity of selection for GC content remains contentious. We analyse the correlations between GC content and a range of phenotypic and genotypic data in thousands of prokaryotes. GC content integrates well with these traits into r/K selection theory when phenotypic plasticity is considered. High GC-content prokaryotes are r-strategists with cheaper descendants thanks to a lower average amino acid metabolic cost, colonize unstable environments thanks to flagella and a bacillus form and are generalists in terms of resource opportunism and their defence mechanisms. Low GC content prokaryotes are K-strategists specialized for stable environments that maintain homeostasis via a high-cost outer cell membrane and endospore formation as a response to nutrient deprivation, and attain a higher nutrient-to-biomass yield. The lower proteome cost of high GC content prokaryotes is driven by the association between GC-rich codons and cheaper amino acids in the genetic code, while the correlation between GC content and genome size may be partly due to functional diversity driven by r/K selection. In all, molecular diversity in the GC content of prokaryotes may be a consequence of ecological r/K selection.
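As a small illustration of the quantity this abstract analyses, GC content is the share of guanine and cytosine among the bases of a sequence. The sketch below assumes a plain nucleotide string as input and ignores ambiguous symbols such as N; the function name `gc_content` is hypothetical and the code is not taken from the study.

```python
def gc_content(sequence: str) -> float:
    """Return GC content as a percentage of the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


# Toy sequence for illustration only: 4 of 6 bases are G or C.
print(f"{gc_content('ATGCGC'):.1f}%")
```

Applied to a whole genome rather than a toy string, the same ratio gives the species-level values that the abstract reports as ranging from 16% to 77%.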
Affiliation(s)
- Lucio Aliperti
- Facultad de Ciencias Exactas y Naturales. Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires, Argentina
- Ariel A Aptekmann
- Marine and Coastal Sciences Department, Rutgers University, New Brunswick, New Jersey, USA
- Gonzalo Farfañuk
- Facultad de Ciencias Exactas y Naturales. Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires, Argentina
- Luciana L Couso
- Facultad de Agronomía, Cátedra de Genética, Universidad de Buenos Aires, Buenos Aires, Argentina
- Alfonso Soler-Bistué
- Instituto de Investigaciones Biotecnológicas Dr. Rodolfo A. Ugalde, CONICET, Universidad Nacional de San Martín, San Martin, Argentina
- Ignacio E Sánchez
- Facultad de Ciencias Exactas y Naturales. Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires, Argentina
19
|
Marcos-Zambrano LJ, López-Molina VM, Bakir-Gungor B, Frohme M, Karaduzovic-Hadziabdic K, Klammsteiner T, Ibrahimi E, Lahti L, Loncar-Turukalo T, Dhamo X, Simeon A, Nechyporenko A, Pio G, Przymus P, Sampri A, Trajkovik V, Lacruz-Pleguezuelos B, Aasmets O, Araujo R, Anagnostopoulos I, Aydemir Ö, Berland M, Calle ML, Ceci M, Duman H, Gündoğdu A, Havulinna AS, Kaka Bra KHN, Kalluci E, Karav S, Lode D, Lopes MB, May P, Nap B, Nedyalkova M, Paciência I, Pasic L, Pujolassos M, Shigdel R, Susín A, Thiele I, Truică CO, Wilmes P, Yilmaz E, Yousef M, Claesson MJ, Truu J, Carrillo de Santa Pau E. A toolbox of machine learning software to support microbiome analysis. Front Microbiol 2023; 14:1250806. [PMID: 38075858 PMCID: PMC10704913 DOI: 10.3389/fmicb.2023.1250806] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/30/2023] [Accepted: 09/11/2023] [Indexed: 05/14/2025]
Abstract
The human microbiome has become an area of intense research due to its potential impact on human health. However, the analysis and interpretation of microbiome data have proven to be challenging due to its complexity and high dimensionality. Machine learning (ML) algorithms can process vast amounts of data to uncover informative patterns and relationships within the data, even with limited prior knowledge. Therefore, there has been rapid growth in the development of software specifically designed for the analysis and interpretation of microbiome data using ML techniques. These software tools incorporate a wide range of ML algorithms for clustering, classification, regression, or feature selection, to identify microbial patterns and relationships within the data and generate predictive models. This rapid development, with its constant need for new tools and for the integration of new features, requires efforts to compile, catalog, and classify these tools in order to create infrastructures and services with easy, transparent, and trustworthy standards. Here we review the state of the art for ML tools applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on ML-based software and framework resources currently available for the analysis of microbiome data in humans. The aim is to help microbiologists and biomedical scientists go deeper into specialized resources that integrate ML techniques, and to facilitate future benchmarking to create standards for the analysis of microbiome data. The software resources are organized based on the type of analysis they were developed for and the ML techniques they implement. A description of each software tool with examples of usage is provided, including comments about pitfalls and gaps in the usage of ML-based software for microbiome data that need to be considered by developers and users. This review represents an extensive compilation to date, offering valuable insights and guidance for researchers interested in leveraging ML approaches for microbiome analysis.
Affiliation(s)
- Laura Judith Marcos-Zambrano
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Víctor Manuel López-Molina
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Burcu Bakir-Gungor
- Department of Computer Engineering, Abdullah Gül University, Kayseri, Türkiye
- Marcus Frohme
- Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
- Thomas Klammsteiner
- Department of Microbiology and Department of Ecology, University of Innsbruck, Innsbruck, Austria
- Eliana Ibrahimi
- Department of Biology, University of Tirana, Tirana, Albania
- Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
- Xhilda Dhamo
- Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Andrea Simeon
- BioSense Institute, University of Novi Sad, Novi Sad, Serbia
- Alina Nechyporenko
- Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
- Department of Systems Engineering, Kharkiv National University of Radioelectronics, Kharkiv, Ukraine
- Gianvito Pio
- Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
- Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy
- Piotr Przymus
- Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Alexia Sampri
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, United Kingdom
- Vladimir Trajkovik
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Blanca Lacruz-Pleguezuelos
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Oliver Aasmets
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Ricardo Araujo
- Nephrology and Infectious Diseases R & D Group, i3S—Instituto de Investigação e Inovação em Saúde; INEB—Instituto de Engenharia Biomédica, Universidade do Porto, Porto, Portugal
- Ioannis Anagnostopoulos
- Department of Informatics, University of Piraeus, Piraeus, Greece
- Computer Science and Biomedical Informatics Department, University of Thessaly, Lamia, Greece
- Önder Aydemir
- Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Türkiye
- Magali Berland
- INRAE, MetaGenoPolis, Université Paris-Saclay, Jouy-en-Josas, France
- M. Luz Calle
- Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain
- IRIS-CC, Fundació Institut de Recerca i Innovació en Ciències de la Vida i la Salut a la Catalunya Central, Vic, Barcelona, Spain
- Michelangelo Ceci
- Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
- Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy
- Hatice Duman
- Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye
- Aycan Gündoğdu
- Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Türkiye
- Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Türkiye
- Aki S. Havulinna
- Finnish Institute for Health and Welfare - THL, Helsinki, Finland
- Institute for Molecular Medicine Finland, FIMM-HiLIFE, Helsinki, Finland
- Eglantina Kalluci
- Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Sercan Karav
- Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye
- Daniel Lode
- Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
- Marta B. Lopes
- Department of Mathematics, Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
- UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
- Patrick May
- Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Bram Nap
- School of Medicine, University of Galway, Galway, Ireland
- Miroslava Nedyalkova
- Department of Inorganic Chemistry, Faculty of Chemistry and Pharmacy, University of Sofia, Sofia, Bulgaria
- Inês Paciência
- Center for Environmental and Respiratory Health Research (CERH), Research Unit of Population Health, University of Oulu, Oulu, Finland
- Biocenter Oulu, University of Oulu, Oulu, Finland
- Lejla Pasic
- Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Meritxell Pujolassos
- Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain
- Rajesh Shigdel
- Department of Clinical Science, University of Bergen, Bergen, Norway
- Antonio Susín
- Mathematical Department, UPC-Barcelona Tech, Barcelona, Spain
- Ines Thiele
- School of Medicine, University of Galway, Galway, Ireland
- APC Microbiome Ireland, University College Cork, Cork, Ireland
- Ciprian-Octavian Truică
- Computer Science and Engineering Department, Faculty of Automatic Control and Computers, National University of Science and Technology Politehnica, Bucharest, Romania
- Paul Wilmes
- Systems Ecology Group, Luxembourg Centre for Systems Biomedicine, Esch-sur-Alzette, Luxembourg
- Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg, Belvaux, Luxembourg
- Ercument Yilmaz
- Department of Computer Technologies, Karadeniz Technical University, Trabzon, Türkiye
- Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
- Marcus Joakim Claesson
- APC Microbiome Ireland, University College Cork, Cork, Ireland
- School of Microbiology, University College Cork, Cork, Ireland
- Jaak Truu
- Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
| | | |
|
20
|
Ma B, Lu C, Wang Y, Yu J, Zhao K, Xue R, Ren H, Lv X, Pan R, Zhang J, Zhu Y, Xu J. A genomic catalogue of soil microbiomes boosts mining of biodiversity and genetic resources. Nat Commun 2023; 14:7318. [PMID: 37951952 PMCID: PMC10640626 DOI: 10.1038/s41467-023-43000-z] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 06/08/2023] [Accepted: 10/27/2023] [Indexed: 11/14/2023] Open
Abstract
Soil harbors a vast expanse of unidentified microbes, termed microbial dark matter, an untapped reservoir of microbial biodiversity and genetic resources that has yet to be fully explored. In this study, we conduct a large-scale excavation of soil microbial dark matter by reconstructing 40,039 metagenome-assembled genome bins (the SMAG catalogue) from 3304 soil metagenomes. We identify 16,530 of 21,077 species-level genome bins (SGBs) as unknown SGBs (uSGBs), which expand archaeal and bacterial diversity across the tree of life. We also illustrate the pivotal role of uSGBs in augmenting the soil microbiome's functional landscape and intra-species genome diversity, providing large proportions of the 43,169 biosynthetic gene clusters and 8545 CRISPR-Cas genes. Additionally, we determine that uSGBs contributed 84.6% of previously unexplored viral-host associations from the SMAG catalogue. The SMAG catalogue provides a useful genomic resource for further studies investigating soil microbial biodiversity and genetic resources.
Affiliation(s)
- Bin Ma
- Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
| | - Caiyu Lu
- Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
| | - Yiling Wang
- Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
| | - Jingwen Yu
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
| | - Kankan Zhao
- Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China
| | - Ran Xue
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
| | - Hao Ren
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
| | - Xiaofei Lv
- Department of Environmental Engineering, China Jiliang University, Hangzhou, 310018, China
| | - Ronghui Pan
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
| | - Jiabao Zhang
- State Key Laboratory of Soil and Sustainable Agriculture, Institute of Soil Science, Chinese Academy of Sciences, Nanjing, 210008, China
| | - Yongguan Zhu
- Research Center for Eco-environmental Sciences, Chinese Academy of Sciences, Beijing, 100085, China
| | - Jianming Xu
- Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China.
- Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China.
| |
|
21
|
Benegas G, Batra SS, Song YS. DNA language models are powerful predictors of genome-wide variant effects. Proc Natl Acad Sci U S A 2023; 120:e2311219120. [PMID: 37883436 PMCID: PMC10622914 DOI: 10.1073/pnas.2311219120] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 07/03/2023] [Accepted: 09/08/2023] [Indexed: 10/28/2023] Open
Abstract
The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need for accurate, scalable computational methods to predict the effects of genetic variants across the entire genome. Inspired by recent progress in natural language processing, unsupervised pretraining on large protein sequence databases has proven successful in extracting complex information related to proteins. These models showcase their ability to learn variant effects in coding regions using an unsupervised approach. Expanding on this idea, we here introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects through unsupervised pretraining on genomic DNA sequences. Our model also successfully learns gene structure and DNA motifs without any supervision. To demonstrate its utility, we train GPN on unaligned reference genomes of Arabidopsis thaliana and seven related species within the Brassicales order and evaluate its ability to predict the functional impact of genetic variants in A. thaliana by utilizing allele frequencies from the 1001 Genomes Project and a comprehensive database of GWAS. Notably, GPN outperforms predictors based on popular conservation scores such as phyloP and phastCons. Our predictions for A. thaliana can be visualized as sequence logos in the UCSC Genome Browser (https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis). We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone, enabling unsupervised prediction of variant effects across the entire genome.
Affiliation(s)
- Gonzalo Benegas
- Graduate Group in Computational Biology, University of California, Berkeley, CA 94720
| | | | - Yun S. Song
- Computer Science Division, University of California, Berkeley, CA 94720
- Department of Statistics, University of California, Berkeley, CA 94720
- Center for Computational Biology, University of California, Berkeley, CA 94720
| |
|
22
|
Mahlich Y, Zhu C, Chung H, Velaga PK, De Paolis Kaluza M, Radivojac P, Friedberg I, Bromberg Y. Learning from the unknown: exploring the range of bacterial functionality. Nucleic Acids Res 2023; 51:10162-10175. [PMID: 37739408 PMCID: PMC10602916 DOI: 10.1093/nar/gkad757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2023] [Accepted: 09/11/2023] [Indexed: 09/24/2023] Open
Abstract
Determining the repertoire of a microbe's molecular functions is a central question in microbial biology. Modern techniques achieve this goal by comparing microbial genetic material against reference databases of functionally annotated genes/proteins or known taxonomic markers such as 16S rRNA. Here, we describe a novel approach to exploring bacterial functional repertoires without reference databases. Our Fusion scheme establishes functional relationships between bacteria and assigns organisms to Fusion-taxa that differ from otherwise defined taxonomic clades. Three key findings of our work stand out. First, bacterial functional comparisons outperform marker genes in assigning taxonomic clades. Fusion profiles are also better for this task than other functional annotation schemes. Second, Fusion-taxa are robust to addition of novel organisms and are, arguably, able to capture the environment-driven bacterial diversity. Finally, our alignment-free nucleic acid-based Siamese Neural Network model, created using Fusion functions, enables finding shared functionality of very distant, possibly structurally different, microbial homologs. Our work can thus help annotate functional repertoires of bacterial organisms and further guide our understanding of microbial communities.
Affiliation(s)
- Yannick Mahlich
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
| | - Chengsheng Zhu
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- Xbiome Inc., 1 Broadway, 14th fl, Cambridge, MA 02142, USA
| | - Henri Chung
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
- Interdepartmental program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA 50011, USA
| | - Pavan K Velaga
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
| | - M Clara De Paolis Kaluza
- Khoury College of Computer Sciences, Northeastern University, 177 Huntington Avenue, Boston, MA 02115, USA
| | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, 177 Huntington Avenue, Boston, MA 02115, USA
| | - Iddo Friedberg
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
- Interdepartmental program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA 50011, USA
| | - Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- Department of Biology, Emory University, 1510 Clifton Road NE, Atlanta, GA 30322, USA
- Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA 30322, USA
| |
|
23
|
Medina-Chávez NO, Viladomat-Jasso M, Zarza E, Islas-Robles A, Valdivia-Anistro J, Thalasso-Siret F, Eguiarte LE, Olmedo-Álvarez G, Souza V, De la Torre-Zavala S. A Transiently Hypersaline Microbial Mat Harbors a Diverse and Stable Archaeal Community in the Cuatro Cienegas Basin, Mexico. ASTROBIOLOGY 2023; 23:796-811. [PMID: 37279013 DOI: 10.1089/ast.2021.0047] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/07/2023]
Abstract
Microbial mats are biologically diverse communities that are analogs to some of the earliest ecosystems on Earth. In this study, we describe a unique transiently hypersaline microbial mat uncovered in a shallow pond within the Cuatro Cienegas Basin (CCB) in northern México. The CCB is an endemism-rich site that harbors living stromatolites that have been studied to understand the conditions of the Precambrian Earth. These microbial mats form elastic domes filled with biogenic gas, and the mats have a relatively large and stable subpopulation of archaea. For this reason, this site has been termed archaean domes (AD). The AD microbial community was analyzed by metagenomics over three seasons. The mat exhibited a highly diverse prokaryotic community dominated by bacteria. Bacterial sequences are represented in 37 phyla, mainly Proteobacteria, Firmicutes, and Actinobacteria, that together comprised >50% of the sequences from the mat. Archaea represented up to 5% of the retrieved sequences, with up to 230 different archaeal species that belong to 5 phyla (Euryarchaeota, Crenarchaeota, Thaumarchaeota, Korarchaeota, and Nanoarchaeota). The archaeal taxa showed low variation despite fluctuations in water and nutrient availability. In addition, predicted functions highlight stress responses to extreme conditions present in the AD, including salinity, pH, and water/drought fluctuation. The observed complexity of the AD mat thriving in high pH and fluctuating water and salt conditions within the CCB provides an extant model of great value for evolutionary studies, as well as a suitable analog to the early Earth and Mars.
Affiliation(s)
- Nahui-Olin Medina-Chávez
- Ecology, Evolution and Behavior, University of Minnesota, St. Paul, Minnesota, USA
- Universidad Autónoma de Nuevo León, Facultad de Ciencias Biológicas, Instituto de Biotecnología, San Nicolás de los Garza, México
| | | | - Eugenia Zarza
- Departamento de Ciencias de la Sustentabilidad, El Colegio de la Frontera Sur, Tapachula, Mexico
- Consejo Nacional de Ciencia y Tecnología, Ciudad de México, México
| | - Africa Islas-Robles
- Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados del I.P.N. Campus Irapuato, Irapuato, México
| | - Jorge Valdivia-Anistro
- Unidad Multidisciplinaria de Investigación Experimental Zaragoza, Facultad de Estudios Superiores Zaragoza, UNAM, Ciudad de México, México
| | - Frédéric Thalasso-Siret
- Departamento de Biotecnología y Bioingeniería, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional, Ciudad de México, Mexico
| | - Luis E Eguiarte
- Departamento de Ecología Evolutiva, Instituto de Ecología, UNAM, Ciudad de México, México
- Centro de Estudios del Cuaternario de Fuego-Patagonia y Antártica (CEQUA), Punta Arenas, Chile
| | - Gabriela Olmedo-Álvarez
- Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados del I.P.N. Campus Irapuato, Irapuato, México
| | - Valeria Souza
- Departamento de Ecología Evolutiva, Instituto de Ecología, UNAM, Ciudad de México, México
- Centro de Estudios del Cuaternario de Fuego-Patagonia y Antártica (CEQUA), Punta Arenas, Chile
| | - Susana De la Torre-Zavala
- Universidad Autónoma de Nuevo León, Facultad de Ciencias Biológicas, Instituto de Biotecnología, San Nicolás de los Garza, México
| |
|
24
|
Lu J, Xiong R, Tian J, Wang C, Sun F. Deep learning to estimate lithium-ion battery state of health without additional degradation experiments. Nat Commun 2023; 14:2760. [PMID: 37179411 PMCID: PMC10183024 DOI: 10.1038/s41467-023-38458-w] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 09/16/2022] [Accepted: 05/03/2023] [Indexed: 05/15/2023] Open
Abstract
State of health is a critical state which evaluates the degradation level of batteries. However, it cannot be measured directly but requires estimation. While accurate state of health estimation has progressed markedly, the time- and resource-consuming degradation experiments to generate target battery labels hinder the development of state of health estimation methods. In this article, we design a deep-learning framework to enable the estimation of battery state of health in the absence of target battery labels. This framework integrates a swarm of deep neural networks equipped with domain adaptation to produce accurate estimation. We employ 65 commercial batteries from 5 different manufacturers to generate 71,588 samples for cross-validation. The validation results indicate that the proposed framework can ensure absolute errors of less than 3% for 89.4% of samples (less than 5% for 98.9% of samples), with a maximum absolute error of less than 8.87% in the absence of target labels. This work emphasizes the power of deep learning in precluding degradation experiments and highlights the promise of rapid development of battery management algorithms for new-generation batteries using only previous experimental data.
Affiliation(s)
- Jiahuan Lu
- Department of Vehicle Engineering, School of Mechanical Engineering, Beijing Institute of Technology, Beijing, 100081, China
| | - Rui Xiong
- Department of Vehicle Engineering, School of Mechanical Engineering, Beijing Institute of Technology, Beijing, 100081, China.
| | - Jinpeng Tian
- Department of Vehicle Engineering, School of Mechanical Engineering, Beijing Institute of Technology, Beijing, 100081, China.
| | - Chenxu Wang
- Department of Vehicle Engineering, School of Mechanical Engineering, Beijing Institute of Technology, Beijing, 100081, China
| | - Fengchun Sun
- Department of Vehicle Engineering, School of Mechanical Engineering, Beijing Institute of Technology, Beijing, 100081, China
| |
|
25
|
'Small Data' for big insights in ecology. Trends Ecol Evol 2023:S0169-5347(23)00019-8. [PMID: 36797167 DOI: 10.1016/j.tree.2023.01.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2022] [Revised: 01/18/2023] [Accepted: 01/25/2023] [Indexed: 02/17/2023]
Abstract
Big Data science has significantly furthered our understanding of complex systems by harnessing large volumes of data, generated at high velocity and in great variety. However, there is a risk that Big Data collection is prioritised to the detriment of 'Small Data' (data with few observations). This poses a particular risk to ecology where Small Data abounds. Machine learning experts are increasingly looking to Small Data to drive the next generation of innovation, leading to development in methods for Small Data such as transfer learning, knowledge graphs, and synthetic data. Meanwhile, meta-analysis and causal reasoning approaches are evolving to provide new insights from Small Data. These advances should add value to high-quality Small Data catalysing future insights for ecology.
|
26
|
Yang X, Qin S, Liu X, Zhang N, Chen J, Jin M, Liu F, Wang Y, Guo J, Shi H, Wang C, Chen Y. Meta-Viromic Sequencing Reveals Virome Characteristics of Mosquitoes and Culicoides on Zhoushan Island, China. Microbiol Spectr 2023; 11:e0268822. [PMID: 36651764 PMCID: PMC9927462 DOI: 10.1128/spectrum.02688-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023] Open
Abstract
Mosquitoes and biting Culicoides species are arbovirus vectors. Effective virome profile surveillance is essential for the prevention and control of insect-borne diseases. From June to September 2021, we collected eight species of female mosquito and Culicoides on Zhoushan Island, China, and used meta-viromic sequencing to analyze their virome compositions and characteristics. The classified virus reads were distributed in 191 genera in 66 families. The virus sequences in mosquitoes with the largest proportions were Iflaviridae (30.03%), Phasmaviridae (23.09%), Xinmoviridae (21.82%), Flaviviridae (13.44%), and Rhabdoviridae (8.40%). Single-strand RNA+ viruses formed the largest proportions of viruses in all samples. Blood meals indicated that blood-sucking mosquito hosts were mainly chicken, duck, pig, and human, broadly consistent with the habitats where the mosquitoes were collected. Novel viruses of the Orthobunyavirus, Narnavirus, and Iflavirus genera were found in Culicoides by de novo assembly. The viruses with vertebrate hosts carried by mosquitoes and Culicoides also varied widely. The analysis of unclassified viruses and deep-learning analysis of the "dark matter" in the meta-viromic sequencing data revealed the presence of a large number of unknown viruses. IMPORTANCE The monitoring of the viromes of mosquitoes and Culicoides, widely distributed arbovirus transmission vectors, is crucial to evaluate the risk of infectious disease transmission. In this study, the compositions of the viromes of mosquitoes and Culicoides on Zhoushan Island varied widely and were related mainly to the host species, with different host species having different core viromes. Many unknown sequences in the Culicoides viromes remain to be annotated, suggesting the presence of a large number of unknown viruses.
Affiliation(s)
- Xiaojing Yang
- School of Public Health, China Medical University, Shenyang, Liaoning Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Shiyu Qin
- College of Public Health, Zhengzhou University, Zhengzhou, Henan Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Xiong Liu
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Na Zhang
- School of Public Health, China Medical University, Shenyang, Liaoning Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Jiali Chen
- School of Public Health, China Medical University, Shenyang, Liaoning Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Meiling Jin
- School of Public Health, China Medical University, Shenyang, Liaoning Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Fangni Liu
- School of Public Health, China Medical University, Shenyang, Liaoning Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Yong Wang
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Jinpeng Guo
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Hua Shi
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Changjun Wang
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Yong Chen
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| |
|