1
Urhan A, Cosma BM, Earl AM, Manson AL, Abeel T. SAFPred: synteny-aware gene function prediction for bacteria using protein embeddings. Bioinformatics 2024; 40:btae328. [PMID: 38775729 PMCID: PMC11147799 DOI: 10.1093/bioinformatics/btae328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Revised: 04/08/2024] [Accepted: 05/21/2024] [Indexed: 06/04/2024] Open
Abstract
MOTIVATION: Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. Recently, transformer-based language models, adopted from the natural language processing field, have been used to obtain new representations of proteins to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications on bacterial genomes.
RESULTS: To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFPred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAFPred to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health.
AVAILABILITY AND IMPLEMENTATION: https://github.com/AbeelLab/safpred.
Affiliation(s)
- Aysun Urhan
- Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Delft XE 2628, The Netherlands
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
- Bianca-Maria Cosma
- Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Delft XE 2628, The Netherlands
- Ashlee M Earl
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
- Abigail L Manson
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
- Thomas Abeel
- Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Delft XE 2628, The Netherlands
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
2
Quan P, Li X, Si Y, Sun L, Ding FF, Fan Y, Liu H, Wei C, Li R, Zhao X, Yang F, Yao L. Single cell analysis reveals the roles and regulatory mechanisms of type-I interferons in Parkinson's disease. Cell Commun Signal 2024; 22:212. [PMID: 38566100 PMCID: PMC10985960 DOI: 10.1186/s12964-024-01590-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Accepted: 03/23/2024] [Indexed: 04/04/2024] Open
Abstract
The pathogenesis of Parkinson's disease (PD) is strongly associated with neuroinflammation, and type I interferons (IFN-I) play a crucial role in regulating immune and inflammatory responses. However, the specific features of IFN in different cell types and the underlying mechanisms of PD have yet to be fully described. In this study, we analyzed the GSE157783 dataset, which includes 39,024 single-cell RNA sequencing results for five PD patients and six healthy controls from the Gene Expression Omnibus database. After cell type annotation, we intersected differentially expressed genes in each cell subcluster with genes collected in The Interferome database to generate an IFN-I-stimulated gene set (ISGs). Based on this gene set, we used the R package AUCell to score each cell, representing the IFN-I activity. Additionally, we performed monocle trajectory analysis and single-cell regulatory network inference and clustering (SCENIC) to uncover the underlying mechanisms. In silico gene perturbation and subsequent experiments confirmed NFATc2 regulation of the type I interferon response and neuroinflammation. Our analysis revealed that microglia, endothelial cells, and pericytes exhibited the highest activity of IFN-I. Furthermore, single-cell trajectory detection demonstrated that microglia in the midbrain of PD patients were in a pro-inflammatory activation state, which was also validated in the 1-Methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP)-induced PD mouse model. We identified the transcription factor NFATc2, which was significantly up-regulated and involved in the expression of ISGs and activation of microglia in PD. In the 1-Methyl-4-phenylpyridinium (MPP+)-induced BV2 cell model, the suppression of NFATc2 reduced IFN-β levels, impeded the phosphorylation of STAT1, and attenuated the activation of the NF-κB pathway. Furthermore, the downregulation of NFATc2 mitigated the detrimental effects on SH-SY5Y cells co-cultured in conditioned medium.
Our study highlights the critical role of microglia in type I interferon responses in PD. Additionally, we identified the transcription factor NFATc2 as a key regulator of aberrant type I interferon responses and microglial pro-inflammatory activation in PD. These findings provide new insights into the pathogenesis of PD and may have implications for the development of novel therapeutic strategies.
Affiliation(s)
- Pusheng Quan
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Department of Neurology, The Affiliated Hospital of Inner Mongolia Medical University, Hohhot, China
- Xueying Li
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Yao Si
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Linlin Sun
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Fei Fan Ding
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Yuwei Fan
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Han Liu
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Chengqun Wei
- Department of General Practice, Heilongjiang Provincial Hospital, Harbin, China
- Ruihua Li
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Xue Zhao
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Fan Yang
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
- Lifen Yao
- Department of Neurology, The First Affiliated Hospital, Harbin Medical University, Harbin, China
3
Chen K, Zhou Y, Ding M, Wang Y, Ren Z, Yang Y. Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction. Brief Bioinform 2024; 25:bbae163. [PMID: 38605640 PMCID: PMC11009468 DOI: 10.1093/bib/bbae163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 02/22/2024] [Accepted: 03/19/2024] [Indexed: 04/13/2024] Open
Abstract
Language models pretrained by self-supervised learning (SSL) have been widely utilized to study protein sequences, while few models were developed for genomic sequences and were limited to single species. Due to the lack of genomes from different species, these models cannot effectively leverage evolutionary information. In this study, we have developed SpliceBERT, a language model pretrained on primary ribonucleic acid (RNA) sequences from 72 vertebrates by masked language modeling, and applied it to sequence-based modeling of RNA splicing. Pretraining SpliceBERT on diverse species enables effective identification of evolutionarily conserved elements. Meanwhile, the learned hidden states and attention weights can characterize the biological properties of splice sites. As a result, SpliceBERT was shown to be effective on several downstream tasks: zero-shot prediction of variant effects on splicing, prediction of branchpoints in humans, and cross-species prediction of splice sites. Our study highlighted the importance of pretraining genomic language models on a diverse range of species and suggested that SSL is a promising approach to enhance our understanding of the regulatory logic underlying genomic sequences.
Affiliation(s)
- Ken Chen
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yue Zhou
- Peng Cheng Laboratory, Shenzhen, China
- Maolin Ding
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yu Wang
- Peng Cheng Laboratory, Shenzhen, China
- Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education, China
4
Ligeti B, Szepesi-Nagy I, Bodnár B, Ligeti-Nagy N, Juhász J. ProkBERT family: genomic language models for microbiome applications. Front Microbiol 2024; 14:1331233. [PMID: 38282738 PMCID: PMC10810988 DOI: 10.3389/fmicb.2023.1331233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 12/11/2023] [Indexed: 01/30/2024] Open
Abstract
Background In the evolving landscape of microbiology and microbiome analysis, the integration of machine learning is crucial for understanding complex microbial interactions, and predicting and recognizing novel functionalities within extensive datasets. However, the effectiveness of these methods in microbiology faces challenges due to the complex and heterogeneous nature of microbial data, further complicated by low signal-to-noise ratios, context-dependency, and a significant shortage of appropriately labeled datasets. This study introduces the ProkBERT model family, a collection of large language models, designed for genomic tasks. It provides a generalizable sequence representation for nucleotide sequences, learned from unlabeled genome data. This approach helps overcome the above-mentioned limitations in the field, thereby improving our understanding of microbial ecosystems and their impact on health and disease. Methods ProkBERT models are based on transfer learning and self-supervised methodologies, enabling them to use the abundant yet complex microbial data effectively. The introduction of the novel Local Context-Aware (LCA) tokenization technique marks a significant advancement, allowing ProkBERT to overcome the contextual limitations of traditional transformer models. This methodology not only retains rich local context but also demonstrates remarkable adaptability across various bioinformatics tasks. Results In practical applications such as promoter prediction and phage identification, the ProkBERT models show superior performance. For promoter prediction tasks, the top-performing model achieved a Matthews Correlation Coefficient (MCC) of 0.74 for E. coli and 0.62 in mixed-species contexts. In phage identification, ProkBERT models consistently outperformed established tools like VirSorter2 and DeepVirFinder, achieving an MCC of 0.85. 
These results underscore the models' exceptional accuracy and generalizability in both supervised and unsupervised tasks. Conclusions The ProkBERT model family is a compact yet powerful tool in the field of microbiology and bioinformatics. Its capacity for rapid, accurate analyses and its adaptability across a spectrum of tasks mark a significant advancement in machine learning applications in microbiology. The models are available on GitHub (https://github.com/nbrg-ppcu/prokbert) and HuggingFace (https://huggingface.co/nerualbioinfo), providing an accessible tool for the community.
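For readers unfamiliar with the metric, the Matthews Correlation Coefficient (MCC) reported in this abstract can be computed from the four counts of a binary confusion matrix. The following is a minimal sketch and not the ProkBERT evaluation code; the function name and the example counts are illustrative assumptions.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the MCC for binary confusion-matrix counts.

    By convention, 0.0 is returned when the denominator is zero
    (for example, when one class is never predicted).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, chosen only to illustrate the calculation.
score = matthews_corrcoef(tp=90, tn=85, fp=15, fn=10)
print(f"MCC = {score:.3f}")
```

MCC ranges from -1 to 1, and a value of 0 corresponds to chance-level prediction, which makes it a useful summary when classes are imbalanced.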
Affiliation(s)
- Balázs Ligeti
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- István Szepesi-Nagy
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- Babett Bodnár
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- Noémi Ligeti-Nagy
- Language Technology Research Group, HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary
- János Juhász
- Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- Institute of Medical Microbiology, Semmelweis University, Budapest, Hungary
5
McGuinness KN, Fehon N, Feehan R, Miller M, Mutter AC, Rybak LA, Nam J, AbuSalim JE, Atkinson JT, Heidari H, Losada N, Kim JD, Koder RL, Lu Y, Silberg JJ, Slusky JSG, Falkowski PG, Nanda V. The energetics and evolution of oxidoreductases in deep time. Proteins 2024; 92:52-59. [PMID: 37596815 DOI: 10.1002/prot.26563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 07/06/2023] [Indexed: 08/20/2023]
Abstract
The core metabolic reactions of life drive electrons through a class of redox protein enzymes, the oxidoreductases. The energetics of electron flow is determined by the redox potentials of organic and inorganic cofactors as tuned by the protein environment. Understanding how protein structure affects oxidation-reduction energetics is crucial for studying metabolism, creating bioelectronic systems, and tracing the history of biological energy utilization on Earth. We constructed ProtReDox (https://protein-redox-potential.web.app), a manually curated database of experimentally determined redox potentials. With over 500 measurements, we can begin to identify how proteins modulate oxidation-reduction energetics across the tree of life. By mapping redox potentials onto networks of oxidoreductase fold evolution, we can infer the evolution of electron transfer energetics over deep time. ProtReDox is designed to include user-contributed submissions with the intention of making it a valuable resource for researchers in this field.
Affiliation(s)
- Kenneth N McGuinness
- Department of Natural Sciences, Caldwell University, Caldwell, New Jersey, USA
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Nolan Fehon
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Ryan Feehan
- Computational Biology Program, The University of Kansas, Lawrence, Kansas, USA
- Michelle Miller
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Andrew C Mutter
- Department of Physics, The City College of New York, New York, New York, USA
- Laryssa A Rybak
- Department of Physics, The City College of New York, New York, New York, USA
- Justin Nam
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Jenna E AbuSalim
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Joshua T Atkinson
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, USA
- Hirbod Heidari
- Department of Chemistry, University of Texas at Austin, Austin, Texas, USA
- Natalie Losada
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- J Dongun Kim
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Ronald L Koder
- Department of Physics, The City College of New York, New York, New York, USA
- Yi Lu
- Department of Chemistry, University of Texas at Austin, Austin, Texas, USA
- Jonathan J Silberg
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, USA
- Joanna S G Slusky
- Computational Biology Program, The University of Kansas, Lawrence, Kansas, USA
- Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas, USA
- Paul G Falkowski
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Department of Earth and Planetary Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Vikas Nanda
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers University, Piscataway, New Jersey, USA
6
Aliperti L, Aptekmann AA, Farfañuk G, Couso LL, Soler-Bistué A, Sánchez IE. r/K selection of GC content in prokaryotes. Environ Microbiol 2023; 25:3255-3268. [PMID: 37813828 DOI: 10.1111/1462-2920.16511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Accepted: 09/16/2023] [Indexed: 10/11/2023]
Abstract
The guanine/cytosine (GC) content of prokaryotic genomes is species-specific, taking values from 16% to 77%. This diversity of selection for GC content remains contentious. We analyse the correlations between GC content and a range of phenotypic and genotypic data in thousands of prokaryotes. GC content integrates well with these traits into r/K selection theory when phenotypic plasticity is considered. High GC-content prokaryotes are r-strategists with cheaper descendants thanks to a lower average amino acid metabolic cost, colonize unstable environments thanks to flagella and a bacillus form and are generalists in terms of resource opportunism and their defence mechanisms. Low GC content prokaryotes are K-strategists specialized for stable environments that maintain homeostasis via a high-cost outer cell membrane and endospore formation as a response to nutrient deprivation, and attain a higher nutrient-to-biomass yield. The lower proteome cost of high GC content prokaryotes is driven by the association between GC-rich codons and cheaper amino acids in the genetic code, while the correlation between GC content and genome size may be partly due to functional diversity driven by r/K selection. In all, molecular diversity in the GC content of prokaryotes may be a consequence of ecological r/K selection.
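As a brief illustration of the quantity this abstract analyses, GC content is the percentage of G and C bases among the unambiguous bases of a sequence. The sketch below is not the authors' pipeline; the function name and the toy sequence are hypothetical, and ignoring ambiguous bases such as N is an assumption of this example.

```python
def gc_content(seq: str) -> float:
    """Return GC content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / total


# Toy sequence for illustration only: 4 of 6 bases are G or C.
print(f"{gc_content('ATGCGC'):.1f}%")
```

Applied to whole genomes, this kind of calculation yields the species-level values that the abstract reports as ranging from 16% to 77%.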
Affiliation(s)
- Lucio Aliperti
- Facultad de Ciencias Exactas y Naturales. Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires, Argentina
- Ariel A Aptekmann
- Marine and Coastal Sciences Department, Rutgers University, New Brunswick, New Jersey, USA
- Gonzalo Farfañuk
- Facultad de Ciencias Exactas y Naturales. Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires, Argentina
- Luciana L Couso
- Facultad de Agronomía, Cátedra de Genética, Universidad de Buenos Aires, Buenos Aires, Argentina
- Alfonso Soler-Bistué
- Instituto de Investigaciones Biotecnológicas Dr. Rodolfo A. Ugalde, CONICET, Universidad Nacional de San Martín, San Martin, Argentina
- Ignacio E Sánchez
- Facultad de Ciencias Exactas y Naturales. Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires, Argentina
7
Ma B, Lu C, Wang Y, Yu J, Zhao K, Xue R, Ren H, Lv X, Pan R, Zhang J, Zhu Y, Xu J. A genomic catalogue of soil microbiomes boosts mining of biodiversity and genetic resources. Nat Commun 2023; 14:7318. [PMID: 37951952 PMCID: PMC10640626 DOI: 10.1038/s41467-023-43000-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/08/2023] [Accepted: 10/27/2023] [Indexed: 11/14/2023] Open
Abstract
Soil harbors a vast expanse of unidentified microbes, termed microbial dark matter, presenting an untapped reservoir of microbial biodiversity and genetic resources that has yet to be fully explored. In this study, we conduct a large-scale excavation of soil microbial dark matter by reconstructing 40,039 metagenome-assembled genome bins (the SMAG catalogue) from 3304 soil metagenomes. We identify 16,530 of 21,077 species-level genome bins (SGBs) as unknown SGBs (uSGBs), which expand archaeal and bacterial diversity across the tree of life. We also illustrate the pivotal role of uSGBs in augmenting the soil microbiome's functional landscape and intra-species genome diversity, providing large proportions of the 43,169 biosynthetic gene clusters and 8545 CRISPR-Cas genes. Additionally, we determine that uSGBs contributed 84.6% of previously unexplored viral-host associations from the SMAG catalogue. The SMAG catalogue provides a useful genomic resource for further studies investigating soil microbial biodiversity and genetic resources.
Affiliation(s)
- Bin Ma
- Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
- Caiyu Lu
- Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
- Yiling Wang
- Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
- Jingwen Yu
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
- Kankan Zhao
- Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China
- Ran Xue
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
- Hao Ren
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
- Xiaofei Lv
- Department of Environmental Engineering, China Jiliang University, Hangzhou, 310018, China
- Ronghui Pan
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
- Jiabao Zhang
- State Key Laboratory of Soil and Sustainable Agriculture, Institute of Soil Science, Chinese Academy of Sciences, Nanjing, 210008, China
- Yongguan Zhu
- Research Center for Eco-environmental Sciences, Chinese Academy of Sciences, Beijing, 100085, China
- Jianming Xu
- Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China
- Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China
8
Benegas G, Batra SS, Song YS. DNA language models are powerful predictors of genome-wide variant effects. Proc Natl Acad Sci U S A 2023; 120:e2311219120. [PMID: 37883436 PMCID: PMC10622914 DOI: 10.1073/pnas.2311219120] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 07/03/2023] [Accepted: 09/08/2023] [Indexed: 10/28/2023] Open
Abstract
The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need for accurate, scalable computational methods to predict the effects of genetic variants across the entire genome. Inspired by recent progress in natural language processing, unsupervised pretraining on large protein sequence databases has proven successful in extracting complex information related to proteins. These models showcase their ability to learn variant effects in coding regions using an unsupervised approach. Expanding on this idea, we here introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects through unsupervised pretraining on genomic DNA sequences. Our model also successfully learns gene structure and DNA motifs without any supervision. To demonstrate its utility, we train GPN on unaligned reference genomes of Arabidopsis thaliana and seven related species within the Brassicales order and evaluate its ability to predict the functional impact of genetic variants in A. thaliana by utilizing allele frequencies from the 1001 Genomes Project and a comprehensive database of GWAS. Notably, GPN outperforms predictors based on popular conservation scores such as phyloP and phastCons. Our predictions for A. thaliana can be visualized as sequence logos in the UCSC Genome Browser (https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis). We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone, enabling unsupervised prediction of variant effects across the entire genome.
Affiliation(s)
- Gonzalo Benegas
- Graduate Group in Computational Biology, University of California, Berkeley, CA 94720
- Yun S. Song
- Computer Science Division, University of California, Berkeley, CA 94720
- Department of Statistics, University of California, Berkeley, CA 94720
- Center for Computational Biology, University of California, Berkeley, CA 94720
9
Mahlich Y, Zhu C, Chung H, Velaga PK, De Paolis Kaluza M, Radivojac P, Friedberg I, Bromberg Y. Learning from the unknown: exploring the range of bacterial functionality. Nucleic Acids Res 2023; 51:10162-10175. [PMID: 37739408 PMCID: PMC10602916 DOI: 10.1093/nar/gkad757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2023] [Accepted: 09/11/2023] [Indexed: 09/24/2023] Open
Abstract
Determining the repertoire of a microbe's molecular functions is a central question in microbial biology. Modern techniques achieve this goal by comparing microbial genetic material against reference databases of functionally annotated genes/proteins or known taxonomic markers such as 16S rRNA. Here, we describe a novel approach to exploring bacterial functional repertoires without reference databases. Our Fusion scheme establishes functional relationships between bacteria and assigns organisms to Fusion-taxa that differ from otherwise defined taxonomic clades. Three key findings of our work stand out. First, bacterial functional comparisons outperform marker genes in assigning taxonomic clades. Fusion profiles are also better for this task than other functional annotation schemes. Second, Fusion-taxa are robust to addition of novel organisms and are, arguably, able to capture the environment-driven bacterial diversity. Finally, our alignment-free nucleic acid-based Siamese Neural Network model, created using Fusion functions, enables finding shared functionality of very distant, possibly structurally different, microbial homologs. Our work can thus help annotate functional repertoires of bacterial organisms and further guide our understanding of microbial communities.
Affiliation(s)
- Yannick Mahlich
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- Chengsheng Zhu
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- Xbiome Inc., 1 Broadway, 14th fl, Cambridge, MA 02142, USA
- Henri Chung
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
- Interdepartmental program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA 50011, USA
- Pavan K Velaga
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- M Clara De Paolis Kaluza
- Khoury College of Computer Sciences, Northeastern University, 177 Huntington Avenue, Boston, MA 02115, USA
- Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, 177 Huntington Avenue, Boston, MA 02115, USA
- Iddo Friedberg
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
- Interdepartmental program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA 50011, USA
- Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New Brunswick, NJ 08873, USA
- Department of Biology, Emory University, 1510 Clifton Road NE, Atlanta, GA 30322, USA
- Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA 30322, USA
10
Medina-Chávez NO, Viladomat-Jasso M, Zarza E, Islas-Robles A, Valdivia-Anistro J, Thalasso-Siret F, Eguiarte LE, Olmedo-Álvarez G, Souza V, De la Torre-Zavala S. A Transiently Hypersaline Microbial Mat Harbors a Diverse and Stable Archaeal Community in the Cuatro Cienegas Basin, Mexico. Astrobiology 2023; 23:796-811. [PMID: 37279013 DOI: 10.1089/ast.2021.0047] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/07/2023]
Abstract
Microbial mats are biologically diverse communities that are analogs to some of the earliest ecosystems on Earth. In this study, we describe a unique transiently hypersaline microbial mat uncovered in a shallow pond within the Cuatro Cienegas Basin (CCB) in northern México. The CCB is an endemism-rich site that harbors living stromatolites that have been studied to understand the conditions of the Precambrian Earth. These microbial mats form elastic domes filled with biogenic gas, and the mats have a relatively large and stable subpopulation of archaea. For this reason, this site has been termed archaean domes (AD). The AD microbial community was analyzed by metagenomics over three seasons. The mat exhibited a highly diverse prokaryotic community dominated by bacteria. Bacterial sequences are represented in 37 phyla, mainly Proteobacteria, Firmicutes, and Actinobacteria, that together comprised >50% of the sequences from the mat. Archaea represented up to 5% of the retrieved sequences, with up to 230 different archaeal species that belong to 5 phyla (Euryarchaeota, Crenarchaeota, Thaumarchaeota, Korarchaeota, and Nanoarchaeota). The archaeal taxa showed low variation despite fluctuations in water and nutrient availability. In addition, predicted functions highlight stress responses to extreme conditions present in the AD, including salinity, pH, and water/drought fluctuation. The observed complexity of the AD mat thriving in high pH and fluctuating water and salt conditions within the CCB provides an extant model of great value for evolutionary studies, as well as a suitable analog to the early Earth and Mars.
Affiliation(s)
- Nahui-Olin Medina-Chávez
  - Ecology, Evolution and Behavior, University of Minnesota, St. Paul, Minnesota, USA
  - Universidad Autónoma de Nuevo León, Facultad de Ciencias Biológicas, Instituto de Biotecnología, San Nicolás de los Garza, México
- Eugenia Zarza
  - Departamento de Ciencias de la Sustentabilidad, El Colegio de la Frontera Sur, Tapachula, Mexico
  - Consejo Nacional de Ciencia y Tecnología, Ciudad de México, México
- Africa Islas-Robles
  - Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados del I.P.N. Campus Irapuato, Irapuato, México
- Jorge Valdivia-Anistro
  - Unidad Multidisciplinaria de Investigación Experimental Zaragoza, Facultad de Estudios Superiores Zaragoza, UNAM, Ciudad de México, México
- Frédéric Thalasso-Siret
  - Departamento de Biotecnología y Bioingeniería, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional, Ciudad de México, Mexico
- Luis E Eguiarte
  - Departamento de Ecología Evolutiva, Instituto de Ecología, UNAM, Ciudad de México, México
  - Centro de Estudios del Cuaternario de Fuego-Patagonia y Antártica (CEQUA), Punta Arenas, Chile
- Gabriela Olmedo-Álvarez
  - Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados del I.P.N. Campus Irapuato, Irapuato, México
- Valeria Souza
  - Departamento de Ecología Evolutiva, Instituto de Ecología, UNAM, Ciudad de México, México
  - Centro de Estudios del Cuaternario de Fuego-Patagonia y Antártica (CEQUA), Punta Arenas, Chile
- Susana De la Torre-Zavala
  - Universidad Autónoma de Nuevo León, Facultad de Ciencias Biológicas, Instituto de Biotecnología, San Nicolás de los Garza, México
11
Lu J, Xiong R, Tian J, Wang C, Sun F. Deep learning to estimate lithium-ion battery state of health without additional degradation experiments. Nat Commun 2023; 14:2760. [PMID: 37179411 PMCID: PMC10183024 DOI: 10.1038/s41467-023-38458-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/16/2022] [Accepted: 05/03/2023] [Indexed: 05/15/2023] Open
Abstract
State of health is a critical state which evaluates the degradation level of batteries. However, it cannot be measured directly but requires estimation. While accurate state of health estimation has progressed markedly, the time- and resource-consuming degradation experiments to generate target battery labels hinder the development of state of health estimation methods. In this article, we design a deep-learning framework to enable the estimation of battery state of health in the absence of target battery labels. This framework integrates a swarm of deep neural networks equipped with domain adaptation to produce accurate estimation. We employ 65 commercial batteries from 5 different manufacturers to generate 71,588 samples for cross-validation. The validation results indicate that the proposed framework can ensure absolute errors of less than 3% for 89.4% of samples (less than 5% for 98.9% of samples), with a maximum absolute error of less than 8.87% in the absence of target labels. This work emphasizes the power of deep learning in precluding degradation experiments and highlights the promise of rapid development of battery management algorithms for new-generation batteries using only previous experimental data.
Affiliation(s)
- Jiahuan Lu
  - Department of Vehicle Engineering, School of Mechanical Engineering, Beijing Institute of Technology, Beijing, 100081, China
- Rui Xiong
  - Department of Vehicle Engineering, School of Mechanical Engineering, Beijing Institute of Technology, Beijing, 100081, China
- Jinpeng Tian
  - Department of Vehicle Engineering, School of Mechanical Engineering, Beijing Institute of Technology, Beijing, 100081, China
- Chenxu Wang
  - Department of Vehicle Engineering, School of Mechanical Engineering, Beijing Institute of Technology, Beijing, 100081, China
- Fengchun Sun
  - Department of Vehicle Engineering, School of Mechanical Engineering, Beijing Institute of Technology, Beijing, 100081, China
12
'Small Data' for big insights in ecology. Trends Ecol Evol 2023:S0169-5347(23)00019-8. [PMID: 36797167 DOI: 10.1016/j.tree.2023.01.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2022] [Revised: 01/18/2023] [Accepted: 01/25/2023] [Indexed: 02/17/2023]
Abstract
Big Data science has significantly furthered our understanding of complex systems by harnessing large volumes of data, generated at high velocity and in great variety. However, there is a risk that Big Data collection is prioritised to the detriment of 'Small Data' (data with few observations). This poses a particular risk to ecology where Small Data abounds. Machine learning experts are increasingly looking to Small Data to drive the next generation of innovation, leading to development in methods for Small Data such as transfer learning, knowledge graphs, and synthetic data. Meanwhile, meta-analysis and causal reasoning approaches are evolving to provide new insights from Small Data. These advances should add value to high-quality Small Data catalysing future insights for ecology.
13
Yang X, Qin S, Liu X, Zhang N, Chen J, Jin M, Liu F, Wang Y, Guo J, Shi H, Wang C, Chen Y. Meta-Viromic Sequencing Reveals Virome Characteristics of Mosquitoes and Culicoides on Zhoushan Island, China. Microbiol Spectr 2023; 11:e0268822. [PMID: 36651764 PMCID: PMC9927462 DOI: 10.1128/spectrum.02688-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023] Open
Abstract
Mosquitoes and biting Culicoides species are arbovirus vectors. Effective virome profile surveillance is essential for the prevention and control of insect-borne diseases. From June to September 2021, we collected eight species of female mosquito and Culicoides on Zhoushan Island, China, and used meta-viromic sequencing to analyze their virome compositions and characteristics. The classified virus reads were distributed in 191 genera in 66 families. The virus sequences in mosquitoes with the largest proportions were Iflaviridae (30.03%), Phasmaviridae (23.09%), Xinmoviridae (21.82%), Flaviviridae (13.44%), and Rhabdoviridae (8.40%). Single-strand RNA+ viruses formed the largest proportions of viruses in all samples. Blood meals indicated that blood-sucking mosquito hosts were mainly chicken, duck, pig, and human, broadly consistent with the habitats where the mosquitoes were collected. Novel viruses of the Orthobunyavirus, Narnavirus, and Iflavirus genera were found in Culicoides by de-novo assembly. The viruses with vertebrate hosts carried by mosquitoes and Culicoides also varied widely. The analysis of unclassified viruses and deep-learning analysis of the "dark matter" in the meta-viromic sequencing data revealed the presence of a large number of unknown viruses. IMPORTANCE The monitoring of the viromes of mosquitoes and Culicoides, widely distributed arbovirus transmission vectors, is crucial to evaluate the risk of infectious disease transmission. In this study, the compositions of the viromes of mosquitoes and Culicoides on Zhoushan Island varied widely and were related mainly to the host species, with different host species having different core viromes. Many unknown sequences in the Culicoides viromes remain to be annotated, suggesting the presence of a large number of unknown viruses.
Affiliation(s)
- Xiaojing Yang
  - School of Public Health, China Medical University, Shenyang, Liaoning Province, China
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China
- Shiyu Qin
  - College of Public Health, Zhengzhou University, Zhengzhou, Henan Province, China
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China
- Xiong Liu
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China
- Na Zhang
  - School of Public Health, China Medical University, Shenyang, Liaoning Province, China
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China
- Jiali Chen
  - School of Public Health, China Medical University, Shenyang, Liaoning Province, China
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China
- Meiling Jin
  - School of Public Health, China Medical University, Shenyang, Liaoning Province, China
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China
- Fangni Liu
  - School of Public Health, China Medical University, Shenyang, Liaoning Province, China
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China
- Yong Wang
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China
- Jinpeng Guo
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China
- Hua Shi
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China
- Changjun Wang
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China
- Yong Chen
  - Chinese PLA Center for Disease Control and Prevention, Beijing, China