Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Hoarfrost A, Aptekmann A, Farfañuk G, Bromberg Y. Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter. Nat Commun 2022;13:2606. [PMID: 35545619 PMCID: PMC9095714 DOI: 10.1038/s41467-022-30070-8] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 01/20/2021] [Accepted: 03/30/2022] [Indexed: 12/22/2022] Open

Cited by Other Article(s)

Jamshidi MB, Hoang DT, Nguyen DN, Niyato D, Warkiani ME. Revolutionizing biological digital twins: Integrating internet of bio-nano things, convolutional neural networks, and federated learning. Comput Biol Med 2025;189:109970. [PMID: 40101583 DOI: 10.1016/j.compbiomed.2025.109970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2024] [Revised: 02/28/2025] [Accepted: 03/01/2025] [Indexed: 03/20/2025]

Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic language models: opportunities and challenges. Trends Genet 2025;41:286-302. [PMID: 39753409 DOI: 10.1016/j.tig.2024.11.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/16/2024] [Revised: 11/21/2024] [Accepted: 11/21/2024] [Indexed: 04/10/2025]

Duan C, Zang Z, Xu Y, He H, Li S, Liu Z, Lei Z, Zheng JS, Li SZ. FGeneBERT: function-driven pre-trained gene language model for metagenomics. Brief Bioinform 2025;26:bbaf149. [PMID: 40211978 PMCID: PMC11986344 DOI: 10.1093/bib/bbaf149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2024] [Revised: 02/22/2025] [Accepted: 03/14/2025] [Indexed: 04/14/2025] Open

Affiliation(s)

Chenrui Duan College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
Zelin Zang Centre for Artificial Intelligence and Robotics (CAIR), HKISI-CAS Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong 310000, China
Yongjie Xu College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
Hang He School of Medicine and School of Life Sciences, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
Siyuan Li College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
Zihan Liu College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
Zhen Lei Centre for Artificial Intelligence and Robotics (CAIR), HKISI-CAS Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong 310000, China State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China
Ju-Sheng Zheng School of Medicine and School of Life Sciences, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
Stan Z Li School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China

Prabakaran R, Bromberg Y. Functional profiling of the sequence stockpile: a protein pair-based assessment of in silico prediction tools. Bioinformatics 2025;41:btaf035. [PMID: 39854283 PMCID: PMC11821270 DOI: 10.1093/bioinformatics/btaf035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2024] [Revised: 11/04/2024] [Accepted: 01/22/2025] [Indexed: 01/26/2025] Open

Abstract

MOTIVATION

In silico functional annotation of proteins is crucial to narrowing the sequencing-accelerated gap in our understanding of protein activities. Numerous function annotation methods exist, and their ranks have been growing, particularly so with the recent deep learning-based developments. However, it is unclear if these tools are truly predictive. As we are not aware of any methods that can identify new terms in functional ontologies, we ask if they can, at least, identify molecular functions of proteins that are non-homologous to or far-removed from known protein families.

RESULTS

Here, we explore the potential and limitations of the existing methods in predicting the molecular functions of thousands of such proteins. Lacking the "ground truth" functional annotations, we transformed the assessment of function prediction into evaluation of functional similarity of protein pairs that likely share function but are unlike any of the currently functionally annotated sequences. Notably, our approach transcends the limitations of functional annotation vocabularies, providing a means to assess different-ontology annotation methods. We find that most existing methods are limited to identifying functional similarity of homologous sequences and fail to predict the function of proteins lacking reference. Curiously, despite their seemingly unlimited by-homology scope, deep learning methods also have trouble capturing the functional signal encoded in protein sequence. We believe that our work will inspire the development of a new generation of methods that push boundaries and promote exploration and discovery in the molecular function domain.

AVAILABILITY AND IMPLEMENTATION

The data underlying this article are available at https://doi.org/10.6084/m9.figshare.c.6737127.v3. The code used to compute siblings is available openly at https://bitbucket.org/bromberglab/siblings-detector/.

Mi K, Xu R, Liu X. RFW captures species-level metagenomic functions by integrating genome annotation information. CELL REPORTS METHODS 2024;4:100932. [PMID: 39662474 PMCID: PMC11704624 DOI: 10.1016/j.crmeth.2024.100932] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2024] [Revised: 09/01/2024] [Accepted: 11/14/2024] [Indexed: 12/13/2024]

Dou Z, He J, Han C, Wu X, Wan L, Yang J, Zheng Y, Gong B, Wang L. qProtein: Exploring Physical Features of Protein Thermostability Based on Structural Proteomics. J Chem Inf Model 2024;64:7885-7894. [PMID: 39375829 DOI: 10.1021/acs.jcim.4c01303] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/09/2024]

Abstract

Thermostability, which is essential for the functional performance of enzymes, is largely determined by intramolecular physical interactions. Although many tools have been developed, existing computational methods have struggled to find the universal principles of protein thermostability. Recent advancements in structural proteomics have been driven by the introduction of deep neural networks such as AlphaFold2 and ESMFold. These innovations have enabled the characterization of protein structures with unprecedented speed and accuracy. Here, we introduce qProtein, a Python-implemented workflow designed for the quantitative analysis of physical interactions on the scale of structural proteomics. This platform accepts protein sequences as input and produces four structural features, including hydrophobic clusters, hydrogen bonds, electrostatic interactions, and disulfide bonds. To demonstrate the use of qProtein, we investigate the structural features related to protein thermostability in six glycoside hydrolase (GH) families, comprising a total of 3,811 protein structures. Our results indicate that in five enzyme families (GH11, GH12, GH5_2, GH10, and GH48), the thermophilic enzymes have a larger average area of hydrophobic clusters compared to the nonthermophilic enzymes within each family. Furthermore, our analysis of the local-structure regions reveals that the hydrophobic clusters are predominantly distributed in the distal regions of the GH11 enzymes. In addition, the average hydrophobic cluster area of the thermophilic enzymes is significantly higher than that of the nonthermophilic enzymes in the distal regions of the GH11 enzymes. Therefore, qProtein is a well-suited platform for analyzing the structural features of thermal stability at the level of structural proteomics. We provide the source code for qProtein at https://github.com/bj600800/qProtein, and the web server is available at http://qProtein.sdu.edu.cn:8888.

Bobbo T, Biscarini F, Yaddehige SK, Alberghini L, Rigoni D, Bianchi N, Taccioli C. Machine learning classification of archaea and bacteria identifies novel predictive genomic features. BMC Genomics 2024;25:955. [PMID: 39402493 PMCID: PMC11472548 DOI: 10.1186/s12864-024-10832-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Accepted: 09/24/2024] [Indexed: 10/19/2024] Open

Karavaeva V, Sousa FL. Navigating the archaeal frontier: insights and projections from bioinformatic pipelines. Front Microbiol 2024;15:1433224. [PMID: 39380680 PMCID: PMC11459464 DOI: 10.3389/fmicb.2024.1433224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Accepted: 08/28/2024] [Indexed: 10/10/2024] Open

Abstract

Archaea continues to be one of the least investigated domains of life, and in recent years, the advent of metagenomics has led to the discovery of many new lineages at the phylum level. For the majority, only automatic genomic annotations can provide information regarding their metabolic potential and role in the environment. Here, genomic data from 2,978 archaeal genomes was used to perform automatic annotations using bioinformatics tools, alongside synteny analysis. These automatic classifications were done to assess how good these different tools perform in relation to archaeal data. Our study revealed that even with lowered cutoffs, several functional models do not capture the recently discovered archaeal diversity. Moreover, our investigation revealed that a significant portion of archaeal genomes, approximately 42%, remain uncharacterized. In comparison, within 3,235 bacterial genomes, a diverse range of unclassified proteins is obtained, with well-studied organisms like Escherichia coli having a substantially lower proportion of uncharacterized regions, ranging from <5 to 25%, and less studied lineages being comparable to archaea with the range of 35-40% of unclassified regions. Leveraging this analysis, we were able to identify metabolic protein markers, thereby providing insights into the metabolism of the archaea in our dataset. Our findings underscore a substantial gap between automatic classification tools and the comprehensive mapping of archaeal metabolism. Despite advances in computational approaches, a significant portion of archaeal genomes remains unexplored, highlighting the need for extensive experimental validation in this domain, as well as more refined annotation methods. This study contributes to a better understanding of archaeal metabolism and underscores the importance of further research in elucidating the functional potential of archaeal genomes.

Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic Language Models: Opportunities and Challenges. ARXIV 2024:arXiv:2407.11435v2. [PMID: 39070037 PMCID: PMC11275703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2024]

Fu Y, Yu S, Li J, Lao Z, Yang X, Lin Z. DeepMineLys: Deep mining of phage lysins from human microbiome. Cell Rep 2024;43:114583. [PMID: 39110597 DOI: 10.1016/j.celrep.2024.114583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Revised: 06/21/2024] [Accepted: 07/19/2024] [Indexed: 09/01/2024] Open

Mendoza-Revilla J, Trop E, Gonzalez L, Roller M, Dalla-Torre H, de Almeida BP, Richard G, Caton J, Lopez Carranza N, Skwark M, Laterre A, Beguir K, Pierrot T, Lopez M. A foundational large language model for edible plant genomes. Commun Biol 2024;7:835. [PMID: 38982288 PMCID: PMC11233511 DOI: 10.1038/s42003-024-06465-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 06/17/2024] [Indexed: 07/11/2024] Open

Qiu Z, Zhu Y, Zhang Q, Qiao X, Mu R, Xu Z, Yan Y, Wang F, Zhang T, Zhuang WQ, Yu K. Unravelling biosynthesis and biodegradation potentials of microbial dark matters in hypersaline lakes. ENVIRONMENTAL SCIENCE AND ECOTECHNOLOGY 2024;20:100359. [PMID: 39221074 PMCID: PMC11361885 DOI: 10.1016/j.ese.2023.100359] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/12/2023] [Revised: 11/26/2023] [Accepted: 11/26/2023] [Indexed: 09/04/2024]

Urhan A, Cosma BM, Earl AM, Manson AL, Abeel T. SAFPred: synteny-aware gene function prediction for bacteria using protein embeddings. Bioinformatics 2024;40:btae328. [PMID: 38775729 PMCID: PMC11147799 DOI: 10.1093/bioinformatics/btae328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Revised: 04/08/2024] [Accepted: 05/21/2024] [Indexed: 06/04/2024] Open

Abstract

MOTIVATION

Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. Recently, transformer-based language models, adopted from the natural language processing field, have been used to obtain new representations of proteins, to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications on bacterial genomes.

RESULTS

To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFPred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAFPred to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health.

AVAILABILITY AND IMPLEMENTATION

https://github.com/AbeelLab/safpred.

Quan P, Li X, Si Y, Sun L, Ding FF, Fan Y, Liu H, Wei C, Li R, Zhao X, Yang F, Yao L. Single cell analysis reveals the roles and regulatory mechanisms of type-I interferons in Parkinson's disease. Cell Commun Signal 2024;22:212. [PMID: 38566100 PMCID: PMC10985960 DOI: 10.1186/s12964-024-01590-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Accepted: 03/23/2024] [Indexed: 04/04/2024] Open

Abstract

The pathogenesis of Parkinson's disease (PD) is strongly associated with neuroinflammation, and type I interferons (IFN-I) play a crucial role in regulating immune and inflammatory responses. However, the specific features of IFN in different cell types and the underlying mechanisms of PD have yet to be fully described. In this study, we analyzed the GSE157783 dataset, which includes 39,024 single-cell RNA sequencing results for five PD patients and six healthy controls from the Gene Expression Omnibus database. After cell type annotation, we intersected differentially expressed genes in each cell subcluster with genes collected in The Interferome database to generate an IFN-I-stimulated gene set (ISGs). Based on this gene set, we used the R package AUCell to score each cell, representing the IFN-I activity. Additionally, we performed monocle trajectory analysis, and single-cell regulatory network inference and clustering (SCENIC) to uncover the underlying mechanisms. In silico gene perturbation and subsequent experiments confirm NFATc2 regulation of type I interferon response and neuroinflammation. Our analysis revealed that microglia, endothelial cells, and pericytes exhibited the highest activity of IFN-I. Furthermore, single-cell trajectory detection demonstrated that microglia in the midbrain of PD patients were in a pro-inflammatory activation state, which was validated in the 1-Methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP)-induced PD mouse model as well. We identified transcription factors NFATc2, which was significantly up-regulated and involved in the expression of ISGs and activation of microglia in PD. In the 1-Methyl-4-phenylpyridinium (MPP+)-induced BV2 cell model, the suppression of NFATc2 resulted in a reduction in IFN-β levels, impeding the phosphorylation of STAT1, and attenuating the activation of the NF-κB pathway. Furthermore, the downregulation of NFATc2 mitigated the detrimental effects on SH-SY5Y cells co-cultured in conditioned medium. 
Our study highlights the critical role of microglia in type I interferon responses in PD. Additionally, we identified transcription factors NFATc2 as key regulators of aberrant type I interferon responses and microglial pro-inflammatory activation in PD. These findings provide new insights into the pathogenesis of PD and may have implications for the development of novel therapeutic strategies.

Chen K, Zhou Y, Ding M, Wang Y, Ren Z, Yang Y. Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction. Brief Bioinform 2024;25:bbae163. [PMID: 38605640 PMCID: PMC11009468 DOI: 10.1093/bib/bbae163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 02/22/2024] [Accepted: 03/19/2024] [Indexed: 04/13/2024] Open

Ligeti B, Szepesi-Nagy I, Bodnár B, Ligeti-Nagy N, Juhász J. ProkBERT family: genomic language models for microbiome applications. Front Microbiol 2024;14:1331233. [PMID: 38282738 PMCID: PMC10810988 DOI: 10.3389/fmicb.2023.1331233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 12/11/2023] [Indexed: 01/30/2024] Open

Abstract

Background

In the evolving landscape of microbiology and microbiome analysis, the integration of machine learning is crucial for understanding complex microbial interactions, and predicting and recognizing novel functionalities within extensive datasets. However, the effectiveness of these methods in microbiology faces challenges due to the complex and heterogeneous nature of microbial data, further complicated by low signal-to-noise ratios, context-dependency, and a significant shortage of appropriately labeled datasets. This study introduces the ProkBERT model family, a collection of large language models, designed for genomic tasks. It provides a generalizable sequence representation for nucleotide sequences, learned from unlabeled genome data. This approach helps overcome the above-mentioned limitations in the field, thereby improving our understanding of microbial ecosystems and their impact on health and disease.

Methods

ProkBERT models are based on transfer learning and self-supervised methodologies, enabling them to use the abundant yet complex microbial data effectively. The introduction of the novel Local Context-Aware (LCA) tokenization technique marks a significant advancement, allowing ProkBERT to overcome the contextual limitations of traditional transformer models. This methodology not only retains rich local context but also demonstrates remarkable adaptability across various bioinformatics tasks.

Results

In practical applications such as promoter prediction and phage identification, the ProkBERT models show superior performance. For promoter prediction tasks, the top-performing model achieved a Matthews Correlation Coefficient (MCC) of 0.74 for E. coli and 0.62 in mixed-species contexts. In phage identification, ProkBERT models consistently outperformed established tools like VirSorter2 and DeepVirFinder, achieving an MCC of 0.85. These results underscore the models' exceptional accuracy and generalizability in both supervised and unsupervised tasks.

Conclusions

The ProkBERT model family is a compact yet powerful tool in the field of microbiology and bioinformatics. Its capacity for rapid, accurate analyses and its adaptability across a spectrum of tasks marks a significant advancement in machine learning applications in microbiology. The models are available on GitHub (https://github.com/nbrg-ppcu/prokbert) and HuggingFace (https://huggingface.co/nerualbioinfo) providing an accessible tool for the community.

McGuinness KN, Fehon N, Feehan R, Miller M, Mutter AC, Rybak LA, Nam J, AbuSalim JE, Atkinson JT, Heidari H, Losada N, Kim JD, Koder RL, Lu Y, Silberg JJ, Slusky JSG, Falkowski PG, Nanda V. The energetics and evolution of oxidoreductases in deep time. Proteins 2024;92:52-59. [PMID: 37596815 DOI: 10.1002/prot.26563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 07/06/2023] [Indexed: 08/20/2023]

Affiliation(s)

Kenneth N McGuinness Department of Natural Sciences, Caldwell University, Caldwell, New Jersey, USA Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
Nolan Fehon Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
Ryan Feehan Computational Biology Program, The University of Kansas, Lawrence, Kansas, USA
Michelle Miller Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
Andrew C Mutter Department of Physics, The City College of New York, New York, New York, USA
Laryssa A Rybak Department of Physics, The City College of New York, New York, New York, USA
Justin Nam Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
Jenna E AbuSalim Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
Joshua T Atkinson Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, USA
Hirbod Heidari Department of Chemistry, University of Texas at Austin, Austin, Texas, USA
Natalie Losada Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
J Dongun Kim Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
Ronald L Koder Department of Physics, The City College of New York, New York, New York, USA
Yi Lu Department of Chemistry, University of Texas at Austin, Austin, Texas, USA
Jonathan J Silberg Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, USA
Joanna S G Slusky Computational Biology Program, The University of Kansas, Lawrence, Kansas, USA Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas, USA
Paul G Falkowski Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA Department of Earth and Planetary Sciences, Rutgers University, New Brunswick, New Jersey, USA
Vikas Nanda Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers University, Piscataway, New Jersey, USA

Aliperti L, Aptekmann AA, Farfañuk G, Couso LL, Soler-Bistué A, Sánchez IE. r/K selection of GC content in prokaryotes. Environ Microbiol 2023;25:3255-3268. [PMID: 37813828 DOI: 10.1111/1462-2920.16511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Accepted: 09/16/2023] [Indexed: 10/11/2023]

Marcos-Zambrano LJ, López-Molina VM, Bakir-Gungor B, Frohme M, Karaduzovic-Hadziabdic K, Klammsteiner T, Ibrahimi E, Lahti L, Loncar-Turukalo T, Dhamo X, Simeon A, Nechyporenko A, Pio G, Przymus P, Sampri A, Trajkovik V, Lacruz-Pleguezuelos B, Aasmets O, Araujo R, Anagnostopoulos I, Aydemir Ö, Berland M, Calle ML, Ceci M, Duman H, Gündoğdu A, Havulinna AS, Kaka Bra KHN, Kalluci E, Karav S, Lode D, Lopes MB, May P, Nap B, Nedyalkova M, Paciência I, Pasic L, Pujolassos M, Shigdel R, Susín A, Thiele I, Truică CO, Wilmes P, Yilmaz E, Yousef M, Claesson MJ, Truu J, Carrillo de Santa Pau E. A toolbox of machine learning software to support microbiome analysis. Front Microbiol 2023;14:1250806. [PMID: 38075858 PMCID: PMC10704913 DOI: 10.3389/fmicb.2023.1250806] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/30/2023] [Accepted: 09/11/2023] [Indexed: 05/14/2025] Open

Abstract

The human microbiome has become an area of intense research due to its potential impact on human health. However, the analysis and interpretation of this data have proven to be challenging due to its complexity and high dimensionality. Machine learning (ML) algorithms can process vast amounts of data to uncover informative patterns and relationships within the data, even with limited prior knowledge. Therefore, there has been a rapid growth in the development of software specifically designed for the analysis and interpretation of microbiome data using ML techniques. These software incorporate a wide range of ML algorithms for clustering, classification, regression, or feature selection, to identify microbial patterns and relationships within the data and generate predictive models. This rapid development with a constant need for new developments and integration of new features require efforts into compile, catalog and classify these tools to create infrastructures and services with easy, transparent, and trustable standards. Here we review the state-of-the-art for ML tools applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on ML based software and framework resources currently available for the analysis of microbiome data in humans. The aim is to support microbiologists and biomedical scientists to go deeper into specialized resources that integrate ML techniques and facilitate future benchmarking to create standards for the analysis of microbiome data. The software resources are organized based on the type of analysis they were developed for and the ML techniques they implement. A description of each software with examples of usage is provided including comments about pitfalls and lacks in the usage of software based on ML methods in relation to microbiome data that need to be considered by developers and users. 
This review represents an extensive compilation to date, offering valuable insights and guidance for researchers interested in leveraging ML approaches for microbiome analysis.

Affiliation(s)

Laura Judith Marcos-Zambrano Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
Víctor Manuel López-Molina Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
Burcu Bakir-Gungor Department of Computer Engineering, Abdullah Gül University, Kayseri, Türkiye
Marcus Frohme Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
Kanita Karaduzovic-Hadziabdic Faculty of Engineering and Natural Sciences, International University of Sarajevo, Sarajevo, Bosnia and Herzegovina
Thomas Klammsteiner Department of Microbiology and Department of Ecology, University of Innsbruck, Innsbruck, Austria
Eliana Ibrahimi Department of Biology, University of Tirana, Tirana, Albania
Leo Lahti Department of Computing, University of Turku, Turku, Finland
Tatjana Loncar-Turukalo Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia
Xhilda Dhamo Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
Andrea Simeon BioSense Institute, University of Novi Sad, Novi Sad, Serbia
Alina Nechyporenko Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany Department of Systems Engineering, Kharkiv National University of Radioelectronics, Kharkiv, Ukraine
Gianvito Pio Department of Computer Science, University of Bari Aldo Moro, Bari, Italy Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy
Piotr Przymus Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
Alexia Sampri Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, United Kingdom
Vladimir Trajkovik Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
Blanca Lacruz-Pleguezuelos Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
Oliver Aasmets Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
Ricardo Araujo Nephrology and Infectious Diseases R & D Group, i3S—Instituto de Investigação e Inovação em Saúde; INEB—Instituto de Engenharia Biomédica, Universidade do Porto, Porto, Portugal
Ioannis Anagnostopoulos Department of Informatics, University of Piraeus, Piraeus, Greece Computer Science and Biomedical Informatics Department, University of Thessaly, Lamia, Greece
Önder Aydemir Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Türkiye
Magali Berland INRAE, MetaGenoPolis, Université Paris-Saclay, Jouy-en-Josas, France
M. Luz Calle Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain IRIS-CC, Fundació Institut de Recerca i Innovació en Ciències de la Vida i la Salut a la Catalunya Central, Vic, Barcelona, Spain
Michelangelo Ceci Department of Computer Science, University of Bari Aldo Moro, Bari, Italy Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy
Hatice Duman Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye
Aycan Gündoğdu Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Türkiye Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Türkiye
Aki S. Havulinna Finnish Institute for Health and Welfare - THL, Helsinki, Finland Institute for Molecular Medicine Finland, FIMM-HiLIFE, Helsinki, Finland
Kardokh Hama Najib Kaka Bra Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
Eglantina Kalluci Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
Sercan Karav Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye
Daniel Lode Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
Marta B. Lopes Department of Mathematics, Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
Patrick May Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
Bram Nap School of Medicine, University of Galway, Galway, Ireland
Miroslava Nedyalkova Department of Inorganic Chemistry, Faculty of Chemistry and Pharmacy, University of Sofia, Sofia, Bulgaria
Inês Paciência Center for Environmental and Respiratory Health Research (CERH), Research Unit of Population Health, University of Oulu, Oulu, Finland Biocenter Oulu, University of Oulu, Oulu, Finland
Lejla Pasic Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
Meritxell Pujolassos Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain
Rajesh Shigdel Department of Clinical Science, University of Bergen, Bergen, Norway
Antonio Susín Mathematical Department, UPC-Barcelona Tech, Barcelona, Spain
Ines Thiele School of Medicine, University of Galway, Galway, Ireland APC Microbiome Ireland, University College Cork, Cork, Ireland
Ciprian-Octavian Truică Computer Science and Engineering Department, Faculty of Automatic Control and Computers, National University of Science and Technology Politehnica, Bucharest, Romania
Paul Wilmes Systems Ecology Group, Luxembourg Centre for Systems Biomedicine, Esch-sur-Alzette, Luxembourg Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg, Belvaux, Luxembourg
Ercument Yilmaz Department of Computer Technologies, Karadeniz Technical University, Trabzon, Türkiye
Malik Yousef Department of Information Systems, Zefat Academic College, Zefat, Israel Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
Marcus Joakim Claesson APC Microbiome Ireland, University College Cork, Cork, Ireland School of Microbiology, University College Cork, Cork, Ireland
Jaak Truu Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
Enrique Carrillo de Santa Pau Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain


Ma B, Lu C, Wang Y, Yu J, Zhao K, Xue R, Ren H, Lv X, Pan R, Zhang J, Zhu Y, Xu J. A genomic catalogue of soil microbiomes boosts mining of biodiversity and genetic resources. Nat Commun 2023;14:7318. [PMID: 37951952 PMCID: PMC10640626 DOI: 10.1038/s41467-023-43000-z] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 06/08/2023] [Accepted: 10/27/2023] [Indexed: 11/14/2023] Open

Affiliation(s)

Bin Ma Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
Caiyu Lu Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
Yiling Wang Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
Jingwen Yu ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
Kankan Zhao Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China
Ran Xue ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
Hao Ren ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
Xiaofei Lv Department of Environmental Engineering, China Jiliang University, Hangzhou, 310018, China
Ronghui Pan ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
Jiabao Zhang State Key Laboratory of Soil and Sustainable Agriculture, Institute of Soil Science, Chinese Academy of Sciences, Nanjing, 210008, China
Yongguan Zhu Research Center for Eco-environmental Sciences, Chinese Academy of Sciences, Beijing, 100085, China
Jianming Xu Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, 310058, China Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, 310058, China


Benegas G, Batra SS, Song YS. DNA language models are powerful predictors of genome-wide variant effects. Proc Natl Acad Sci U S A 2023;120:e2311219120. [PMID: 37883436 PMCID: PMC10622914 DOI: 10.1073/pnas.2311219120] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 07/03/2023] [Accepted: 09/08/2023] [Indexed: 10/28/2023] Open

Mahlich Y, Zhu C, Chung H, Velaga PK, De Paolis Kaluza M, Radivojac P, Friedberg I, Bromberg Y. Learning from the unknown: exploring the range of bacterial functionality. Nucleic Acids Res 2023;51:10162-10175. [PMID: 37739408 PMCID: PMC10602916 DOI: 10.1093/nar/gkad757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2023] [Accepted: 09/11/2023] [Indexed: 09/24/2023] Open

Medina-Chávez NO, Viladomat-Jasso M, Zarza E, Islas-Robles A, Valdivia-Anistro J, Thalasso-Siret F, Eguiarte LE, Olmedo-Álvarez G, Souza V, De la Torre-Zavala S. A Transiently Hypersaline Microbial Mat Harbors a Diverse and Stable Archaeal Community in the Cuatro Cienegas Basin, Mexico. Astrobiology 2023;23:796-811. [PMID: 37279013 DOI: 10.1089/ast.2021.0047] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/07/2023]

Lu J, Xiong R, Tian J, Wang C, Sun F. Deep learning to estimate lithium-ion battery state of health without additional degradation experiments. Nat Commun 2023;14:2760. [PMID: 37179411 PMCID: PMC10183024 DOI: 10.1038/s41467-023-38458-w] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 09/16/2022] [Accepted: 05/03/2023] [Indexed: 05/15/2023] Open

'Small Data' for big insights in ecology. Trends Ecol Evol 2023:S0169-5347(23)00019-8. [PMID: 36797167 DOI: 10.1016/j.tree.2023.01.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2022] [Revised: 01/18/2023] [Accepted: 01/25/2023] [Indexed: 02/17/2023]

Yang X, Qin S, Liu X, Zhang N, Chen J, Jin M, Liu F, Wang Y, Guo J, Shi H, Wang C, Chen Y. Meta-Viromic Sequencing Reveals Virome Characteristics of Mosquitoes and Culicoides on Zhoushan Island, China. Microbiol Spectr 2023;11:e0268822. [PMID: 36651764 PMCID: PMC9927462 DOI: 10.1128/spectrum.02688-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023] Open

Abstract

Mosquitoes and biting Culicoides species are arbovirus vectors. Effective virome profile surveillance is essential for the prevention and control of insect-borne diseases. From June to September 2021, we collected eight species of female mosquito and Culicoides on Zhoushan Island, China, and used meta-viromic sequencing to analyze their virome compositions and characteristics. The classified virus reads were distributed across 191 genera in 66 families. The virus families with the largest proportions of sequences in mosquitoes were Iflaviridae (30.03%), Phasmaviridae (23.09%), Xinmoviridae (21.82%), Flaviviridae (13.44%), and Rhabdoviridae (8.40%). Single-strand RNA⁺ viruses formed the largest proportion of viruses in all samples. Blood meal analysis indicated that the hosts of the blood-sucking mosquitoes were mainly chicken, duck, pig, and human, broadly consistent with the habitats where the mosquitoes were collected. Novel viruses of the Orthobunyavirus, Narnavirus, and Iflavirus genera were found in Culicoides by de novo assembly. The viruses with vertebrate hosts carried by mosquitoes and Culicoides also varied widely. The analysis of unclassified viruses and deep-learning analysis of the "dark matter" in the meta-viromic sequencing data revealed the presence of a large number of unknown viruses. IMPORTANCE The monitoring of the viromes of mosquitoes and Culicoides, widely distributed arbovirus transmission vectors, is crucial to evaluate the risk of infectious disease transmission. In this study, the compositions of the viromes of mosquitoes and Culicoides on Zhoushan Island varied widely and were related mainly to the host species, with different host species having different core viromes. Many unknown sequences in the Culicoides viromes remain to be annotated, suggesting the presence of a large number of unknown viruses.
