1
Feng X, Liu Z, Mo Y, Zhang S, Ma XX. Role of nucleotide pair frequency and synonymous codon usage in the evolution of bovine viral diarrhea virus. Arch Virol 2025; 170:64. [PMID: 40011265 DOI: 10.1007/s00705-025-06250-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2024] [Accepted: 11/26/2024] [Indexed: 02/28/2025]
Abstract
Synonymous codon usage plays an important role in the adaptation of viruses to their hosts. Bovine viral diarrhea virus (BVDV) relies on a high mutation rate in its genome to achieve the necessary fitness in a particular host. However, the question of which selective forces influence nucleotide pair and synonymous codon usage patterns in different BVDV genotypes remains unresolved. Here, 169 BVDV strains isolated at different times in various countries were analyzed to compare their dinucleotide frequency and synonymous codon usage. Examination of the nucleotide usage pattern in the open reading frame (ORF) of BVDV revealed a significantly higher frequency of purine than pyrimidine, with the highest extent of nucleotide usage bias observed in the first codon position. Moreover, a nucleotide pair bias, especially favoring CpG dinucleotides, was observed in all of the genotypes. Together, the nucleotide composition constraints and nucleotide pair bias appear to have influenced the overall codon usage pattern. Nucleotide pair and synonymous codon usage biases were associated with individual genotypes to different degrees. Of particular note, BVDV-1 exhibited more variation in its nucleotide pair and synonymous codon usage than BVDV-2 and BVDV-3, suggesting that these patterns are shaped both by selection of mutations in the viral genome and translational selection in the host.
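The abstract above does not say which statistic was used to measure nucleotide pair bias. As a minimal sketch, assuming the commonly used observed/expected ratio (the frequency of a dinucleotide divided by the product of its two base frequencies), the calculation could look like the following. The function name `dinucleotide_odds_ratios` is hypothetical and is not taken from the paper.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq: str) -> dict[str, float]:
    """Observed/expected ratio for every dinucleotide present in seq.

    Assumption: ratio = f(XY) / (f(X) * f(Y)), computed over overlapping
    pairs on one strand. Characters are counted as-is, so ambiguity codes
    such as N should be filtered out beforehand if they matter.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")

    # Single-base frequencies over the whole sequence.
    base_freq = {base: count / n for base, count in Counter(seq).items()}

    # Overlapping dinucleotide counts: positions (0,1), (1,2), ...
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    total_pairs = n - 1

    return {
        pair: (count / total_pairs) / (base_freq[pair[0]] * base_freq[pair[1]])
        for pair, count in pair_counts.items()
    }
```

Under this definition a ratio well above 1 indicates an over-represented pair and a ratio well below 1 an under-represented one. The thresholds used to call a bias are a separate choice that this sketch does not make.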
Affiliation(s)
- Xili Feng
- Key Laboratory of Biotechnology and Bioengineering of State Ethnic Affairs Commission, Biomedical Research Center, Northwest Minzu University, Lanzhou, 730030, China
- Key Laboratory of Special Animal Epidemic Disease, Ministry of Agriculture, Institute of Special Animal and Plant Sciences, Chinese Academy of Agricultural Sciences, Changchun, China
- Zeyu Liu
- Key Laboratory of Biotechnology and Bioengineering of State Ethnic Affairs Commission, Biomedical Research Center, Northwest Minzu University, Lanzhou, 730030, China
- Yongli Mo
- Key Laboratory of Biotechnology and Bioengineering of State Ethnic Affairs Commission, Biomedical Research Center, Northwest Minzu University, Lanzhou, 730030, China
- Shubin Zhang
- Key Laboratory of Biotechnology and Bioengineering of State Ethnic Affairs Commission, Biomedical Research Center, Northwest Minzu University, Lanzhou, 730030, China
- Xiao-Xia Ma
- Key Laboratory of Biotechnology and Bioengineering of State Ethnic Affairs Commission, Biomedical Research Center, Northwest Minzu University, Lanzhou, 730030, China
2
Yin H, Wu S, Tan J, Guo Q, Li M, Guo J, Wang Y, Jiang X, Zhu H. IPEV: identification of prokaryotic and eukaryotic virus-derived sequences in virome using deep learning. Gigascience 2024; 13:giae018. [PMID: 38649300 PMCID: PMC11034026 DOI: 10.1093/gigascience/giae018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 03/14/2024] [Accepted: 03/25/2024] [Indexed: 04/25/2024] Open
Abstract
BACKGROUND: The virome obtained through virus-like particle enrichment contains a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial to understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges in developing high-performance tools for classification, potentially limiting downstream analyses.
FINDINGS: We present IPEV, a novel method to distinguish prokaryotic and eukaryotic viruses in viromes, with a 2-dimensional convolutional neural network combining trinucleotide pair relative distance and frequency. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in accuracy on marine and gut virome samples based on annotations by sequence alignments. IPEV reduces runtime by at most 1,225 times compared to existing methods under the same computing configuration. We also utilized IPEV to analyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals.
CONCLUSIONS: IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV.
Affiliation(s)
- Hengchuang Yin
- Department of Biomedical Engineering, College of Future Technology, and Center for Quantitative Biology, Peking University, Beijing 100871, China
- Shufang Wu
- Department of Biomedical Engineering, College of Future Technology, and Center for Quantitative Biology, Peking University, Beijing 100871, China
- Jie Tan
- Department of Biomedical Engineering, College of Future Technology, and Center for Quantitative Biology, Peking University, Beijing 100871, China
- Qian Guo
- Department of Biomedical Engineering, College of Future Technology, and Center for Quantitative Biology, Peking University, Beijing 100871, China
- Mo Li
- Department of Biomedical Engineering, College of Future Technology, and Center for Quantitative Biology, Peking University, Beijing 100871, China
- School of Life Sciences, Peking University, Beijing 100871, China
- Jinyuan Guo
- Department of Biomedical Engineering, College of Future Technology, and Center for Quantitative Biology, Peking University, Beijing 100871, China
- Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
- Yaqi Wang
- Department of Biomedical Engineering, College of Future Technology, and Center for Quantitative Biology, Peking University, Beijing 100871, China
- Xiaoqing Jiang
- Department of Biomedical Engineering, College of Future Technology, and Center for Quantitative Biology, Peking University, Beijing 100871, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, and China National Center for Bioinformation, Beijing 100101, China
- Huaiqiu Zhu
- Department of Biomedical Engineering, College of Future Technology, and Center for Quantitative Biology, Peking University, Beijing 100871, China
- School of Life Sciences, Peking University, Beijing 100871, China
- Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
3
Ding Y, Zhao L, Wang G, Shi Y, Guo G, Liu C, Chen Z, Coker OO, She J, Yu J. PacBio sequencing of human fecal samples uncovers the DNA methylation landscape of 22 673 gut phages. Nucleic Acids Res 2023; 51:12140-12149. [PMID: 37904586 PMCID: PMC10711547 DOI: 10.1093/nar/gkad977] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Revised: 10/03/2023] [Accepted: 10/18/2023] [Indexed: 11/01/2023] Open
Abstract
Gut phages have an important impact on human health. Methylation plays key roles in DNA recognition, gene expression regulation and replication for phages. However, the DNA methylation landscape of gut phages is largely unknown. Here, with PacBio sequencing (2120×, 4785 Gb), we detected gut phage methylation landscape based on 22 673 gut phage genomes, and presented diverse methylation motifs and methylation differences in genomic elements. Moreover, the methylation rate of phages was associated with taxonomy and host, and N6-methyladenine methylation rate was higher in temperate phages than in virulent phages, suggesting an important role for methylation in phage-host interaction. In particular, 3543 (15.63%) phage genomes contained restriction-modification system, which could aid in evading clearance by the host. This study revealed the DNA methylation landscape of gut phage and its potential roles, which will advance the understanding of gut phage survival and human health.
Affiliation(s)
- Yanqiang Ding
- Institute of Digestive Disease and Department of Medicine and Therapeutics, State Key Laboratory of Digestive Disease, Li Ka Shing Institute of Health Sciences, CUHK-Shenzhen Research Institute, The Chinese University of Hong Kong, Hong Kong SAR, China
- Liuyang Zhao
- Institute of Digestive Disease and Department of Medicine and Therapeutics, State Key Laboratory of Digestive Disease, Li Ka Shing Institute of Health Sciences, CUHK-Shenzhen Research Institute, The Chinese University of Hong Kong, Hong Kong SAR, China
- Guoping Wang
- Institute of Digestive Disease and Department of Medicine and Therapeutics, State Key Laboratory of Digestive Disease, Li Ka Shing Institute of Health Sciences, CUHK-Shenzhen Research Institute, The Chinese University of Hong Kong, Hong Kong SAR, China
- Yu Shi
- Institute of Digestive Disease and Department of Medicine and Therapeutics, State Key Laboratory of Digestive Disease, Li Ka Shing Institute of Health Sciences, CUHK-Shenzhen Research Institute, The Chinese University of Hong Kong, Hong Kong SAR, China
- Gang Guo
- Center for Gut Microbiome Research, Department of Surgery, Med-X Institute, Department of High Talent, The First Affiliated Hospital of Xi’an Jiaotong University, Xi’an, China
- Changan Liu
- Institute of Digestive Disease and Department of Medicine and Therapeutics, State Key Laboratory of Digestive Disease, Li Ka Shing Institute of Health Sciences, CUHK-Shenzhen Research Institute, The Chinese University of Hong Kong, Hong Kong SAR, China
- Zigui Chen
- Department of Microbiology, The Chinese University of Hong Kong, Hong Kong SAR, China
- Olabisi Oluwabukola Coker
- Institute of Digestive Disease and Department of Medicine and Therapeutics, State Key Laboratory of Digestive Disease, Li Ka Shing Institute of Health Sciences, CUHK-Shenzhen Research Institute, The Chinese University of Hong Kong, Hong Kong SAR, China
- Junjun She
- Center for Gut Microbiome Research, Department of Surgery, Med-X Institute, Department of High Talent, The First Affiliated Hospital of Xi’an Jiaotong University, Xi’an, China
- Jun Yu
- Institute of Digestive Disease and Department of Medicine and Therapeutics, State Key Laboratory of Digestive Disease, Li Ka Shing Institute of Health Sciences, CUHK-Shenzhen Research Institute, The Chinese University of Hong Kong, Hong Kong SAR, China
4
Gonzalez-Isunza G, Jawaid MZ, Liu P, Cox DL, Vazquez M, Arsuaga J. Using machine learning to detect coronaviruses potentially infectious to humans. Sci Rep 2023; 13:9319. [PMID: 37291260 PMCID: PMC10248971 DOI: 10.1038/s41598-023-35861-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Accepted: 05/24/2023] [Indexed: 06/10/2023] Open
Abstract
Establishing the host range for novel viruses remains a challenge. Here, we address the challenge of identifying non-human animal coronaviruses that may infect humans by creating an artificial neural network model that learns from spike protein sequences of alpha and beta coronaviruses and their binding annotation to their host receptor. The proposed method produces a human-Binding Potential (h-BiP) score that distinguishes, with high accuracy, the binding potential among coronaviruses. Three viruses, previously unknown to bind human receptors, were identified: Bat coronavirus BtCoV/133/2005 and Pipistrellus abramus bat coronavirus HKU5-related (both MERS related viruses), and Rhinolophus affinis coronavirus isolate LYRa3 (a SARS related virus). We further analyze the binding properties of BtCoV/133/2005 and LYRa3 using molecular dynamics. To test whether this model can be used for surveillance of novel coronaviruses, we re-trained the model on a set that excludes SARS-CoV-2 and all viral sequences released after the SARS-CoV-2 was published. The results predict the binding of SARS-CoV-2 with a human receptor, indicating that machine learning methods are an excellent tool for the prediction of host expansion events.
Affiliation(s)
- M Zaki Jawaid
- Department of Physics, University of California, Davis, USA
- Pengyu Liu
- Department of Microbiology and Molecular Genetics, University of California, Davis, CA, USA
- Daniel L Cox
- Department of Physics, University of California, Davis, USA
- Mariel Vazquez
- Department of Microbiology and Molecular Genetics, University of California, Davis, CA, USA
- Department of Mathematics, University of California, Davis, CA, USA
- Javier Arsuaga
- Department of Molecular and Cellular Biology, University of California, Davis, CA, USA
- Department of Mathematics, University of California, Davis, CA, USA
5
Sadad T, Aurangzeb RA, Safran M, Alfarhood S, Kim J. Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models. Biomedicines 2023; 11:1323. [PMID: 37238994 DOI: 10.3390/biomedicines11051323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 04/18/2023] [Accepted: 04/25/2023] [Indexed: 05/28/2023] Open
Abstract
Viruses infect millions of people worldwide each year, and some can lead to cancer or increase the risk of cancer. As viruses have highly mutable genomes, new viruses may emerge in the future, such as COVID-19 and influenza. Traditional virology relies on predefined rules to identify viruses, but new viruses may be completely or partially divergent from the reference genome, rendering statistical methods and similarity calculations insufficient for all genome sequences. Identifying DNA/RNA-based viral sequences is a crucial step in differentiating different types of lethal pathogens, including their variants and strains. While various tools in bioinformatics can align them, expert biologists are required to interpret the results. Computational virology is a scientific field that studies viruses, their origins, and drug discovery, where machine learning plays a crucial role in extracting domain- and task-specific features to tackle this challenge. This paper proposes a genome analysis system that uses advanced deep learning to identify dozens of viruses. The system uses nucleotide sequences from the NCBI GenBank database and a BERT tokenizer to extract features from the sequences by breaking them down into tokens. We also generated synthetic data for viruses with small sample sizes. The proposed system has two components: a scratch BERT architecture specifically designed for DNA analysis, which is used to learn the next codons unsupervised, and a classifier that identifies important features and understands the relationship between genotype and phenotype. Our system achieved an accuracy of 97.69% in identifying viral sequences.
Affiliation(s)
- Tariq Sadad
- Department of Computer Science, University of Engineering & Technology, Mardan 23200, Pakistan
- Raja Atif Aurangzeb
- Department of Computer Science & Software Engineering, International Islamic University Islamabad, Islamabad 44000, Pakistan
- Mejdl Safran
- Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
- Sultan Alfarhood
- Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
- Jungsuk Kim
- Department of Biomedical Engineering, Gachon University, Seongnam-si 13120, Republic of Korea
6
Roux S, Camargo AP, Coutinho FH, Dabdoub SM, Dutilh BE, Nayfach S, Tritt A. iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol 2023; 21:e3002083. [PMID: 37083735 PMCID: PMC10155999 DOI: 10.1371/journal.pbio.3002083] [Citation(s) in RCA: 117] [Impact Index Per Article: 58.5] [Received: 08/30/2022] [Revised: 05/03/2023] [Accepted: 03/15/2023] [Indexed: 04/22/2023] Open
Abstract
The extraordinary diversity of viruses infecting bacteria and archaea is now primarily studied through metagenomics. While metagenomes enable high-throughput exploration of the viral sequence space, metagenome-derived sequences lack key information compared to isolated viruses, in particular host association. Different computational approaches are available to predict the host(s) of uncultivated viruses based on their genome sequences, but thus far individual approaches are limited either in precision or in recall, i.e., for a number of viruses they yield erroneous predictions or no prediction at all. Here, we describe iPHoP, a two-step framework that integrates multiple methods to reliably predict host taxonomy at the genus rank for a broad range of viruses infecting bacteria and archaea, while retaining a low false discovery rate. Based on a large dataset of metagenome-derived virus genomes from the IMG/VR database, we illustrate how iPHoP can provide extensive host prediction and guide further characterization of uncultivated viruses.
Affiliation(s)
- Simon Roux
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Antonio Pedro Camargo
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Shareef M Dabdoub
- Division of Biostatistics and Computational Biology, University of Iowa College of Dentistry, Iowa City, Iowa, United States of America
- Bas E Dutilh
- Institute of Biodiversity, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University, Jena, Germany
- Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, the Netherlands
- Stephen Nayfach
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Andrew Tritt
- Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
7
Pu F, Wang R, Yang X, Hu X, Wang J, Zhang L, Zhao Y, Zhang D, Liu Z, Liu J. Nucleotide and codon usage biases involved in the evolution of African swine fever virus: A comparative genomics analysis. J Basic Microbiol 2023; 63:499-518. [PMID: 36782108 DOI: 10.1002/jobm.202200624] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/28/2022] [Revised: 01/05/2023] [Accepted: 01/21/2023] [Indexed: 02/15/2023]
Abstract
Since African swine fever virus (ASFV) replication is closely related to its host's machinery, codon usage of viral genome can be subject to selection pressures. A better understanding of codon usage can give new insights into viral evolution. We implemented information entropy and revealed that the nucleotide usage pattern of ASFV is significantly associated with viral isolation factors (region and time), especially the usages of thymine and cytosine. Despite the domination of adenine and thymine in the viral genome, we found that mutation pressure alters the overall codon usage pattern of ASFV, followed by selective forces from natural selection. Moreover, the nucleotide skew index at the gene level indicates that nucleotide usages influencing synonymous codon bias of ASFV are significantly correlated with viral protein hydropathy. Finally, evolutionary plasticity is proved to contribute to the weakness in synonymous codons with A- or T-end serving as optimal codons of ASFV, suggesting that fine-tuning translation selection plays a role in synonymous codon usages of ASFV for adapting host. Taken together, ASFV is subject to evolutionary dynamics on nucleotide selections and synonymous codon usage, and our detailed analysis offers deeper insights into the genetic characteristics of this newly emerging virus around the world.
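The abstract above mentions information entropy without giving the formulation. A minimal sketch, assuming plain Shannon entropy over the A/C/G/T frequencies of a sequence (in bits), is shown below. The function name `nucleotide_entropy` is hypothetical, and the authors' actual measure may differ, for example by being computed per codon position or per alignment site.

```python
import math
from collections import Counter


def nucleotide_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the nucleotide composition of seq.

    Assumption: H = -sum(p_b * log2(p_b)) over b in {A, C, G, T};
    any other character (gaps, ambiguity codes) is ignored.
    """
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")

    # Bases with zero count never enter the Counter, so log2(0) cannot occur.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

With this definition the value ranges from 0 bits (a single nucleotide only) to 2 bits (all four nucleotides equally frequent), so lower values indicate a stronger compositional bias.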
Affiliation(s)
- Feiyang Pu
- Biomedical Research Center, Northwest Minzu University, Lanzhou, China; College of Life Science and Engineering, Northwest Minzu University, Lanzhou, Gansu, China
- Rui Wang
- Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Xuanye Yang
- Biomedical Research Center, Northwest Minzu University, Lanzhou, China; College of Life Science and Engineering, Northwest Minzu University, Lanzhou, Gansu, China
- Xinyan Hu
- Biomedical Research Center, Northwest Minzu University, Lanzhou, China; College of Life Science and Engineering, Northwest Minzu University, Lanzhou, Gansu, China
- Jinqian Wang
- Biomedical Research Center, Northwest Minzu University, Lanzhou, China; College of Life Science and Engineering, Northwest Minzu University, Lanzhou, Gansu, China
- Lijuan Zhang
- College of Life Science and Engineering, Northwest Minzu University, Lanzhou, Gansu, China
- Yongqing Zhao
- Biomedical Research Center, Northwest Minzu University, Lanzhou, China; College of Life Science and Engineering, Northwest Minzu University, Lanzhou, Gansu, China
- Derong Zhang
- Biomedical Research Center, Northwest Minzu University, Lanzhou, China; College of Life Science and Engineering, Northwest Minzu University, Lanzhou, Gansu, China
- Zewen Liu
- Biomedical Research Center, Northwest Minzu University, Lanzhou, China; College of Life Science and Engineering, Northwest Minzu University, Lanzhou, Gansu, China
- Junlin Liu
- Biomedical Research Center, Northwest Minzu University, Lanzhou, China; College of Life Science and Engineering, Northwest Minzu University, Lanzhou, Gansu, China
8
Bajiya N, Dhall A, Aggarwal S, Raghava GPS. Advances in the field of phage-based therapy with special emphasis on computational resources. Brief Bioinform 2023; 24:6961791. [PMID: 36575815 DOI: 10.1093/bib/bbac574] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/28/2022] [Revised: 11/07/2022] [Accepted: 11/25/2022] [Indexed: 12/29/2022] Open
Abstract
In the current era, one of the major challenges is to manage the treatment of drug/antibiotic-resistant strains of bacteria. Phage therapy, a century-old technique, may serve as an alternative to antibiotics in treating bacterial infections caused by drug-resistant strains of bacteria. In this review, a systematic attempt has been made to summarize phage-based therapy in depth. This review has been divided into the following two sections: general information and computer-aided phage therapy (CAPT). In the case of general information, we cover the history of phage therapy, the mechanism of action, the status of phage-based products (approved and clinical trials) and the challenges. This review emphasizes CAPT, where we have covered primary phage-associated resources, phage prediction methods and pipelines. This review covers a wide range of databases and resources, including viral genomes and proteins, phage receptors, host genomes of phages, phage-host interactions and lytic proteins. In the post-genomic era, identifying the most suitable phage for lysing a drug-resistant strain of bacterium is crucial for developing alternate treatments for drug-resistant bacteria and this remains a challenging problem. Thus, we compile all phage-associated prediction methods that include the prediction of phages for a bacterial strain, the host for a phage and the identification of interacting phage-host pairs. Most of these methods have been developed using machine learning and deep learning techniques. This review also discussed recent advances in the field of CAPT, where we briefly describe computational tools available for predicting phage virions, the life cycle of phages and prophage identification. Finally, we describe phage-based therapy's advantages, challenges and opportunities.
Affiliation(s)
- Nisha Bajiya
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
- Anjali Dhall
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
- Suchet Aggarwal
- Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
- Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
9
Iuchi H, Kawasaki J, Kubo K, Fukunaga T, Hokao K, Yokoyama G, Ichinose A, Suga K, Hamada M. Bioinformatics approaches for unveiling virus-host interactions. Comput Struct Biotechnol J 2023; 21:1774-1784. [PMID: 36874163 PMCID: PMC9969756 DOI: 10.1016/j.csbj.2023.02.044] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/17/2022] [Revised: 02/22/2023] [Accepted: 02/22/2023] [Indexed: 03/03/2023] Open
Abstract
The coronavirus disease-2019 (COVID-19) pandemic has elucidated major limitations in the capacity of medical and research institutions to appropriately manage emerging infectious diseases. We can improve our understanding of infectious diseases by unveiling virus-host interactions through host range prediction and protein-protein interaction prediction. Although many algorithms have been developed to predict virus-host interactions, numerous issues remain to be solved, and the entire network remains veiled. In this review, we comprehensively surveyed algorithms used to predict virus-host interactions. We also discuss the current challenges, such as dataset biases toward highly pathogenic viruses, and the potential solutions. The complete prediction of virus-host interactions remains difficult; however, bioinformatics can contribute to progress in research on infectious diseases and human health.
Affiliation(s)
- Hitoshi Iuchi
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan; Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Junna Kawasaki
- Faculty of Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
- Kento Kubo
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan; School of Advanced Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
- Tsukasa Fukunaga
- Waseda Institute for Advanced Study, Waseda University, Nishi Waseda, Shinjuku-ku, Tokyo 169-0051, Japan
- Koki Hokao
- School of Advanced Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
- Gentaro Yokoyama
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan; School of Advanced Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
- Akiko Ichinose
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Kanta Suga
- School of Advanced Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
- Michiaki Hamada
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan; Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan; School of Advanced Science and Engineering, Waseda University, Okubo Shinjuku-ku, Tokyo 169-8555, Japan; Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan
10
Bartoszewicz JM, Nasri F, Nowicka M, Renard BY. Detecting DNA of novel fungal pathogens using ResNets and a curated fungi-hosts data collection. Bioinformatics 2022; 38:ii168-ii174. [PMID: 36124807 DOI: 10.1093/bioinformatics/btac495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 07/08/2022] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND: Emerging pathogens are a growing threat, but large data collections and approaches for predicting the risk associated with novel agents are limited to bacteria and viruses. Pathogenic fungi, which also pose a constant threat to public health, remain understudied. Relevant data remain comparatively scarce and scattered among many different sources, hindering the development of sequencing-based detection workflows for novel fungal pathogens. No prediction method working for agents across all three groups is available, even though the cause of an infection is often difficult to identify from symptoms alone.
RESULTS: We present a curated collection of fungal host range data, comprising records on human, animal and plant pathogens, as well as other plant-associated fungi, linked to publicly available genomes. We show that it can be used to predict the pathogenic potential of novel fungal species directly from DNA sequences with either sequence homology or deep learning. We develop learned, numerical representations of the collected genomes and visualize the landscape of fungal pathogenicity. Finally, we train multi-class models predicting if next-generation sequencing reads originate from novel fungal, bacterial or viral threats.
CONCLUSIONS: The neural networks trained using our data collection enable accurate detection of novel fungal pathogens. A curated set of over 1400 genomes with host and pathogenicity metadata supports training of machine-learning models and sequence comparison, not limited to the pathogen detection task.
AVAILABILITY AND IMPLEMENTATION: The data, models and code are hosted at https://zenodo.org/record/5846345, https://zenodo.org/record/5711877 and https://gitlab.com/dacs-hpi/deepac.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jakub M Bartoszewicz
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany; Department of Mathematics and Computer Science, Free University of Berlin, Berlin 14195, Germany
- Ferdous Nasri
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany; Department of Mathematics and Computer Science, Free University of Berlin, Berlin 14195, Germany
| | - Melania Nowicka
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany.,Department of Mathematics and Computer Science, Free University of Berlin, Berlin 14195, Germany
| | - Bernhard Y Renard
- Hasso Plattner Institute for Digital Engineering, Digital Engineering Faculty, University of Potsdam, Potsdam 14482, Germany
| |
Collapse
|
11
|
Andrade-Martínez JS, Camelo Valera LC, Chica Cárdenas LA, Forero-Junco L, López-Leal G, Moreno-Gallego JL, Rangel-Pineros G, Reyes A. Computational Tools for the Analysis of Uncultivated Phage Genomes. Microbiol Mol Biol Rev 2022; 86:e0000421. [PMID: 35311574 PMCID: PMC9199400 DOI: 10.1128/mmbr.00004-21] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 12/15/2022] Open
Abstract
Over a century of bacteriophage research has uncovered a plethora of fundamental aspects of their biology, ecology, and evolution. Furthermore, the introduction of community-level studies through metagenomics has revealed unprecedented insights on the impact that phages have on a range of ecological and physiological processes. It was not until the introduction of viral metagenomics that we began to grasp the astonishing breadth of genetic diversity encompassed by phage genomes. Novel phage genomes have been reported from a diverse range of biomes at an increasing rate, which has prompted the development of computational tools that support the multilevel characterization of these novel phages based solely on their genome sequences. The impact of these technologies has been so large that, together with MAGs (Metagenomic Assembled Genomes), we now have UViGs (Uncultivated Viral Genomes), which are now officially recognized by the International Committee for the Taxonomy of Viruses (ICTV), and new taxonomic groups can now be created based exclusively on genomic sequence information. Even though the available tools have immensely contributed to our knowledge of phage diversity and ecology, the ongoing surge in software programs makes it challenging to keep up with them and the purpose each one is designed for. Therefore, in this review, we describe a comprehensive set of currently available computational tools designed for the characterization of phage genome sequences, focusing on five specific analyses: (i) assembly and identification of phage and prophage sequences, (ii) phage genome annotation, (iii) phage taxonomic classification, (iv) phage-host interaction analysis, and (v) phage microdiversity.
Affiliation(s)
- Juan Sebastián Andrade-Martínez
  - Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- Laura Carolina Camelo Valera
  - Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- Luis Alberto Chica Cárdenas
  - Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- Laura Forero-Junco
  - Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
  - Department of Plant and Environmental Science, University of Copenhagen, Frederiksberg, Denmark
- Gamaliel López-Leal
  - Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
- J. Leonardo Moreno-Gallego
  - Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
  - Department of Microbiome Science, Max Planck Institute for Developmental Biology, Tübingen, Germany
- Guillermo Rangel-Pineros
  - Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
  - The GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Alejandro Reyes
  - Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri, USA

12
Versoza CJ, Pfeifer SP. Computational Prediction of Bacteriophage Host Ranges. Microorganisms 2022; 10:149. [PMID: 35056598 PMCID: PMC8778386 DOI: 10.3390/microorganisms10010149] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 12/17/2021] [Revised: 01/06/2022] [Accepted: 01/11/2022] [Indexed: 12/27/2022] Open
Abstract
Increased antibiotic resistance has prompted the development of bacteriophage agents for a multitude of applications in agriculture, biotechnology, and medicine. A key factor in the choice of agents for these applications is the host range of a bacteriophage, i.e., the bacterial genera, species, and strains a bacteriophage is able to infect. Although experimental explorations of host ranges remain the gold standard, such investigations are inherently limited to a small number of viruses and bacteria amenable to cultivation. Here, we review recently developed bioinformatic tools that offer a promising and high-throughput alternative by computationally predicting the putative host ranges of bacteriophages, including those challenging to grow in laboratory environments.
Affiliation(s)
- Cyril J. Versoza
  - Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ 85281, USA
- Susanne P. Pfeifer
  - Center for Mechanisms of Evolution, School of Life Sciences, Arizona State University, Tempe, AZ 85281, USA

13
Wu S, Fang Z, Tan J, Li M, Wang C, Guo Q, Xu C, Jiang X, Zhu H. DeePhage: distinguishing virulent and temperate phage-derived sequences in metavirome data with a deep learning approach. Gigascience 2021; 10:giab056. [PMID: 34498685 PMCID: PMC8427542 DOI: 10.1093/gigascience/giab056] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND: Prokaryotic viruses referred to as phages can be divided into virulent and temperate phages. Distinguishing virulent and temperate phage-derived sequences in metavirome data is important for elucidating their different roles in interactions with bacterial hosts and regulation of microbial communities. However, there is no experimental or computational approach to effectively classify their sequences in culture-independent metavirome. We present a new computational method, DeePhage, which can directly and rapidly judge each read or contig as a virulent or temperate phage-derived fragment. FINDINGS: DeePhage uses a "one-hot" encoding form to represent DNA sequences in detail. Sequence signatures are detected via a convolutional neural network to obtain valuable local features. The accuracy of DeePhage on 5-fold cross-validation reaches as high as 89%, nearly 10% and 30% higher than that of 2 similar tools, PhagePred and PHACTS. On real metavirome, DeePhage correctly predicts the highest proportion of contigs when using BLAST as annotation, without apparent preferences. Besides, DeePhage reduces running time vs PhagePred and PHACTS by 245 and 810 times, respectively, under the same computational configuration. By direct detection of the temperate viral fragments from metagenome and metavirome, we furthermore propose a new strategy to explore phage transformations in the microbial community. The ability to detect such transformations provides us a new insight into the potential treatment for human disease. CONCLUSIONS: DeePhage is a novel tool developed to rapidly and efficiently identify 2 kinds of phage fragments especially for metagenomics analysis. DeePhage is freely available via http://cqb.pku.edu.cn/ZhuLab/DeePhage or https://github.com/shufangwu/DeePhage.
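Illustrative note: the "one-hot" encoding mentioned in the abstract above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions. The function name `one_hot_encode`, the A/C/G/T column order, and the all-zero row for ambiguous bases such as N are choices made for this example and are not taken from the DeePhage source code.

```python
# Hypothetical helper for illustration only; not DeePhage's actual implementation.
BASES = "ACGT"  # assumed column order of the indicator vector


def one_hot_encode(seq: str) -> list[list[int]]:
    """Return one 4-element indicator row per base of `seq`.

    Bases outside A/C/G/T (for example N) become an all-zero row,
    which is an assumption of this sketch.
    """
    index = {base: i for i, base in enumerate(BASES)}
    rows = []
    for ch in seq.upper():
        row = [0] * len(BASES)
        if ch in index:
            row[index[ch]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for base, row in zip("ACGN", one_hot_encode("ACGN")):
        print(base, row)
```

A matrix of this shape (sequence length by 4) is the kind of input a convolutional network can scan for local sequence signatures.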
Affiliation(s)
- Shufang Wu
  - State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, Beijing, China
  - Center for Quantitative Biology, Peking University, Beijing 100871, Beijing, China
- Zhencheng Fang
  - State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, Beijing, China
  - Center for Quantitative Biology, Peking University, Beijing 100871, Beijing, China
- Jie Tan
  - State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, Beijing, China
  - Center for Quantitative Biology, Peking University, Beijing 100871, Beijing, China
- Mo Li
  - Peking University-Tsinghua University - National Institute of Biological Sciences (PTN) joint PhD program, School of Life Sciences, Peking University, Beijing 100871, Beijing, China
- Chunhui Wang
  - Peking University-Tsinghua University - National Institute of Biological Sciences (PTN) joint PhD program, School of Life Sciences, Peking University, Beijing 100871, Beijing, China
- Qian Guo
  - State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, Beijing, China
  - Center for Quantitative Biology, Peking University, Beijing 100871, Beijing, China
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, GA 30332, Atlanta, USA
- Congmin Xu
  - State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, Beijing, China
  - Center for Quantitative Biology, Peking University, Beijing 100871, Beijing, China
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, GA 30332, Atlanta, USA
- Xiaoqing Jiang
  - State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, Beijing, China
  - Center for Quantitative Biology, Peking University, Beijing 100871, Beijing, China
- Huaiqiu Zhu
  - State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, Beijing, China
  - Center for Quantitative Biology, Peking University, Beijing 100871, Beijing, China
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, GA 30332, Atlanta, USA
  - Institute of Medical Technology, Peking University Health Science Center, Beijing 100191, Beijing, China

14
Guo Q, Li M, Wang C, Guo J, Jiang X, Tan J, Wu S, Wang P, Xiao T, Zhou M, Fang Z, Xiao Y, Zhu H. Predicting hosts based on early SARS-CoV-2 samples and analyzing the 2020 pandemic. Sci Rep 2021; 11:17422. [PMID: 34465838 PMCID: PMC8408148 DOI: 10.1038/s41598-021-96903-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/09/2021] [Accepted: 08/18/2021] [Indexed: 11/16/2022] Open
Abstract
The SARS-CoV-2 pandemic has raised concerns in the identification of the hosts of the virus since the early stages of the outbreak. To address this problem, we proposed a deep learning method, DeepHoF, based on extracting viral genomic features automatically, to predict the host likelihood scores on five host types, including plant, germ, invertebrate, non-human vertebrate and human, for novel viruses. DeepHoF made up for the lack of an accurate tool, reaching a satisfactory AUC of 0.975 in the five-classification, and could make a reliable prediction for the novel viruses without close neighbors in phylogeny. Additionally, to fill the gap in the efficient inference of host species for SARS-CoV-2 using existing tools, we conducted a deep analysis on the host likelihood profile calculated by DeepHoF. Using the isolates sequenced in the earliest stage of the COVID-19 pandemic, we inferred that minks, bats, dogs and cats were potential hosts of SARS-CoV-2, while minks might be one of the most noteworthy hosts. Several genes of SARS-CoV-2 demonstrated their significance in determining the host range. Furthermore, a large-scale genome analysis, based on DeepHoF's computation for the later pandemic in 2020, disclosed the uniformity of host range among SARS-CoV-2 samples and the strong association of SARS-CoV-2 between humans and minks.
Affiliation(s)
- Qian Guo
  - State Key Laboratory for Turbulence and Complex Systems, Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China
  - Center for Quantitative Biology, Peking University, Beijing, 100871, China
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, 30332, USA
- Mo Li
  - Peking University-Tsinghua University-National Institute of Biological Sciences (PTN) Joint PhD Program, School of Life Sciences, Peking University, Beijing, 100871, China
- Chunhui Wang
  - Peking University-Tsinghua University-National Institute of Biological Sciences (PTN) Joint PhD Program, School of Life Sciences, Peking University, Beijing, 100871, China
- Jinyuan Guo
  - State Key Laboratory for Turbulence and Complex Systems, Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, 30332, USA
- Xiaoqing Jiang
  - State Key Laboratory for Turbulence and Complex Systems, Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China
  - Center for Quantitative Biology, Peking University, Beijing, 100871, China
  - Institute of Medical Technology, Peking University Health Science Center, Beijing, 100191, China
- Jie Tan
  - State Key Laboratory for Turbulence and Complex Systems, Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China
- Shufang Wu
  - State Key Laboratory for Turbulence and Complex Systems, Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China
  - Center for Quantitative Biology, Peking University, Beijing, 100871, China
- Peihong Wang
  - State Key Laboratory for Turbulence and Complex Systems, Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China
- Tingting Xiao
  - State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, 310006, China
- Man Zhou
  - State Key Laboratory for Turbulence and Complex Systems, Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China
  - Center for Quantitative Biology, Peking University, Beijing, 100871, China
- Zhencheng Fang
  - State Key Laboratory for Turbulence and Complex Systems, Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China
  - Center for Quantitative Biology, Peking University, Beijing, 100871, China
- Yonghong Xiao
  - State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, 310006, China
- Huaiqiu Zhu
  - State Key Laboratory for Turbulence and Complex Systems, Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, 100871, China
  - Center for Quantitative Biology, Peking University, Beijing, 100871, China
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, 30332, USA
  - Institute of Medical Technology, Peking University Health Science Center, Beijing, 100191, China

15
Global overview and major challenges of host prediction methods for uncultivated phages. Curr Opin Virol 2021; 49:117-126. [PMID: 34126465 DOI: 10.1016/j.coviro.2021.05.003] [Citation(s) in RCA: 54] [Impact Index Per Article: 13.5] [Received: 04/01/2021] [Revised: 05/20/2021] [Accepted: 05/22/2021] [Indexed: 12/14/2022]
Abstract
Bacterial communities play critical roles across all of Earth's biomes, affecting human health and global ecosystem functioning. They do so under strong constraints exerted by viruses, that is, bacteriophages or 'phages'. Phages can reshape bacterial communities' structure, influence long-term evolution of bacterial populations, and alter host cell metabolism during infection. Metagenomics approaches, that is, shotgun sequencing of environmental DNA or RNA, recently enabled large-scale exploration of phage genomic diversity, yielding several millions of phage genomes now to be further analyzed and characterized. One major challenge however is the lack of direct host information for these phages. Several methods and tools have been proposed to bioinformatically predict the potential host(s) of uncultivated phages based only on genome sequence information. Here we review these different approaches and highlight their distinct strengths and limitations. We also outline complementary experimental assays which are being proposed to validate and refine these bioinformatic predictions.

16
Bartoszewicz JM, Seidel A, Renard BY. Interpretable detection of novel human viruses from genome sequencing data. NAR Genom Bioinform 2021; 3:lqab004. [PMID: 33554119 PMCID: PMC7849996 DOI: 10.1093/nargab/lqab004] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 10/12/2020] [Revised: 01/04/2021] [Accepted: 01/15/2021] [Indexed: 01/21/2023] Open
Abstract
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.
Affiliation(s)
- Jakub M Bartoszewicz
  - Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
  - Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
  - Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, 14482 Potsdam, Brandenburg, Germany
  - Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Brandenburg, Germany
- Anja Seidel
  - Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
  - Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
- Bernhard Y Renard
  - Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
  - Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, 14482 Potsdam, Brandenburg, Germany
  - Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Brandenburg, Germany

17
Pons JC, Paez-Espino D, Riera G, Ivanova N, Kyrpides NC, Llabrés M. VPF-Class: Taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinformatics 2021; 37:1805-1813. [PMID: 33471063 PMCID: PMC8830756 DOI: 10.1093/bioinformatics/btab026] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 09/02/2020] [Revised: 12/11/2020] [Accepted: 01/13/2021] [Indexed: 12/03/2022] Open
Abstract
Motivation: Two key steps in the analysis of uncultured viruses recovered from metagenomes are the taxonomic classification of the viral sequences and the identification of putative host(s). Both steps rely mainly on the assignment of viral proteins to orthologs in cultivated viruses. Viral Protein Families (VPFs) can be used for the robust identification of new viral sequences in large metagenomics datasets. Despite the importance of VPF information for viral discovery, VPFs have not yet been explored for determining viral taxonomy and host targets. Results: In this work, we classified the set of VPFs from the IMG/VR database and developed VPF-Class. VPF-Class is a tool that automates the taxonomic classification and host prediction of viral contigs based on the assignment of their proteins to a set of classified VPFs. Applying VPF-Class on 731K uncultivated virus contigs from the IMG/VR database, we were able to classify 363K contigs at the genus level and predict the host of over 461K contigs. In the RefSeq database, VPF-Class reported an accuracy of nearly 100% to classify dsDNA, ssDNA and retroviruses, at the genus level, considering a membership ratio and a confidence score of 0.2. The accuracy in host prediction was 86.4%, also at the genus level, considering a membership ratio of 0.3 and a confidence score of 0.5. In the prophages dataset, the accuracy in host prediction was 86% considering a membership ratio of 0.6 and a confidence score of 0.8. Moreover, from the Global Ocean Virome dataset, over 817K viral contigs out of 1 million were classified. Availability and implementation: The implementation of VPF-Class can be downloaded from https://github.com/biocom-uib/vpf-tools. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Joan Carles Pons
  - Department of Mathematics and Computer Science, University of the Balearic Islands, Palma, 07122, Spain
- Gabriel Riera
  - Department of Mathematics and Computer Science, University of the Balearic Islands, Palma, 07122, Spain
- Natalia Ivanova
  - Department of Energy Joint Genome Institute, Berkeley, 94720, USA
- Nikos C Kyrpides
  - Department of Energy Joint Genome Institute, Berkeley, 94720, USA
- Mercè Llabrés
  - Department of Mathematics and Computer Science, University of the Balearic Islands, Palma, 07122, Spain

18
Khan Mirzaei M, Xue J, Costa R, Ru J, Schulz S, Taranu ZE, Deng L. Challenges of Studying the Human Virome - Relevant Emerging Technologies. Trends Microbiol 2020; 29:171-181. [PMID: 32622559 DOI: 10.1016/j.tim.2020.05.021] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 03/16/2020] [Revised: 05/27/2020] [Accepted: 05/28/2020] [Indexed: 01/17/2023]
Abstract
In this review we provide an overview of current challenges and advances in bacteriophage research within the growing field of viromics. In particular, we discuss, from a human virome study perspective, the current and emerging technologies available, their limitations in terms of de novo discoveries, and possible solutions to overcome present experimental and computational biases associated with low abundance of viral DNA or RNA. We summarize recent breakthroughs in metagenomics assembling tools and single-cell analysis, which have the potential to increase our understanding of phage biology, diversity, and interactions with both the microbial community and the human body. We expect that these recent and future advances in the field of viromics will have a strong impact on how we develop phage-based therapeutic approaches.
Affiliation(s)
- Mohammadali Khan Mirzaei
  - Institute of Virology, Helmholtz Centre Munich and Technical University of Munich, Neuherberg, Bavaria 85764, Germany
- Jinling Xue
  - Institute of Virology, Helmholtz Centre Munich and Technical University of Munich, Neuherberg, Bavaria 85764, Germany
- Rita Costa
  - Institute of Virology, Helmholtz Centre Munich and Technical University of Munich, Neuherberg, Bavaria 85764, Germany
- Jinlong Ru
  - Institute of Virology, Helmholtz Centre Munich and Technical University of Munich, Neuherberg, Bavaria 85764, Germany
- Sarah Schulz
  - Institute of Virology, Helmholtz Centre Munich and Technical University of Munich, Neuherberg, Bavaria 85764, Germany
- Zofia E Taranu
  - Aquatic Contaminants Research Division (ACRD), Environment and Climate Change Canada (ECCC), Montréal, QC H2Y 2E7, Canada
- Li Deng
  - Institute of Virology, Helmholtz Centre Munich and Technical University of Munich, Neuherberg, Bavaria 85764, Germany

19
Khot V, Strous M, Hawley AK. Computational approaches in viral ecology. Comput Struct Biotechnol J 2020; 18:1605-1612. [PMID: 32670501 PMCID: PMC7334295 DOI: 10.1016/j.csbj.2020.06.019] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 02/28/2020] [Revised: 06/09/2020] [Accepted: 06/10/2020] [Indexed: 01/21/2023] Open
Abstract
Dynamic virus-host interactions play a critical role in regulating microbial community structure and function. Yet for decades prior to the genomics era, viruses were largely overlooked in microbial ecology research, as only low-throughput culture-based methods of discovering viruses were available. With the advent of metagenomics, culture-independent techniques have provided exciting opportunities to discover and study new viruses. Here, we review recently developed computational methods for identifying viral sequences, exploring viral diversity in environmental samples, and predicting hosts from metagenomic sequence data. Methods to analyze viruses in silico utilize unconventional approaches to tackle challenges unique to viruses, such as vast diversity, mosaic viral genomes, and the lack of universal marker genes. As the field of viral ecology expands exponentially, computational advances have become increasingly important to gain insight into the role of viruses in diverse habitats.
Affiliation(s)
- Varada Khot
  - Department of Geoscience, University of Calgary, Calgary, AB T2N 1N4, Canada
- Marc Strous
  - Department of Geoscience, University of Calgary, Calgary, AB T2N 1N4, Canada
- Alyse K. Hawley
  - Department of Geoscience, University of Calgary, Calgary, AB T2N 1N4, Canada

20
Young F, Rogers S, Robertson DL. Predicting host taxonomic information from viral genomes: A comparison of feature representations. PLoS Comput Biol 2020; 16:e1007894. [PMID: 32453718 PMCID: PMC7307784 DOI: 10.1371/journal.pcbi.1007894] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 09/03/2019] [Revised: 06/22/2020] [Accepted: 04/21/2020] [Indexed: 12/13/2022] Open
Abstract
The rise in metagenomics has led to an exponential growth in virus discovery. However, the majority of these new virus sequences have no assigned host. Current machine learning approaches to predicting virus-host interactions tend to focus on nucleotide features, ignoring other representations of genomic information. Here we investigate the predictive potential of features generated from four different ‘levels’ of viral genome representation: nucleotide, amino acid, amino acid properties and protein domains. This more fully exploits the biological information present in the virus genomes. Over a hundred and eighty binary datasets for infecting versus non-infecting viruses at all taxonomic ranks of both eukaryote and prokaryote hosts were compiled. The viral genomes were converted into the four different levels of genome representation and twenty feature sets were generated by extracting k-mer compositions and predicted protein domains. We trained and tested Support Vector Machine (SVM) classifiers to compare the predictive capacity of each of these feature sets for each dataset. Our results show that all levels of genome representation are consistently predictive of host taxonomy and that prediction from k-mer composition improves with increasing k-mer length for all k-mer-based features. Using a phylogenetically aware holdout method, we demonstrate that the predictive feature sets contain signals reflecting both the evolutionary relationship between the viruses infecting related hosts, and host-mimicry. Our results demonstrate that incorporating a range of complementary features, generated purely from virus genome sequences, leads to improved accuracy for a range of virus-host prediction tasks, enabling computational assignment of host taxonomic information.
Elucidating the host of a newly identified virus species is an important challenge, with applications from knowing the source species of a newly emerged pathogen to understanding the bacteriophage-host relationships within the microbiome of any of earth’s ecosystems. Current high throughput methods used to identify viruses within biological or environmental samples have resulted in an unprecedented increase in virus discovery. However, for the majority of these virus genomes the host species/taxonomic classification remains unknown. To address this gap in our knowledge there is a need for fast, accurate computational methods for the assignment of putative host taxonomic information. Machine learning is an ideal approach but to maximise predictive accuracy the viral genomes need to be represented in a format (sets of features) that makes the discriminative information available to the machine learning algorithm. Here, we compare different types of features derived from the same viral genomes for their ability to predict host information. Our results demonstrate that all these feature sets are predictive of host taxonomy and when combined have the potential to improve accuracy over the use of individual feature sets across many virus host prediction applications.
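Illustrative note: the k-mer composition features described in the abstract above can be sketched as follows. This is a minimal sketch under stated assumptions. The function name `kmer_composition`, the default `k=2`, the restriction to the A/C/G/T alphabet, and the normalisation to relative frequencies are choices made for this example and are not taken from the authors' pipeline.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq: str, k: int = 2, alphabet: str = "ACGT") -> dict[str, float]:
    """Relative frequency of every possible k-mer over `alphabet` in `seq`.

    Hypothetical helper for illustration; windows containing symbols
    outside the alphabet (for example N) are ignored in this sketch.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed feature order: all len(alphabet)**k possible k-mers.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


if __name__ == "__main__":
    features = kmer_composition("ACGTACGT", k=2)
    print(len(features), features["AC"])
```

Because the dictionary always lists the k-mers in the same order, its values can be used directly as a fixed-length feature vector for a classifier such as an SVM. The vector length grows as 4^k for nucleotides, which is one practical cost of the longer k-mers that the abstract reports as more predictive.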
Affiliation(s)
- Francesca Young
  - MRC-University of Glasgow Centre For Virus Research, Glasgow, United Kingdom
- Simon Rogers
  - School of Computing Science, University of Glasgow, Glasgow, United Kingdom
- David L. Robertson
  - MRC-University of Glasgow Centre For Virus Research, Glasgow, United Kingdom