1
Consul S, Robertson J, Vikalo H. XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples. J Comput Biol 2025. [PMID: 40392695 DOI: 10.1089/cmb.2025.0075] [Indexed: 05/22/2025] Open
Abstract
It is estimated that approximately 15% of cancers worldwide can be linked to viral infections. The viruses that can cause or increase the risk of cancer include human papillomavirus, hepatitis B and C viruses, Epstein-Barr virus, and human immunodeficiency virus, to name a few. The computational analysis of the massive amounts of tumor DNA data, whose collection is enabled by the advancements in sequencing technologies, has allowed studies of the potential association between cancers and viral pathogens. However, the high diversity of oncoviral families makes reliable detection of viral DNA difficult, and the training of machine learning models that enable such analysis computationally challenging. We introduce XVir, a data pipeline that deploys a transformer-based deep learning architecture to reliably identify viral DNA present in human tumors. XVir is trained on a mix of sequencing reads coming from viral and human genomes, resulting in a model capable of robust detection of potentially mutated viral DNA across a range of experimental settings. Results on semi-experimental data demonstrate that XVir is able to achieve high classification accuracy, generally outperforming state-of-the-art competing methods. In particular, it retains high accuracy even when faced with diverse viral populations while being significantly faster to train than other large deep learning-based classifiers.
Affiliation(s)
- Shorya Consul
- Chandra Family Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas, USA
- John Robertson
- Chandra Family Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas, USA
- Haris Vikalo
- Chandra Family Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas, USA
2
Li J, Mi J, Lin W, Tian F, Wan J, Gao J, Tong Y. VirNucPro: an identifier for the identification of viral short sequences using six-frame translation and large language models. Brief Bioinform 2025; 26:bbaf224. [PMID: 40387494 PMCID: PMC12086996 DOI: 10.1093/bib/bbaf224] [Received: 11/19/2024] [Revised: 03/26/2025] [Accepted: 04/26/2025] [Indexed: 05/20/2025] Open
Abstract
Viruses are ubiquitous in nature, yet our understanding of them remains limited. High-throughput sequencing technology facilitates the unbiased revelation of genetic composition in samples; however, viral sequences typically make up a small proportion of the entire sequencing data, making it challenging to accurately identify the few or fragmented viral sequences present in a sample. The limited features and information provided by short sequences result in insufficient resolution of viral sequences by existing models. Therefore, we propose a new model, VirNucPro, for short viral sequence identification. Based on a six-frame translation strategy and large language models, we combine nucleotide and amino acid sequence information to enhance feature extraction for short sequences, achieving high accuracy in identifying short viral sequences. Ablation experiments compared the contributions of nucleotide and amino acid sequence features to the model, confirming that the introduced amino acid features significantly contribute to the classification results. Our model outperforms others, such as GCNFrame, DeepVirFinder, DETIRE, and Virtifier, which have demonstrated good performance in identifying short viral sequences of 300 and 500 bp. Our model demonstrates excellent performance on carefully created real-world datasets. Additionally, it can scan for prophage regions within long bacterial fragments, offering a wide range of applications. The codes are available at: https://github.com/Li-Jing-1997/VirNucPro.
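To illustrate the six-frame translation strategy named in this abstract, here is a minimal Python sketch. It is not the VirNucPro implementation: it assumes the standard genetic code, writes codons containing ambiguous bases as `X`, and uses hypothetical function names (`reverse_complement`, `translate`, `six_frame_translation`).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumption:
# the standard table applies; "*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> list[str]:
    """Return three forward-strand frames followed by three reverse-strand frames."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


frames = six_frame_translation("ATGGCC")
```

Each short read therefore yields six candidate peptide strings, which a downstream model could combine with the nucleotide sequence itself; how VirNucPro encodes and combines them is not specified in the abstract.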
Affiliation(s)
- Jing Li
- The College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- The College of Life Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Jia Mi
- The College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Wei Lin
- The College of Life Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Fengjuan Tian
- The College of Life Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Jing Wan
- The College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Jingyang Gao
- The College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Yigang Tong
- The College of Life Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
3
Nowicki M, Mroczek M, Mukhedkar D, Bała P, Nikolai Pimenoff V, Arroyo Mühr LS. HPV-KITE: sequence analysis software for rapid HPV genotype detection. Brief Bioinform 2025; 26:bbaf155. [PMID: 40205852 PMCID: PMC11982018 DOI: 10.1093/bib/bbaf155] [Received: 09/18/2024] [Revised: 03/04/2025] [Accepted: 03/19/2025] [Indexed: 04/11/2025] Open
Abstract
Human papillomaviruses (HPVs) are among the most diverse viral families that infect humans. Fortunately, only a small number of closely related HPV types affect human health, most notably by causing nearly all cervical cancers, as well as some oral and other anogenital cancers, particularly when infections with high-risk HPV types become persistent. Numerous viral polymerase chain reaction-based diagnostic methods as well as sequencing protocols have been developed for accurate, rapid, and efficient HPV genotyping. However, due to the large number of closely related HPV genotypes and the abundance of nonviral DNA in human derived biological samples, it can be challenging to correctly detect HPV genotypes using high throughput deep sequencing. Here, we introduce a novel HPV detection algorithm, HPV-KITE (HPV K-mer Index Tversky Estimator), which leverages k-mer data analysis and utilizes Tversky indexing for DNA and RNA sequence data. This method offers a rapid and sensitive alternative for detecting HPV from both metagenomic and transcriptomic datasets. We assessed HPV-KITE using three previously analyzed HPV infection-related datasets, comprising a total of 1430 sequenced human samples. For benchmarking, we compared our method's performance with standard HPV sequencing analysis algorithms, including general sequence-based mapping, and k-mer-based classification methods. Parallelization demonstrated fast processing times achieved through shingling, and scalability analysis revealed optimal performance when employing multiple nodes. Our results showed that HPV-KITE is one of the fastest, most accurate, and easiest ways to detect HPV genotypes from virtually any next-generation sequencing data. Moreover, the method is also highly scalable and available to be optimized for any microorganism other than HPV.
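The combination of k-mer sets and the Tversky index mentioned in this abstract can be sketched as follows. This is an illustrative Python example, not HPV-KITE's code: the value of k, the weights `alpha` and `beta`, the toy sequences, and the function names are all assumptions. The Tversky index of two sets X and Y is |X ∩ Y| / (|X ∩ Y| + α|X − Y| + β|Y − X|).

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers (overlapping substrings of length k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def tversky_index(x: set[str], y: set[str],
                  alpha: float = 1.0, beta: float = 1.0) -> float:
    """Tversky similarity of sets x and y; returns 0.0 when both are empty."""
    common = len(x & y)
    denom = common + alpha * len(x - y) + beta * len(y - x)
    return common / denom if denom else 0.0


# Toy data (hypothetical): a short read compared against a reference fragment.
read_kmers = kmers("ACGTAC", 3)
reference_kmers = kmers("ACGTACGG", 3)

# alpha = beta = 1 reduces to the Jaccard index.
jaccard_like = tversky_index(read_kmers, reference_kmers)

# alpha = 1, beta = 0 ignores reference k-mers missing from the read, so a
# read fully contained in the reference scores 1.0.
containment = tversky_index(read_kmers, reference_kmers, alpha=1.0, beta=0.0)
```

The asymmetric weighting is what makes the index useful when a short read is compared against a much longer reference genome: setting `beta` low avoids penalizing the read for all the reference k-mers it cannot possibly cover.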
Affiliation(s)
- Marek Nowicki
- Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Tyniecka 15/17, PL-02-630 Warsaw, Poland
- Faculty of Mathematics and Computer Science, Nicolaus Copernicus University in Toruń, ul. Chopina 12/18, PL-87-100 Toruń, Poland
- Magdalena Mroczek
- Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Tyniecka 15/17, PL-02-630 Warsaw, Poland
- Department of Biomedicine, University Hospital Basel, University of Basel, Hebelstrasse 20, CH-4031 Basel, Switzerland
- Dhananjay Mukhedkar
- Department of Clinical Science, Intervention and Technology, Forskningsgatan 56, Karolinska University Hospital, Karolinska Institutet, SE-14186 Stockholm, Sweden
- Hopsworks AB, Åsögatan 119, SE-116 24 Stockholm, Sweden
- Piotr Bała
- Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Tyniecka 15/17, PL-02-630 Warsaw, Poland
- Ville Nikolai Pimenoff
- Department of Clinical Science, Intervention and Technology, Forskningsgatan 56, Karolinska University Hospital, Karolinska Institutet, SE-14186 Stockholm, Sweden
- Research Unit of Population Health and Borealis Biobank, Faculty of Medicine, University of Oulu, Aapistie 5 B, FI-90014 University of Oulu, Finland
- Laila Sara Arroyo Mühr
- Department of Clinical Science, Intervention and Technology, Forskningsgatan 56, Karolinska University Hospital, Karolinska Institutet, SE-14186 Stockholm, Sweden
4
Duan C, Zang Z, Xu Y, He H, Li S, Liu Z, Lei Z, Zheng JS, Li SZ. FGeneBERT: function-driven pre-trained gene language model for metagenomics. Brief Bioinform 2025; 26:bbaf149. [PMID: 40211978 PMCID: PMC11986344 DOI: 10.1093/bib/bbaf149] [Received: 12/25/2024] [Revised: 02/22/2025] [Accepted: 03/14/2025] [Indexed: 04/14/2025] Open
Abstract
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer, which limits the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle with encoding biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGeneBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGeneBERT incorporates masked gene modeling to enhance the understanding of inter-gene contextual relationships and triplet enhanced metagenomic contrastive learning to elucidate gene sequence-function relationships. Pre-trained on over 100 million metagenomic sequences, FGeneBERT demonstrates superior performance on metagenomic datasets at four levels, spanning gene, functional, bacterial, and environmental levels and ranging from 1 to 213 k input sequences. Case studies of ATP synthase and gene operons highlight FGeneBERT's capability for functional recognition and its biological relevance in metagenomic research.
Affiliation(s)
- Chenrui Duan
- College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
- School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Zelin Zang
- Centre for Artificial Intelligence and Robotics (CAIR), HKISI-CAS Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong 310000, China
- Yongjie Xu
- College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
- School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Hang He
- School of Medicine and School of Life Sciences, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Siyuan Li
- College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
- School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Zihan Liu
- College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
- School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Zhen Lei
- Centre for Artificial Intelligence and Robotics (CAIR), HKISI-CAS Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong 310000, China
- State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China
- School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China
- Ju-Sheng Zheng
- School of Medicine and School of Life Sciences, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
- Stan Z Li
- School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
5
Wu L, Liu Y, Shi W, Chang T, Liu P, Liu K, He Y, Li Z, Shi M, Jiao N, Lang AS, Dong X, Zheng Q. Uncovering the hidden RNA virus diversity in Lake Nam Co: Evolutionary insights from an extreme high-altitude environment. Proc Natl Acad Sci U S A 2025; 122:e2420162122. [PMID: 39903107 PMCID: PMC11831205 DOI: 10.1073/pnas.2420162122] [Received: 10/01/2024] [Accepted: 12/23/2024] [Indexed: 02/06/2025] Open
Abstract
Alpine lakes, characterized by isolation, low temperatures, oligotrophic conditions, and intense ultraviolet radiation, remain a poorly explored ecosystem for RNA viruses. Here, we present the first comprehensive metatranscriptomic study of RNA viruses in Lake Nam Co, a high-altitude alkaline saline lake on the Tibetan Plateau. Using a combination of sequence- and structure-based homology searches, we identified 742 RNA virus species, including 383 novel genus-level groups and 84 novel family-level groups exclusively found in Lake Nam Co. These findings significantly expand the known diversity of the Orthornavirae, uncovering evolutionary adaptations such as permutated RNA-dependent RNA polymerase motifs and distinct RNA secondary structures. Notably, 14 additional RNA virus families potentially infecting prokaryotes were predicted, broadening the known host range of RNA viruses and questioning the traditional assumption that RNA viruses predominantly target eukaryotes. The presence of auxiliary metabolic genes in viral genomes suggested that RNA viruses (families f.0102 and Nam-Co_family_51) exploit host energy production mechanisms in energy-limited alpine lakes. Low nucleotide diversity, single nucleotide polymorphism frequencies, and pN/pS ratios indicate strong purifying selection in Nam Co viral populations. Our findings offer insights into RNA virus evolution and ecology, highlighting the importance of extreme environments in uncovering hidden viral diversity and shedding further light on their potential ecological implications, particularly in the context of climate change.
Affiliation(s)
- Lilin Wu
- Department of Marine Biology and Technology, College of Ocean and Earth Sciences and State Key Laboratory of Marine Environmental Science, Xiamen University, Xiamen 361005, China
- Fujian Key Laboratory of Marine Carbon Sequestration, Xiamen University, Xiamen 361005, China
- Yongqin Liu
- Center for the Pan-Third Pole Environment, Lanzhou University, Lanzhou 730000, China
- State Key Laboratory of Tibetan Plateau Earth System, Resources and Environment, Institute of Tibetan Plateau Research, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Wenqing Shi
- Department of Marine Biology and Technology, College of Ocean and Earth Sciences and State Key Laboratory of Marine Environmental Science, Xiamen University, Xiamen 361005, China
- Fujian Key Laboratory of Marine Carbon Sequestration, Xiamen University, Xiamen 361005, China
- Tianyi Chang
- Department of Marine Biology and Technology, College of Ocean and Earth Sciences and State Key Laboratory of Marine Environmental Science, Xiamen University, Xiamen 361005, China
- Fujian Key Laboratory of Marine Carbon Sequestration, Xiamen University, Xiamen 361005, China
- Pengfei Liu
- Center for the Pan-Third Pole Environment, Lanzhou University, Lanzhou 730000, China
- Keshao Liu
- State Key Laboratory of Tibetan Plateau Earth System, Resources and Environment, Institute of Tibetan Plateau Research, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Yong He
- Alibaba Cloud Intelligence, Alibaba Group, Hangzhou 310013, China
- Zhaorong Li
- Alibaba Cloud Intelligence, Alibaba Group, Hangzhou 310013, China
- Mang Shi
- Centre for Infection and Immunity Study, School of Medicine (Shenzhen), Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen 518107, China
- Nianzhi Jiao
- Department of Marine Biology and Technology, College of Ocean and Earth Sciences and State Key Laboratory of Marine Environmental Science, Xiamen University, Xiamen 361005, China
- Fujian Key Laboratory of Marine Carbon Sequestration, Xiamen University, Xiamen 361005, China
- Andrew S. Lang
- Department of Biology, Memorial University of Newfoundland, St. John’s, NL A1C 5S7, Canada
- Xiyang Dong
- Key Laboratory of Marine Genetic Resources, Third Institute of Oceanography, Ministry of Natural Resources, Xiamen 361005, China
- Qiang Zheng
- Department of Marine Biology and Technology, College of Ocean and Earth Sciences and State Key Laboratory of Marine Environmental Science, Xiamen University, Xiamen 361005, China
- Fujian Key Laboratory of Marine Carbon Sequestration, Xiamen University, Xiamen 361005, China
6
Nawaz MS, Nawaz MZ, Junyi Z, Fournier-Viger P, Qu JF. Exploiting the sequential nature of genomic data for improved analysis and identification. Comput Biol Med 2024; 183:109307. [PMID: 39488052 DOI: 10.1016/j.compbiomed.2024.109307] [Received: 04/13/2024] [Revised: 09/18/2024] [Accepted: 10/18/2024] [Indexed: 11/04/2024]
Abstract
Genomic data is growing exponentially, posing new challenges for sequence analysis and classification, particularly for managing and understanding harmful new viruses that may later cause pandemics. Recent genome sequence classification models yield promising performance. However, the majority of them do not consider the sequential arrangement of nucleotides and amino acids, a critical aspect for uncovering their inherent structure and function. To overcome this, we introduce GenoAnaCla, a novel approach for analyzing and classifying genome sequences, based on sequential pattern mining (SPM). The proposed approach first constructs and preprocesses datasets comprising RNA virus genome sequences in three formats: nucleotide, coding region, and protein. Then, to capture sequential features for the analysis and classification of viruses, GenoAnaCla extracts frequent sequential patterns and rules in three forms and in codons. Eight classifiers are utilized, and their effectiveness is assessed by employing a variety of evaluation metrics. A performance comparison demonstrates that the suggested approach surpasses the current state-of-the-art genome sequence classification and detection techniques with a 3.18% performance increase in accuracy on average.
Affiliation(s)
- M Saqib Nawaz
- College of Computer Science and Software Engineering, Shenzhen University, China.
- M Zohaib Nawaz
- College of Computer Science and Software Engineering, Shenzhen University, China; Faculty of Computing and Information Technology, Department of Computer Science, University of Sargodha, Pakistan.
- Zhang Junyi
- College of Computer Science and Software Engineering, Shenzhen University, China.
- Jun-Feng Qu
- School of Computer Engineering, Hubei University of Arts and Science, Xiangyang, Hubei, China.
7
Hou X, He Y, Fang P, Mei SQ, Xu Z, Wu WC, Tian JH, Zhang S, Zeng ZY, Gou QY, Xin GY, Le SJ, Xia YY, Zhou YL, Hui FM, Pan YF, Eden JS, Yang ZH, Han C, Shu YL, Guo D, Li J, Holmes EC, Li ZR, Shi M. Using artificial intelligence to document the hidden RNA virosphere. Cell 2024; 187:6929-6942.e16. [PMID: 39389057 DOI: 10.1016/j.cell.2024.09.027] [Received: 04/11/2024] [Revised: 08/01/2024] [Accepted: 09/16/2024] [Indexed: 10/12/2024]
Abstract
Current metagenomic tools can fail to identify highly divergent RNA viruses. We developed a deep learning algorithm, termed LucaProt, to discover highly divergent RNA-dependent RNA polymerase (RdRP) sequences in 10,487 metatranscriptomes generated from diverse global ecosystems. LucaProt integrates both sequence and predicted structural information, enabling the accurate detection of RdRP sequences. Using this approach, we identified 161,979 potential RNA virus species and 180 RNA virus supergroups, including many previously poorly studied groups, as well as RNA virus genomes of exceptional length (up to 47,250 nucleotides) and genomic complexity. A subset of these novel RNA viruses was confirmed by RT-PCR and RNA/DNA sequencing. Newly discovered RNA viruses were present in diverse environments, including air, hot springs, and hydrothermal vents, with virus diversity and abundance varying substantially among ecosystems. This study advances virus discovery, highlights the scale of the virosphere, and provides computational tools to better document the global RNA virome.
Affiliation(s)
- Xin Hou
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- Yong He
- Apsara Lab, Alibaba Cloud Intelligence, Alibaba Group, Hangzhou, China
- Pan Fang
- Apsara Lab, Alibaba Cloud Intelligence, Alibaba Group, Hangzhou, China
- Shi-Qiang Mei
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- Zan Xu
- Apsara Lab, Alibaba Cloud Intelligence, Alibaba Group, Hangzhou, China
- Wei-Chen Wu
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- Jun-Hua Tian
- Wuhan Centers for Disease Control and Prevention, Wuhan, China
- Shun Zhang
- Apsara Lab, Alibaba Cloud Intelligence, Alibaba Group, Hangzhou, China
- Zhen-Yu Zeng
- Apsara Lab, Alibaba Cloud Intelligence, Alibaba Group, Hangzhou, China
- Qin-Yu Gou
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- Gen-Yang Xin
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- Shi-Jia Le
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- Yin-Yue Xia
- Polar Research Institute of China, Shanghai, China
- Yu-Lan Zhou
- Department of Nursing, The Fifth Affiliated Hospital, Sun Yat-sen University, Zhuhai, China
- Feng-Ming Hui
- School of Geospatial Engineering and Science, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Sun Yat-sen University, Zhuhai, China; Key Laboratory of Comprehensive Observation of Polar Environment, Ministry of Education, Sun Yat-sen University, Zhuhai, China
- Yuan-Fei Pan
- Ministry of Education Key Laboratory of Biodiversity Science and Ecological Engineering, National Observations and Research Station for Wetland Ecosystems of the Yangtze Estuary, Institute of Biodiversity Science and Institute of Eco-Chongming, School of Life Sciences, Fudan University Shanghai, Shanghai, China
- John-Sebastian Eden
- Centre for Virus Research, Westmead Institute for Medical Research, Westmead, NSW, Australia; School of Medical Sciences, The University of Sydney, Sydney, NSW 2006, Australia
- Zhao-Hui Yang
- College of Life Sciences, Zhejiang University, Hangzhou, China
- Chong Han
- School of Life Science, Guangzhou University, Guangzhou, China
- Yue-Long Shu
- Key Laboratory of Pathogen Infection Prevention and Control (MOE), State Key Laboratory of Respiratory Health and Multimorbidity, National Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China; School of Public Health (Shenzhen), Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- Deyin Guo
- Guangzhou National Laboratory, Guangzhou International Bio-Island, Guangzhou, China
- Jun Li
- Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong SAR, China
- Edward C Holmes
- School of Medical Sciences, The University of Sydney, Sydney, NSW 2006, Australia; Laboratory of Data Discovery for Health Limited, Hong Kong SAR, China.
- Zhao-Rong Li
- Apsara Lab, Alibaba Cloud Intelligence, Alibaba Group, Hangzhou, China.
- Mang Shi
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China; Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China; Guangdong Provincial Center for Disease Control and Prevention, Guangzhou, China.
8
Zárate A, Díaz-González L, Taboada B. VirDetect-AI: a residual and convolutional neural network-based metagenomic tool for eukaryotic viral protein identification. Brief Bioinform 2024; 26:bbaf001. [PMID: 39808116 PMCID: PMC11729733 DOI: 10.1093/bib/bbaf001] [Received: 05/28/2024] [Revised: 11/12/2024] [Accepted: 08/01/2025] [Indexed: 01/16/2025] Open
Abstract
This study addresses the challenging task of identifying viruses within metagenomic data, which encompasses a broad array of biological samples, including animal reservoirs, environmental sources, and the human body. Traditional methods for virus identification often face limitations due to the diversity and rapid evolution of viral genomes. In response, recent efforts have focused on leveraging artificial intelligence (AI) techniques to enhance accuracy and efficiency in virus detection. However, existing AI-based approaches are primarily binary classifiers, lacking specificity in identifying viral types and reliant on nucleotide sequences. To address these limitations, VirDetect-AI, a novel tool specifically designed for the identification of eukaryotic viruses within metagenomic datasets, is introduced. The VirDetect-AI model employs a combination of convolutional neural networks and residual neural networks to effectively extract hierarchical features and detailed patterns from complex amino acid genomic data. The results demonstrated that the model has outstanding results in all metrics, with a sensitivity of 0.97, a precision of 0.98, and an F1-score of 0.98. VirDetect-AI improves our comprehension of viral ecology and can accurately classify metagenomic sequences into 980 viral protein classes, hence enabling the identification of new viruses. These classes encompass an extensive array of viral genera and families, as well as protein functions and hosts.
Affiliation(s)
- Alida Zárate
- Doctorado en Ciencias, Instituto de Investigación en Ciencias Básicas Aplicadas (IICBA), Universidad Autónoma del Estado de Morelos, Cuernavaca, Morelos 62210, México
- Lorena Díaz-González
- Centro de Investigación en Ciencias, Universidad Autónoma del Estado de Morelos, Cuernavaca, Morelos 62210, México
- Blanca Taboada
- Departamento de Genética del Desarrollo y Fisiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
9
Rahimian M, Panahi B. Metagenome sequence data mining for viral interaction studies: Review on progress and prospects. Virus Res 2024; 349:199450. [PMID: 39151562 PMCID: PMC11388672 DOI: 10.1016/j.virusres.2024.199450] [Received: 07/13/2024] [Revised: 08/11/2024] [Accepted: 08/13/2024] [Indexed: 08/19/2024]
Abstract
Metagenomics has been greatly accelerated by the development of next-generation sequencing (NGS) technologies, which allow scientists to discover and describe novel microorganisms without the need for conventional culture techniques. Examining integrative bioinformatics methods used in viral interaction research, this study highlights metagenomic data from various contexts. Accurate viral identification depends on high-purity genetic material extraction, appropriate NGS platform selection, and sophisticated bioinformatics tools like VirPipe and VirFinder. The efficiency and precision of metagenomic analysis are further improved with the advent of AI-based techniques. The diversity and dynamics of viral communities are demonstrated by case studies from a variety of environments, emphasizing the seasonal and geographical variations that influence viral populations. In addition to speeding up the discovery of new viruses, metagenomics offers thorough understanding of virus-host interactions and their ecological effects. This review provides a promising framework for comprehending the complexity of viral communities and their interactions with hosts, highlighting the transformational potential of metagenomics and bioinformatics in viral research.
Affiliation(s)
- Mohammadreza Rahimian
- Department of Biology, Faculty of Basic Sciences, University of Maragheh, Maragheh, Iran
| | - Bahman Panahi
- Department of Genomics, Branch for Northwest & West Region, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research, Education and Extension Organization (AREEO), Tabriz, Iran.
|
10
|
Miao Y, Sun Z, Lin C, Gu H, Ma C, Liang Y, Wang G. DeePhafier: a phage lifestyle classifier using a multilayer self-attention neural network combining protein information. Brief Bioinform 2024; 25:bbae377. [PMID: 39110476 PMCID: PMC11304974 DOI: 10.1093/bib/bbae377] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2024] [Revised: 07/04/2024] [Accepted: 07/19/2024] [Indexed: 08/10/2024] Open
Abstract
Bacteriophages are viruses that infect bacterial cells. They are the most diverse biological entities on earth and play important roles in the microbiome. According to their lifestyle, phages can be divided into virulent phages and temperate phages. Classifying virulent and temperate phages is crucial for further understanding of phage-host interactions. Although several methods have been designed for phage lifestyle classification, they consider either sequence features or gene features alone, leading to low accuracy. A new computational method, DeePhafier, is proposed to improve classification performance on phage lifestyle. Built from several multilayer self-attention neural networks and a global self-attention neural network, combined with protein features from the Position Specific Scoring Matrix, DeePhafier improves the classification accuracy and outperforms two benchmark methods. The accuracy of DeePhafier on five-fold cross-validation is as high as 87.54% for sequences with length >2000 bp.
Affiliation(s)
- Yan Miao
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, Heilongjiang, China
| | - Zhenyuan Sun
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, Heilongjiang, China
| | - Chen Lin
- National Institute for Data Science in Health and Medicine, Xiamen University, No. 4221 Xiangannan Road, Xiamen, 361102, Fujian, China
| | - Haoran Gu
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, Heilongjiang, China
| | - Chenjing Ma
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, Heilongjiang, China
| | - Yingjian Liang
- Key Laboratory of Hepatosplenic Surgery, Ministry of Education, Department of General Surgery, the First Affiliated Hospital of Harbin Medical University, No. 23 Postal Street, Harbin, 150007, Heilongjiang, China
| | - Guohua Wang
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, Heilongjiang, China
|
11
|
Copeland CJ, Roddy JW, Schmidt AK, Secor P, Wheeler T. VIBES: a workflow for annotating and visualizing viral sequences integrated into bacterial genomes. NAR Genom Bioinform 2024; 6:lqae030. [PMID: 38584872 PMCID: PMC10993291 DOI: 10.1093/nargab/lqae030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Revised: 02/05/2024] [Accepted: 03/18/2024] [Indexed: 04/09/2024] Open
Abstract
Bacteriophages are viruses that infect bacteria. Many bacteriophages integrate their genomes into the bacterial chromosome and become prophages. Prophages may substantially burden or benefit host bacteria fitness, acting in some cases as parasites and in others as mutualists. Some prophages have been demonstrated to increase host virulence. The increasing ease of bacterial genome sequencing provides an opportunity to deeply explore prophage prevalence and insertion sites. Here we present VIBES (Viral Integrations in Bacterial genomES), a workflow intended to automate prophage annotation in complete bacterial genome sequences. VIBES provides additional context to prophage annotations by annotating bacterial genes and viral proteins in user-provided bacterial and viral genomes. The VIBES pipeline is implemented as a Nextflow-driven workflow, providing a simple, unified interface for execution on local, cluster and cloud computing environments. For each step of the pipeline, a container including all necessary software dependencies is provided. VIBES produces results in simple tab-separated format and generates intuitive and interactive visualizations for data exploration. Despite VIBES's primary emphasis on prophage annotation, its generic alignment-based design allows it to be deployed as a general-purpose sequence similarity search manager. We demonstrate the utility of the VIBES prophage annotation workflow by searching for 178 Pf phage genomes across 1072 Pseudomonas spp. genomes.
Affiliation(s)
- Conner J Copeland
- Division of Biological Sciences, University of Montana, Missoula, MT, 59812, USA
| | - Jack W Roddy
- R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ, 85721, USA
| | - Amelia K Schmidt
- Division of Biological Sciences, University of Montana, Missoula, MT, 59812, USA
| | - Patrick R Secor
- Division of Biological Sciences, University of Montana, Missoula, MT, 59812, USA
| | - Travis J Wheeler
- R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ, 85721, USA
|
12
|
Sun K, Fu K, Hu T, Shentu X, Yu X. Leveraging insect viruses and genetic manipulation for sustainable agricultural pest control. Pest Management Science 2024; 80:2515-2527. [PMID: 37948321 DOI: 10.1002/ps.7878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Revised: 10/16/2023] [Accepted: 11/11/2023] [Indexed: 11/12/2023]
Abstract
The potential of insect viruses in the biological control of agricultural pests is well-recognized, yet their practical application faces obstacles such as host specificity, variable virulence, and resource scarcity. High-throughput sequencing (HTS) technologies have significantly advanced our capabilities in discovering and identifying new insect viruses, thereby enriching the arsenal for pest management. Concurrently, progress in reverse genetics has facilitated the development of versatile viral expression vectors. These vectors have enhanced the specificity and effectiveness of insect viruses in targeting specific pests, offering a more precise approach to pest control. This review provides a comprehensive examination of the methodologies employed in the identification of insect viruses using HTS. Additionally, it explores the domain of genetically modified insect viruses and their associated challenges in pest management. The adoption of these cutting-edge approaches holds great promise for developing environmentally sustainable and effective pest control solutions. © 2023 Society of Chemical Industry.
Affiliation(s)
- Kai Sun
- Zhejiang Provincial Key Laboratory of Biometrology and Inspection & Quarantine, College of Life Sciences, China Jiliang University, Hangzhou, China
| | - Kang Fu
- Zhejiang Provincial Key Laboratory of Biometrology and Inspection & Quarantine, College of Life Sciences, China Jiliang University, Hangzhou, China
| | - Tao Hu
- Zhejiang Seed Industry Group Xinchuang Bio-breeding Co., Ltd., Hangzhou, China
| | - Xuping Shentu
- Zhejiang Provincial Key Laboratory of Biometrology and Inspection & Quarantine, College of Life Sciences, China Jiliang University, Hangzhou, China
| | - Xiaoping Yu
- Zhejiang Provincial Key Laboratory of Biometrology and Inspection & Quarantine, College of Life Sciences, China Jiliang University, Hangzhou, China
|
13
|
Hegarty B, Riddell V J, Bastien E, Langenfeld K, Lindback M, Saini JS, Wing A, Zhang J, Duhaime M. Benchmarking informatics approaches for virus discovery: caution is needed when combining in silico identification methods. mSystems 2024; 9:e0110523. [PMID: 38376167 PMCID: PMC10949488 DOI: 10.1128/msystems.01105-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Accepted: 01/24/2024] [Indexed: 02/21/2024] Open
Abstract
Understanding the ecological impacts of viruses on natural and engineered ecosystems relies on the accurate identification of viral sequences from community sequencing data. To maximize viral recovery from metagenomes, researchers frequently combine viral identification tools. However, the effectiveness of this strategy is unknown. Here, we benchmarked combinations of six widely used informatics tools for viral identification and analysis (VirSorter, VirSorter2, VIBRANT, DeepVirFinder, CheckV, and Kaiju), called "rulesets." Rulesets were tested against mock metagenomes composed of taxonomically diverse sequence types and diverse aquatic metagenomes to assess the effects of the degree of viral enrichment and habitat on tool performance. We found that six rulesets achieved equivalent accuracy [Matthews Correlation Coefficient (MCC) = 0.77, Padj ≥ 0.05]. Each contained VirSorter2, and five used our "tuning removal" rule designed to remove non-viral contamination. While DeepVirFinder, VIBRANT, and VirSorter were each found once in these high-accuracy rulesets, they were not found in combination with each other: combining tools does not lead to optimal performance. Our validation suggests that the MCC plateau at 0.77 is partly caused by inaccurate labeling within reference sequence databases. In aquatic metagenomes, our highest MCC ruleset identified more viral sequences in virus-enriched (44%-46%) than in cellular metagenomes (7%-19%). While improved algorithms may lead to more accurate viral identification tools, this should be done in tandem with careful curation of sequence databases. We recommend using the VirSorter2 ruleset and our empirically derived tuning removal rule. Our analysis provides insight into methods for in silico viral identification and will enable more robust viral identification from metagenomic data sets. 
IMPORTANCE The identification of viruses from environmental metagenomes using informatics tools has offered critical insights in microbial ecology. However, it remains difficult for researchers to know which tools optimize viral recovery for their specific study. In an attempt to recover more viruses, studies are increasingly combining the outputs from multiple tools without validating this approach. After benchmarking combinations of six viral identification tools against mock metagenomes and environmental samples, we found that these tools should only be combined cautiously. Two to four tool combinations maximized viral recovery and minimized non-viral contamination compared with either the single-tool or the five- to six-tool ones. By providing a rigorous overview of the behavior of in silico viral identification strategies and a pipeline to replicate our process, our findings guide the use of existing viral identification tools and offer a blueprint for feature engineering of new tools that will lead to higher-confidence viral discovery in microbiome studies.
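The benchmark above scores each ruleset with the Matthews Correlation Coefficient. As a minimal sketch of that metric only (not the authors' pipeline), the Python below computes MCC from per-sequence viral/non-viral calls. The function names and the toy counts are hypothetical and chosen purely for illustration.

```python
import math


def confusion_counts(predicted, truth):
    """Count TP, TN, FP, FN for paired boolean viral/non-viral calls."""
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy counts, not taken from the benchmark.
example_mcc = matthews_corrcoef(tp=40, tn=45, fp=5, fn=10)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than plain accuracy when viral sequences are a small fraction of a metagenome.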
Affiliation(s)
- Bridget Hegarty
- Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio, USA
| | - James Riddell V
- Department of Microbiology, The Ohio State University, Columbus, Ohio, USA
| | - Eric Bastien
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
| | - Kathryn Langenfeld
- Department of Civil and Environmental Engineering, Stanford University, Palo Alto, California, USA
| | - Morgan Lindback
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
| | - Jaspreet S. Saini
- Laboratory for Environmental Biotechnology, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| | - Anthony Wing
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
| | - Jessica Zhang
- Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, Michigan, USA
| | - Melissa Duhaime
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
|
14
|
Miao Y, Sun Z, Ma C, Lin C, Wang G, Yang C. VirGrapher: a graph-based viral identifier for long sequences from metagenomes. Brief Bioinform 2024; 25:bbae036. [PMID: 38343326 PMCID: PMC10859693 DOI: 10.1093/bib/bbae036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 01/15/2024] [Accepted: 01/18/2024] [Indexed: 02/15/2024] Open
Abstract
Viruses are the most abundant biological entities on earth and are important components of microbial communities. A metagenome contains all microorganisms from an environmental sample. Correctly identifying viruses from these mixed sequences is critical in viral analyses. It is common to identify long viral sequences that have already been passed through assembly and binning pipelines. Existing deep learning-based methods divide these long sequences into short subsequences and identify them separately. This omits the relationships between the subsequences, leading to poor performance in identifying long viral sequences. In this paper, VirGrapher is proposed to improve the identification performance for long viral sequences by constructing relationships among the short subsequences derived from long ones. VirGrapher treats a long sequence as a graph and uses a Graph Convolutional Network (GCN) model to learn multilayer connections between nodes from sequences after a GCN-based node embedding model. VirGrapher achieves a better AUC value and accuracy on the validation set than three benchmark methods.
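The idea of viewing one long sequence as a graph of short subsequences can be illustrated with a small Python sketch. The abstract does not say how nodes or edges are built, so this is only an assumption for illustration: fixed-length, non-overlapping windows become nodes, and consecutive windows are linked by an edge. The function names and the `node_len` parameter are hypothetical.

```python
def split_into_nodes(sequence, node_len=500):
    """Cut a long sequence into consecutive windows; the last one may be shorter."""
    return [sequence[i:i + node_len] for i in range(0, len(sequence), node_len)]


def chain_edges(n_nodes):
    """Link each window to the next one (an assumed, simplest-possible edge rule)."""
    return [(i, i + 1) for i in range(n_nodes - 1)]


# Toy example with a very short window so the result is easy to inspect.
nodes = split_into_nodes("ACGTACGTACGT", node_len=5)
edges = chain_edges(len(nodes))
```

A graph model would then take node features (for example, an embedding of each window) together with this edge list, so that evidence from neighbouring windows can inform the call for the whole sequence.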
Affiliation(s)
- Yan Miao
- College of Computer and Control Engineering, Northeast Forestry University, Hexing Road, 150040, Heilongjiang Province, China
| | - Zhenyuan Sun
- College of Computer and Control Engineering, Northeast Forestry University, Hexing Road, 150040, Heilongjiang Province, China
| | - Chenjing Ma
- College of Computer and Control Engineering, Northeast Forestry University, Hexing Road, 150040, Heilongjiang Province, China
| | - Chen Lin
- National Institute for Data Science in Health and Medicine, Xiamen University, Xiangannan Road, 361104, Fujian Province, China
| | - Guohua Wang
- College of Computer and Control Engineering, Northeast Forestry University, Hexing Road, 150040, Heilongjiang Province, China
| | - Chunxue Yang
- College of Landscape Architecture, Northeast Forestry University, Hexing Road, 150040, Heilongjiang Province, China
|
15
|
Roach MJ, Beecroft SJ, Mihindukulasuriya KA, Wang L, Paredes A, Cárdenas LAC, Henry-Cocks K, Lima LFO, Dinsdale EA, Edwards RA, Handley SA. Hecatomb: an integrated software platform for viral metagenomics. Gigascience 2024; 13:giae020. [PMID: 38832467 PMCID: PMC11148595 DOI: 10.1093/gigascience/giae020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Revised: 01/18/2024] [Accepted: 04/08/2024] [Indexed: 06/05/2024] Open
Abstract
BACKGROUND Modern sequencing technologies offer extraordinary opportunities for virus discovery and virome analysis. Annotation of viral sequences from metagenomic data requires a complex series of steps to ensure accurate annotation of individual reads and assembled contigs. In addition, varying study designs will require project-specific statistical analyses. FINDINGS Here we introduce Hecatomb, a bioinformatic platform coordinating commonly used tasks required for virome analysis. Hecatomb means "a great sacrifice." In this setting, Hecatomb is "sacrificing" false-positive viral annotations using extensive quality control and tiered-database searches. Hecatomb processes metagenomic data obtained from both short- and long-read sequencing technologies, providing annotations to individual sequences and assembled contigs. Results are provided in commonly used data formats useful for downstream analysis. Here we demonstrate the functionality of Hecatomb through the reanalysis of a primate enteric and a novel coral reef virome. CONCLUSION Hecatomb provides an integrated platform to manage many commonly used steps for virome characterization, including rigorous quality control, host removal, and both read- and contig-based analysis. Each step is managed using the Snakemake workflow manager with dependency management using Conda. Hecatomb outputs several tables properly formatted for immediate use within popular data analysis and visualization tools, enabling effective data interpretation for a variety of study designs. Hecatomb is hosted on GitHub (github.com/shandley/hecatomb) and is available for installation from Bioconda and PyPI.
Affiliation(s)
- Michael J Roach
- Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia
- Adelaide Centre for Epigenetics, University of Adelaide, Adelaide, SA, 5005, Australia
- South Australian Immunogenomics Cancer Institute, University of Adelaide, Adelaide, SA, 5005, Australia
| | - Sarah J Beecroft
- Harry Perkins Institute of Medical Research, Perth, WA, 6009, Australia
| | - Kathie A Mihindukulasuriya
- Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA
| | - Leran Wang
- Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA
| | - Anne Paredes
- Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA
| | - Luis Alberto Chica Cárdenas
- Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA
| | - Kara Henry-Cocks
- Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia
| | | | - Elizabeth A Dinsdale
- Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia
| | - Robert A Edwards
- Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia
| | - Scott A Handley
- Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA
|
16
|
Wang J, Chen C, Yao G, Ding J, Wang L, Jiang H. Intelligent Protein Design and Molecular Characterization Techniques: A Comprehensive Review. Molecules 2023; 28:7865. [PMID: 38067593 PMCID: PMC10707872 DOI: 10.3390/molecules28237865] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 11/13/2023] [Accepted: 11/23/2023] [Indexed: 12/18/2023] Open
Abstract
In recent years, the widespread application of artificial intelligence algorithms in protein structure, function prediction, and de novo protein design has significantly accelerated the process of intelligent protein design and led to many noteworthy achievements. This advancement in protein intelligent design holds great potential to accelerate the development of new drugs, enhance the efficiency of biocatalysts, and even create entirely new biomaterials. Protein characterization is the key to the performance of intelligent protein design. However, there is no consensus on the most suitable characterization method for intelligent protein design tasks. This review describes the methods, characteristics, and representative applications of traditional descriptors, sequence-based and structure-based protein characterization. It discusses their advantages, disadvantages, and scope of application. It is hoped that this could help researchers to better understand the limitations and application scenarios of these methods, and provide valuable references for choosing appropriate protein characterization techniques for related research in the field, so as to better carry out protein research.
Affiliation(s)
| | | | | | - Junjie Ding
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
| | - Liangliang Wang
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
| | - Hui Jiang
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
|
17
|
Fu P, Wu Y, Zhang Z, Qiu Y, Wang Y, Peng Y. VIGA: a one-stop tool for eukaryotic virus identification and genome assembly from next-generation-sequencing data. Brief Bioinform 2023; 25:bbad444. [PMID: 38048079 PMCID: PMC10753531 DOI: 10.1093/bib/bbad444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Revised: 10/26/2023] [Accepted: 11/11/2023] [Indexed: 12/05/2023] Open
Abstract
Identification of viruses and subsequent assembly of viral genomes from next-generation-sequencing (NGS) data are essential steps in virome studies. This study presents a one-stop tool named VIGA (available at https://github.com/viralInformatics/VIGA) for eukaryotic virus identification and genome assembly from NGS data. It is composed of four modules, namely identification, taxonomic annotation, assembly and novel virus discovery, which integrate several third-party tools such as BLAST, Trinity, MetaCompass and RagTag. Evaluation on multiple simulated and real virome datasets showed that VIGA assembled more complete virus genomes than its competitors on both metatranscriptomic and metagenomic data and performed well in assembling virus genomes at the strain level. Finally, VIGA was used to investigate the virome in metatranscriptomic data from the Human Microbiome Project and revealed differences in virome composition and positive rate among prediabetes, Crohn's disease and ulcerative colitis. Overall, VIGA should greatly aid the identification and characterization of viromes, especially of known viruses, in future studies.
Affiliation(s)
- Ping Fu
- Bioinformatics Center, College of Biology, Hunan Provincial Key Laboratory of Medical Virology, Hunan University, Changsha 410082, China
| | - Yifan Wu
- Bioinformatics Center, College of Biology, Hunan Provincial Key Laboratory of Medical Virology, Hunan University, Changsha 410082, China
| | - Zhiyuan Zhang
- Bioinformatics Center, College of Biology, Hunan Provincial Key Laboratory of Medical Virology, Hunan University, Changsha 410082, China
| | - Ye Qiu
- Bioinformatics Center, College of Biology, Hunan Provincial Key Laboratory of Medical Virology, Hunan University, Changsha 410082, China
| | - Yirong Wang
- Bioinformatics Center, College of Biology, Hunan Provincial Key Laboratory of Medical Virology, Hunan University, Changsha 410082, China
| | - Yousong Peng
- Bioinformatics Center, College of Biology, Hunan Provincial Key Laboratory of Medical Virology, Hunan University, Changsha 410082, China
|
18
|
Copeland CJ, Roddy JW, Schmidt AK, Secor PR, Wheeler TJ. VIBES: A Workflow for Annotating and Visualizing Viral Sequences Integrated into Bacterial Genomes. bioRxiv 2023:2023.10.17.562434. [PMID: 37905003 PMCID: PMC10614876 DOI: 10.1101/2023.10.17.562434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2023]
Abstract
Bacteriophages are viruses that infect bacteria. Many bacteriophages integrate their genomes into the bacterial chromosome and become prophages. Prophages may substantially burden or benefit host bacteria fitness, acting in some cases as parasites and in others as mutualists, and have been demonstrated to increase host virulence. The increasing ease of bacterial genome sequencing provides an opportunity to deeply explore prophage prevalence and insertion sites. Here we present VIBES, a workflow intended to automate prophage annotation in complete bacterial genome sequences. VIBES provides additional context to prophage annotations by annotating bacterial genes and viral proteins in user-provided bacterial and viral genomes. The VIBES pipeline is implemented as a Nextflow-driven workflow, providing a simple, unified interface for execution on local, cluster, and cloud computing environments. For each step of the pipeline, a container including all necessary software dependencies is provided. VIBES produces results in simple tab separated format and generates intuitive and interactive visualizations for data exploration. Despite VIBES' primary emphasis on prophage annotation, its generic alignment-based design allows it to be deployed as a general-purpose sequence similarity search manager. We demonstrate the utility of the VIBES prophage annotation workflow by searching for 178 Pf phage genomes across 1,072 Pseudomonas spp. genomes. VIBES software is available at https://github.com/TravisWheelerLab/VIBES.
Affiliation(s)
- Conner J. Copeland
- Division of Biological Sciences, University of Montana, Missoula, MT, USA
| | - Jack W. Roddy
- R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ, USA
| | - Amelia K. Schmidt
- Division of Biological Sciences, University of Montana, Missoula, MT, USA
| | - Patrick R. Secor
- Division of Biological Sciences, University of Montana, Missoula, MT, USA
| | - Travis J. Wheeler
- R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ, USA
|
19
|
Qi YH, Ye ZX, Zhang CX, Chen JP, Li JM. Diversity of RNA viruses in agricultural insects. Comput Struct Biotechnol J 2023; 21:4312-4321. [PMID: 37711182 PMCID: PMC10497914 DOI: 10.1016/j.csbj.2023.08.036] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 06/14/2023] [Revised: 08/31/2023] [Accepted: 08/31/2023] [Indexed: 09/16/2023] Open
Abstract
Recent advancements in next-generation sequencing (NGS) technology and bioinformatics tools have revealed a vast array of viral diversity in insects, particularly RNA viruses. However, our current understanding of insect RNA viruses has primarily focused on hematophagous insects due to their medical importance, while research on the viromes of agriculturally relevant insects remains limited. This comprehensive review aims to address the gap by providing an overview of the diversity of RNA viruses in agricultural pests and beneficial insects within the agricultural ecosystem. Based on the NCBI Virus Database, over eight hundred RNA viruses belonging to 39 viral families have been reported in more than three hundred agricultural insect species. These viruses are predominantly found in the insect orders Hymenoptera, Hemiptera, Thysanoptera, Lepidoptera, Diptera, Coleoptera, and Orthoptera. These findings have significantly enriched our understanding of RNA viral diversity in agricultural insects. While further virome investigations are necessary to expand our knowledge to more insect species, it is crucial to explore the biological roles of these identified RNA viruses within insects in future studies. This review also highlights the limitations and challenges of effective virus discovery through NGS and their potential solutions, which might facilitate the development of innovative bioinformatic tools in the future.
Affiliation(s)
- Yu-Hua Qi
- State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-products, Key Laboratory of Biotechnology in Plant Protection of Ministry of Agriculture and Zhejiang Province, Institute of Plant Virology, Ningbo University, Ningbo 315211, China
| | - Zhuang-Xin Ye
- State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-products, Key Laboratory of Biotechnology in Plant Protection of Ministry of Agriculture and Zhejiang Province, Institute of Plant Virology, Ningbo University, Ningbo 315211, China
| | - Chuan-Xi Zhang
- State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-products, Key Laboratory of Biotechnology in Plant Protection of Ministry of Agriculture and Zhejiang Province, Institute of Plant Virology, Ningbo University, Ningbo 315211, China
| | - Jian-Ping Chen
- State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-products, Key Laboratory of Biotechnology in Plant Protection of Ministry of Agriculture and Zhejiang Province, Institute of Plant Virology, Ningbo University, Ningbo 315211, China
| | - Jun-Min Li
- State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-products, Key Laboratory of Biotechnology in Plant Protection of Ministry of Agriculture and Zhejiang Province, Institute of Plant Virology, Ningbo University, Ningbo 315211, China
|
20
|
Miao Y, Bian J, Dong G, Dai T. DETIRE: a hybrid deep learning model for identifying viral sequences from metagenomes. Front Microbiol 2023; 14:1169791. [PMID: 37396369 PMCID: PMC10313334 DOI: 10.3389/fmicb.2023.1169791] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 02/20/2023] [Accepted: 05/18/2023] [Indexed: 07/04/2023] Open
Abstract
A metagenome contains all DNA sequences from an environmental sample, including those of viruses, bacteria, archaea, and eukaryotes. Because viruses are highly abundant and, as major pathogens, have caused vast mortality and morbidity throughout human history, detecting viruses in metagenomes plays a crucial role in analyzing the viral component of samples and is the very first step for clinical diagnosis. However, detecting viral fragments directly from metagenomes remains difficult because of the huge number of short sequences. In this study, a hybrid Deep lEarning model for idenTifying vIral sequences fRom mEtagenomes (DETIRE) is proposed to solve the problem. First, a graph-based nucleotide sequence embedding strategy is used to enrich the representation of DNA sequences by training an embedding matrix. Then, spatial and sequential features are extracted by trained CNN and BiLSTM networks, respectively, to enrich the features of short sequences. Finally, the two sets of features are combined with weights for the final decision. Trained on 220,000 sequences of 500 bp subsampled from the Virus and Host RefSeq genomes, DETIRE identifies more short viral sequences (<1,000 bp) than three recent methods: DeepVirFinder, PPR-Meta, and CHEER. DETIRE is freely available on GitHub (https://github.com/crazyinter/DETIRE).
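The training set described above consists of 500 bp fragments subsampled from reference genomes. A minimal Python sketch of that kind of subsampling follows. It assumes uniform random start positions and a fixed seed for reproducibility; neither detail is stated in the abstract, and the function name and parameters are hypothetical rather than taken from the DETIRE code.

```python
import random


def subsample_fragments(genome, n_fragments, frag_len=500, seed=0):
    """Draw fixed-length fragments at uniformly random start positions.

    Returns an empty list when the genome is shorter than one fragment.
    """
    if len(genome) < frag_len:
        return []
    rng = random.Random(seed)  # local generator so global random state is untouched
    last_start = len(genome) - frag_len
    starts = [rng.randint(0, last_start) for _ in range(n_fragments)]
    return [genome[s:s + frag_len] for s in starts]


# Toy stand-in for a reference genome (1,200 bp of repeated bases).
toy_genome = "ACGT" * 300
fragments = subsample_fragments(toy_genome, n_fragments=10)
```

In practice each fragment would also carry a label (virus or host) derived from the genome it was drawn from, which is what a classifier is trained against.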
|
21
|
Yang X, Qin S, Liu X, Zhang N, Chen J, Jin M, Liu F, Wang Y, Guo J, Shi H, Wang C, Chen Y. Meta-Viromic Sequencing Reveals Virome Characteristics of Mosquitoes and Culicoides on Zhoushan Island, China. Microbiol Spectr 2023; 11:e0268822. [PMID: 36651764 PMCID: PMC9927462 DOI: 10.1128/spectrum.02688-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023]
Abstract
Mosquitoes and biting Culicoides species are arbovirus vectors. Effective virome profile surveillance is essential for the prevention and control of insect-borne diseases. From June to September 2021, we collected eight species of female mosquito and Culicoides on Zhoushan Island, China, and used meta-viromic sequencing to analyze their virome compositions and characteristics. The classified virus reads were distributed in 191 genera in 66 families. The virus sequences in mosquitoes with the largest proportions were Iflaviridae (30.03%), Phasmaviridae (23.09%), Xinmoviridae (21.82%), Flaviviridae (13.44%), and Rhabdoviridae (8.40%). Single-strand RNA+ viruses formed the largest proportions of viruses in all samples. Blood meals indicated that blood-sucking mosquito hosts were mainly chicken, duck, pig, and human, broadly consistent with the habitats where the mosquitoes were collected. Novel viruses of the Orthobunyavirus, Narnavirus, and Iflavirus genera were found in Culicoides by de-novo assembly. The viruses with vertebrate hosts carried by mosquitoes and Culicoides also varied widely. The analysis of unclassified viruses and deep-learning analysis of the "dark matter" in the meta-viromic sequencing data revealed the presence of a large number of unknown viruses. IMPORTANCE The monitoring of the viromes of mosquitoes and Culicoides, widely distributed arbovirus transmission vectors, is crucial to evaluate the risk of infectious disease transmission. In this study, the compositions of the viromes of mosquitoes and Culicoides on Zhoushan Island varied widely and were related mainly to the host species, with different host species having different core viromes. Many unknown sequences in the Culicoides viromes remain to be annotated, suggesting the presence of a large number of unknown viruses.
Affiliation(s)
- Xiaojing Yang
- School of Public Health, China Medical University, Shenyang, Liaoning Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Shiyu Qin
- College of Public Health, Zhengzhou University, Zhengzhou, Henan Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Xiong Liu
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Na Zhang
- School of Public Health, China Medical University, Shenyang, Liaoning Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Jiali Chen
- School of Public Health, China Medical University, Shenyang, Liaoning Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Meiling Jin
- School of Public Health, China Medical University, Shenyang, Liaoning Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Fangni Liu
- School of Public Health, China Medical University, Shenyang, Liaoning Province, China
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Yong Wang
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Jinpeng Guo
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Hua Shi
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Changjun Wang
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
| | - Yong Chen
- Chinese PLA Center for Disease Control and Prevention, Beijing, China
|
22
|
Shang J, Tang X, Guo R, Sun Y. Accurate identification of bacteriophages from metagenomic data using Transformer. Brief Bioinform 2022; 23:6620872. [PMID: 35769000 PMCID: PMC9294416 DOI: 10.1093/bib/bbac258] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 03/10/2022] [Revised: 05/22/2022] [Accepted: 06/04/2022] [Indexed: 11/20/2022]
Abstract
Motivation: Bacteriophages are viruses that infect bacteria. As key players in microbial communities, they can regulate the composition and function of the microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic material from various microbiomes, has become a popular means of new phage discovery. However, accurate and comprehensive detection of phages from metagenomic data remains difficult. High diversity and abundance, together with limited reference genomes, pose major challenges for recruiting phage fragments from metagenomic data. Existing alignment-based or learning-based models have either low recall or low precision on metagenomic data. Results: In this work, we adopt the state-of-the-art language model, Transformer, to conduct contextual embedding for phage contigs. By constructing a protein-cluster vocabulary, we can feed both the protein composition and the proteins’ positions from each contig into the Transformer. The Transformer can learn the protein organization and associations using the self-attention mechanism and predict the label for test contigs. We rigorously tested our tool, named PhaMer, on multiple datasets of increasing difficulty, including quality RefSeq genomes, short contigs, simulated metagenomic data, mock metagenomic data, and the public IMG/VR dataset. All the experimental results show that PhaMer outperforms the state-of-the-art tools. In the real metagenomic data experiment, PhaMer improves the F1-score of phage detection by 27%.
Affiliation(s)
- Jiayu Shang
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| | - Xubo Tang
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| | - Ruocheng Guo
- School of Data Science, City University of Hong Kong, Hong Kong (SAR), China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
|
23
|
Multi-task learning to leverage partially annotated data for PPI interface prediction. Sci Rep 2022; 12:10487. [PMID: 35729253 PMCID: PMC9213449 DOI: 10.1038/s41598-022-13951-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2022] [Accepted: 05/31/2022] [Indexed: 11/29/2022]
Abstract
Protein-protein interactions (PPI) are crucial for protein function; nevertheless, predicting the residues in PPI interfaces from the protein sequence remains a challenging problem. In addition, structure-based functional annotations, such as PPI interface annotations, are scarce: residue-based PPI interface annotations are available for only about one-third of all protein structures. To use a deep learning strategy, we have to overcome this problem of limited data availability. Here we use a multi-task learning strategy that can handle missing data. We start with a multi-task model architecture and adapt it to carefully handle missing data in the cost function. As related learning tasks we include prediction of secondary structure, solvent accessibility, and buried residues. Our results show that the multi-task learning strategy significantly outperforms single-task approaches. Moreover, only the multi-task strategy is able to learn effectively from a dataset extended with structural feature data without additional PPI annotations. The multi-task setup becomes even more important when the fraction of PPI annotations is very small: the multi-task learner trained on only one-eighth of the PPI annotations, with data extension, reaches the same performance as the single-task learner trained on all PPI annotations. Thus, we show that the multi-task learning strategy can be beneficial for a small training dataset where the protein’s functional properties of interest are only partially annotated.
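One common way to handle missing annotations in a multi-task cost function, as the abstract above describes in general terms, is to mask unlabeled residues out of each task's loss. The sketch below shows that idea only; it is not the authors' code. Treating every task as binary, marking missing labels with `None`, and the names `masked_bce` and `multitask_loss` are assumptions made for the example.

```python
import math


def masked_bce(preds, labels):
    """Mean binary cross-entropy over annotated residues only.

    A label of None marks a missing annotation (assumed convention) and
    contributes nothing to the loss. Returns 0.0 if no residue is annotated.
    """
    total, count = 0.0, 0
    for p, y in zip(preds, labels):
        if y is None:
            continue  # skip residues without an annotation for this task
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


def multitask_loss(preds_by_task, labels_by_task, weights=None):
    """Weighted sum of masked per-task losses (task weights are hypothetical)."""
    if weights is None:
        weights = {task: 1.0 for task in preds_by_task}
    return sum(weights[task] * masked_bce(preds_by_task[task], labels_by_task[task])
               for task in preds_by_task)


if __name__ == "__main__":
    # Made-up predictions: the protein has buried-residue labels but no PPI labels.
    preds = {"ppi": [0.3, 0.6], "buried": [0.8, 0.1]}
    labels = {"ppi": [None, None], "buried": [1, 0]}
    print(multitask_loss(preds, labels))
```

With this masking, a protein that lacks PPI annotations still contributes a training signal through its structural labels, which is the property that lets a dataset be extended with structural feature data alone.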
|
24
|
A novel liver cancer diagnosis method based on patient similarity network and DenseGCN. Sci Rep 2022; 12:6797. [PMID: 35474072 PMCID: PMC9043215 DOI: 10.1038/s41598-022-10441-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/14/2021] [Accepted: 04/05/2022] [Indexed: 11/17/2022]
Abstract
Liver cancer is the main malignancy in terms of mortality rate, and accurate diagnosis can improve its treatment outcome. A patient similarity network provides important information that helps in cancer diagnosis. However, recent works rarely take patient similarity into consideration. To address this issue, we constructed a patient similarity network using three types of liver cancer omics data and proposed a novel liver cancer diagnosis method consisting of similarity network fusion, a denoising autoencoder, and a dense graph convolutional neural network to capitalize on the patient similarity network and multi-omics data. We compared our proposed method with other state-of-the-art methods and machine learning methods on the TCGA-LIHC dataset to evaluate its performance. The results confirmed that our proposed method surpasses these comparison methods on all metrics. In particular, our proposed method attained an accuracy of up to 0.9857.
|