1
Consul S, Robertson J, Vikalo H. XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples. J Comput Biol 2025. [PMID: 40392695 DOI: 10.1089/cmb.2025.0075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2025]
Abstract
It is estimated that approximately 15% of cancers worldwide can be linked to viral infections. The viruses that can cause or increase the risk of cancer include human papillomavirus, hepatitis B and C viruses, Epstein-Barr virus, and human immunodeficiency virus, to name a few. The computational analysis of the massive amounts of tumor DNA data, whose collection is enabled by the advancements in sequencing technologies, has allowed studies of the potential association between cancers and viral pathogens. However, the high diversity of oncoviral families makes reliable detection of viral DNA difficult, and the training of machine learning models that enable such analysis computationally challenging. We introduce XVir, a data pipeline that deploys a transformer-based deep learning architecture to reliably identify viral DNA present in human tumors. XVir is trained on a mix of sequencing reads coming from viral and human genomes, resulting in a model capable of robust detection of potentially mutated viral DNA across a range of experimental settings. Results on semi-experimental data demonstrate that XVir is able to achieve high classification accuracy, generally outperforming state-of-the-art competing methods. In particular, it retains high accuracy even when faced with diverse viral populations while being significantly faster to train than other large deep learning-based classifiers.
Affiliation(s)
- Shorya Consul
- Chandra Family Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas, USA
- John Robertson
- Chandra Family Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas, USA
- Haris Vikalo
- Chandra Family Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas, USA
2
Wang Y, Dutta R, Futschik A. Estimating Haplotype Structure and Frequencies: A Bayesian Approach to Unknown Design in Pooled Genomic Data. J Comput Biol 2024; 31:708-726. [PMID: 38957993 DOI: 10.1089/cmb.2023.0211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2024]
Abstract
The estimation of haplotype structure and frequencies provides crucial information about the composition of genomes. Techniques, such as single-individual haplotyping, aim to reconstruct individual haplotypes from diploid genome sequencing data. However, our focus is distinct. We address the challenge of reconstructing haplotype structure and frequencies from pooled sequencing samples where multiple individuals are sequenced simultaneously. A frequentist method to address this issue has recently been proposed. In contrast to this and other methods that compute point estimates, our proposed Bayesian hierarchical model delivers a posterior that permits us to also quantify uncertainty. Since matching permutations in both haplotype structure and corresponding frequency matrix lead to the same reconstruction of their product, we introduce an order-preserving shrinkage prior that ensures identifiability with respect to permutations. For inference, we introduce a blocked Gibbs sampler that enforces the required constraints. In a simulation study, we assessed the performance of our method. Furthermore, by using our approach on two distinct sets of real data, we demonstrate that our Bayesian approach can reconstruct the dominant haplotypes in a challenging, high-dimensional set-up.
Affiliation(s)
- Yuexuan Wang
- Department of Applied Statistics, Johannes Kepler University, Linz, Austria
- Ritabrata Dutta
- Department of Statistics, University of Warwick, Coventry, United Kingdom
- Andreas Futschik
- Department of Applied Statistics, Johannes Kepler University, Linz, Austria
3
Delgado S, Somovilla P, Ferrer-Orta C, Martínez-González B, Vázquez-Monteagudo S, Muñoz-Flores J, Soria ME, García-Crespo C, de Ávila AI, Durán-Pastor A, Gadea I, López-Galíndez C, Moran F, Lorenzo-Redondo R, Verdaguer N, Perales C, Domingo E. Incipient functional SARS-CoV-2 diversification identified through neural network haplotype maps. Proc Natl Acad Sci U S A 2024; 121:e2317851121. [PMID: 38416684 PMCID: PMC10927536 DOI: 10.1073/pnas.2317851121] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 10/16/2023] [Accepted: 01/08/2024] [Indexed: 03/01/2024]
Abstract
Since its introduction in the human population, SARS-CoV-2 has evolved into multiple clades, but the events in its intrahost diversification are not well understood. Here, we compare three-dimensional (3D) self-organized neural haplotype maps (SOMs) of SARS-CoV-2 from thirty individual nasopharyngeal diagnostic samples obtained within a 19-day interval in Madrid (Spain), at the time of transition between clades 19 and 20. SOMs have been trained with the haplotype repertoire present in the mutant spectra of the nsp12- and spike (S)-coding regions. Each SOM consisted of a dominant neuron (displaying the maximum frequency), surrounded by a low-frequency neuron cloud. The sequence of the master (dominant) neuron was either identical to that of the reference Wuhan-Hu-1 genome or differed from it at one nucleotide position. Six different deviant haplotype sequences were identified among the master neurons. Some of the substitutions in the neural clouds affected critical sites of the nsp12-nsp8-nsp7 polymerase complex and resulted in altered kinetics of RNA synthesis in an in vitro primer extension assay. Thus, the analysis has identified mutations that are relevant to modification of viral RNA synthesis, present in the mutant clouds of SARS-CoV-2 quasispecies. These mutations most likely occurred during intrahost diversification in several COVID-19 patients, during an initial stage of the pandemic, and within a brief time period.
Affiliation(s)
- Soledad Delgado
- Departamento de Sistemas Informáticos, Escuela Técnica Superior de Ingeniería de Sistemas Informáticos, Universidad Politécnica de Madrid, Madrid 28031, Spain
- Pilar Somovilla
- Microbes in Health and Welfare Program, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas, Madrid 28049, Spain
- Departamento de Biología Molecular, Universidad Autónoma de Madrid, Madrid 28049, Spain
- Cristina Ferrer-Orta
- Structural and Molecular Biology Department, Institut de Biología Molecular de Barcelona, Consejo Superior de Investigaciones Científicas, Barcelona 08028, Spain
- Brenda Martínez-González
- Department of Molecular and Cell Biology, Centro Nacional de Biotecnología, Consejo Superior de Investigaciones Científicas, Madrid 28049, Spain
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid, Madrid 28040, Spain
- Sergi Vázquez-Monteagudo
- Structural and Molecular Biology Department, Institut de Biología Molecular de Barcelona, Consejo Superior de Investigaciones Científicas, Barcelona 08028, Spain
- María Eugenia Soria
- Microbes in Health and Welfare Program, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas, Madrid 28049, Spain
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid, Madrid 28040, Spain
- Carlos García-Crespo
- Microbes in Health and Welfare Program, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas, Madrid 28049, Spain
- Ana Isabel de Ávila
- Microbes in Health and Welfare Program, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas, Madrid 28049, Spain
- Antoni Durán-Pastor
- Department of Molecular and Cell Biology, Centro Nacional de Biotecnología, Consejo Superior de Investigaciones Científicas, Madrid 28049, Spain
- Ignacio Gadea
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid, Madrid 28040, Spain
- Cecilio López-Galíndez
- Unidad de Virología Molecular, Laboratorio de Referencia e Investigación en retrovirus, Centro Nacional de Microbiología, Instituto de salud Carlos III, Majadahonda 28222, Spain
- Federico Moran
- Departamento de Bioquímica y Biología Molecular, Universidad Complutense de Madrid, Madrid 28040, Spain
- Ramon Lorenzo-Redondo
- Department of Medicine, Division of Infectious Diseases, Northwestern University Feinberg School of Medicine, Center for Pathogen Genomics and Microbial Evolution, Northwestern University Havey Institute for Global Health, Chicago, IL 60611
- Nuria Verdaguer
- Structural and Molecular Biology Department, Institut de Biología Molecular de Barcelona, Consejo Superior de Investigaciones Científicas, Barcelona 08028, Spain
- Celia Perales
- Department of Molecular and Cell Biology, Centro Nacional de Biotecnología, Consejo Superior de Investigaciones Científicas, Madrid 28049, Spain
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid, Madrid 28040, Spain
- Esteban Domingo
- Microbes in Health and Welfare Program, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas, Madrid 28049, Spain
4
Ke Z, Vikalo H. Graph-Based Reconstruction and Analysis of Disease Transmission Networks Using Viral Genomic Data. J Comput Biol 2023. [PMID: 37347892 DOI: 10.1089/cmb.2022.0373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/24/2023]
Abstract
Understanding the patterns of viral disease transmissions helps establish public health policies and aids in controlling and ending a disease outbreak. Classical methods for studying disease transmission dynamics that rely on epidemiological data, such as times of sample collection and duration of exposure intervals, struggle to provide desired insight due to limited informativeness of such data. A more precise characterization of disease transmissions may be acquired from sequencing data that reveal genetic distance between viral genomes in patient samples. Indeed, genetic distance between viral strains present in hosts contains valuable information about transmission history, thus motivating the design of methods that rely on genomic data to reconstruct a directed disease transmission network, detect transmission clusters, and identify significant network nodes (e.g., super-spreaders). In this article, we present a novel end-to-end framework for the analysis of viral transmissions utilizing viral genomic (sequencing) data. The proposed framework groups infected hosts into transmission clusters based on the reconstructed viral strains infecting them; the genetic distance between a pair of hosts is calculated using Earth Mover's Distance, and further used to infer transmission direction between the hosts. To quantify the significance of a host in the transmission network, the importance score is calculated by a graph convolutional autoencoder. The viral transmission network is represented by a directed minimum spanning tree utilizing the Edmond's algorithm modified to incorporate constraints on the importance scores of the hosts. The proposed framework outperforms state-of-the-art techniques for the analysis of viral transmission dynamics in several experiments on semiexperimental as well as experimental data.
Affiliation(s)
- Ziqi Ke
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas, USA
- Haris Vikalo
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas, USA
5
Tithi SS, Aylward FO, Jensen RV, Zhang L. FastViromeExplorer-Novel: Recovering Draft Genomes of Novel Viruses and Phages in Metagenomic Data. J Comput Biol 2023; 30:391-408. [PMID: 36607772 DOI: 10.1089/cmb.2022.0397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/07/2023]
Abstract
Despite the recent surge of viral metagenomic studies, recovering complete virus/phage genomes from metagenomic data is still extremely difficult and most viral contigs generated from de novo assembly programs are highly fragmented, posing serious challenges to downstream analysis and inference. In this study, we develop FastViromeExplorer (FVE)-novel, a computational pipeline for reconstructing complete or near-complete viral draft genomes from metagenomic data. The FVE-novel deploys FVE to efficiently map metagenomic reads to viral reference genomes, performs de novo assembly of the mapped reads to generate contigs, and extends the contigs through iterative assembly to produce final viral scaffolds. We applied FVE-novel to an ocean metagenomic sample and obtained 268 viral scaffolds that potentially come from novel viruses. Through manual examination and validation of the 10 longest scaffolds, we successfully recovered 4 complete viral genomes, 2 are novel as they cannot be found in the existing databases and the other 2 are related to known phages. This hybrid reference-based and de novo assembly approach used by FVE-novel represents a powerful new approach for uncovering near-complete viral genomes in metagenomic data.
Affiliation(s)
- Frank O Aylward
- Department of Biological Sciences, Virginia Tech, Blacksburg, Virginia, USA
- Roderick V Jensen
- Department of Biological Sciences, Virginia Tech, Blacksburg, Virginia, USA
- Liqing Zhang
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia, USA
6
Gregori J, Colomer-Castell S, Campos C, Ibañez-Lligoña M, Garcia-Cehic D, Rando-Segura A, Adombie CM, Pintó R, Guix S, Bosch A, Domingo E, Gallego I, Perales C, Cortese MF, Tabernero D, Buti M, Riveiro-Barciela M, Esteban JI, Rodriguez-Frias F, Quer J. Quasispecies Fitness Partition to Characterize the Molecular Status of a Viral Population. Negative Effect of Early Ribavirin Discontinuation in a Chronically Infected HEV Patient. Int J Mol Sci 2022; 23:14654. [PMID: 36498981 PMCID: PMC9739305 DOI: 10.3390/ijms232314654] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/05/2022] [Revised: 11/11/2022] [Accepted: 11/17/2022] [Indexed: 11/25/2022]
Abstract
The changes occurring in viral quasispecies populations during infection have been monitored using diversity indices, nucleotide diversity, and several other indices to summarize the quasispecies structure in a single value. In this study, we present a method to partition quasispecies haplotypes into four fractions according to their fitness: the master haplotype, rare haplotypes at two levels (those present at <0.1%, and those at 0.1−1%), and a fourth fraction that we term emerging haplotypes, present at frequencies >1%, but less than that of the master haplotype. We propose that by determining the changes occurring in the volume of the four quasispecies fitness fractions together with those of the Hill number profile we will be able to visualize and analyze the molecular changes in the composition of a quasispecies with time. To develop this concept, we used three data sets: a technical clone of the complete SARS-CoV-2 spike gene, a subset of data previously used in a study of rare haplotypes, and data from a clinical follow-up study of a patient chronically infected with HEV and treated with ribavirin. The viral response to ribavirin mutagenic treatment was selection of a rich set of synonymous haplotypes. The mutation spectrum was very complex at the nucleotide level, but at the protein (phenotypic/functional) level the pattern differed, showing a highly prevalent master phenotype. We discuss the putative implications of this observation in relation to mutagenic antiviral treatment.
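As an illustration of the four-fraction partition described in the abstract above, the following is a minimal Python sketch, not the authors' implementation. The function name `partition_quasispecies`, the input format (a mapping from haplotype to read count), the fraction labels, and the tie-breaking rule for the master haplotype are all assumptions introduced here; the abstract states only the frequency thresholds.

```python
def partition_quasispecies(haplotype_counts):
    """Sum haplotype frequencies into four fractions by relative abundance.

    haplotype_counts: dict mapping a haplotype label to its read count
    (hypothetical input format, assumed for this sketch).
    Returns a dict with the total frequency held by each fraction.
    """
    total = sum(haplotype_counts.values())
    if total <= 0:
        raise ValueError("haplotype_counts must contain at least one read")

    # Assumption: the master haplotype is the single most abundant one;
    # ties are broken by dictionary order.
    master = max(haplotype_counts, key=haplotype_counts.get)

    fractions = {
        "master": 0.0,
        "emerging": 0.0,        # > 1% but below the master
        "rare_0.1_to_1": 0.0,   # 0.1% to 1%
        "rare_below_0.1": 0.0,  # < 0.1%
    }
    for haplotype, count in haplotype_counts.items():
        freq = count / total
        if haplotype == master:
            label = "master"
        elif freq > 0.01:
            label = "emerging"
        elif freq >= 0.001:
            label = "rare_0.1_to_1"
        else:
            label = "rare_below_0.1"
        fractions[label] += freq
    return fractions


if __name__ == "__main__":
    # Toy read counts, invented purely for illustration.
    counts = {"A": 9000, "B": 500, "C": 450, "D": 45, "E": 5}
    for name, value in partition_quasispecies(counts).items():
        print(f"{name}: {value:.4f}")
```

Tracking how these four totals shift between time points is one simple way to follow the compositional changes the abstract refers to; the Hill number profile mentioned alongside them is not covered by this sketch.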
Affiliation(s)
- Josep Gregori
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Sergi Colomer-Castell
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Biochemistry and Molecular Biology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Cerdanyola del Vallès, Spain
- Carolina Campos
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Biochemistry and Molecular Biology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Cerdanyola del Vallès, Spain
- Marta Ibañez-Lligoña
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Damir Garcia-Cehic
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Ariadna Rando-Segura
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Microbiology Department, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Caroline Melanie Adombie
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Institute of Agropastoral Management, University Peleforo Gon Coulibaly, Korhogo BP 1328, Côte d’Ivoire
- Rosa Pintó
- Enteric Virus Laboratory, Section of Microbiology, Virology and Biotechnology, Department of Genetics, Microbiology and Statistics, School of Biology, University of Barcelona, 08028 Barcelona, Spain
- Enteric Virus Laboratory, Institute of Nutrition and Food Safety (INSA), University of Barcelona, 08028 Barcelona, Spain
- Susanna Guix
- Enteric Virus Laboratory, Section of Microbiology, Virology and Biotechnology, Department of Genetics, Microbiology and Statistics, School of Biology, University of Barcelona, 08028 Barcelona, Spain
- Enteric Virus Laboratory, Institute of Nutrition and Food Safety (INSA), University of Barcelona, 08028 Barcelona, Spain
- Albert Bosch
- Enteric Virus Laboratory, Section of Microbiology, Virology and Biotechnology, Department of Genetics, Microbiology and Statistics, School of Biology, University of Barcelona, 08028 Barcelona, Spain
- Enteric Virus Laboratory, Institute of Nutrition and Food Safety (INSA), University of Barcelona, 08028 Barcelona, Spain
- Esteban Domingo
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Centro de Biología Molecular “Severo Ochoa” (CBMSO, CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
- Isabel Gallego
- Centro de Biología Molecular “Severo Ochoa” (CBMSO, CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
- Celia Perales
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Centro de Biología Molecular “Severo Ochoa” (CBMSO, CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM) Av. Reyes Católicos 2, 28040 Madrid, Spain
- Maria Francesca Cortese
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Biochemistry Department, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- David Tabernero
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Biochemistry Department, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Maria Buti
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Medicine Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
- Mar Riveiro-Barciela
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Medicine Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
- Juan Ignacio Esteban
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Medicine Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Bellaterra, Spain
- Francisco Rodriguez-Frias
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Biochemistry Department, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Josep Quer
- Liver Diseases-Viral Hepatitis, Liver Unit, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, Av. Monforte de Lemos, 3-5, 28029 Madrid, Spain
- Biochemistry and Molecular Biology Department, Universitat Autònoma de Barcelona (UAB), Campus de la UAB, Plaça Cívica, 08193 Cerdanyola del Vallès, Spain
7
Cai D, Shang J, Sun Y. HaploDMF: viral haplotype reconstruction from long reads via deep matrix factorization. Bioinformatics 2022; 38:5360-5367. [PMID: 36308467 PMCID: PMC9750122 DOI: 10.1093/bioinformatics/btac708] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 10/06/2022] [Accepted: 10/25/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION Lacking strict proofreading mechanisms, many RNA viruses can generate progeny with slightly changed genomes. Being able to characterize highly similar genomes (i.e. haplotypes) in one virus population helps study the viruses' evolution and their interactions with the host/other microbes. High-throughput sequencing data has become the major source for characterizing viral populations. However, the inherent limitation on read length by next-generation sequencing makes complete haplotype reconstruction difficult. RESULTS In this work, we present a new tool named HaploDMF that can construct complete haplotypes using third-generation sequencing (TGS) data. HaploDMF utilizes a deep matrix factorization model with an adapted loss function to learn latent features from aligned reads automatically. The latent features are then used to cluster reads of the same haplotype. Unlike existing tools whose performance can be affected by the overlap size between reads, HaploDMF is able to achieve highly robust performance on data with different coverage, haplotype number and error rates. In particular, it can generate more complete haplotypes even when the sequencing coverage drops in the middle. We benchmark HaploDMF against the state-of-the-art tools on simulated and real sequencing TGS data on different viruses. The results show that HaploDMF competes favorably against all others. AVAILABILITY AND IMPLEMENTATION The source code and the documentation of HaploDMF are available at https://github.com/dhcai21/HaploDMF. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dehan Cai
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
- Jiayu Shang
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
- Yanni Sun
- To whom correspondence should be addressed.
8
Martínez-González B, Soria ME, Vázquez-Sirvent L, Ferrer-Orta C, Lobo-Vega R, Mínguez P, de la Fuente L, Llorens C, Soriano B, Ramos-Ruíz R, Cortón M, López-Rodríguez R, García-Crespo C, Somovilla P, Durán-Pastor A, Gallego I, de Ávila AI, Delgado S, Morán F, López-Galíndez C, Gómez J, Enjuanes L, Salar-Vidal L, Esteban-Muñoz M, Esteban J, Fernández-Roblas R, Gadea I, Ayuso C, Ruíz-Hornillos J, Verdaguer N, Domingo E, Perales C. SARS-CoV-2 Mutant Spectra at Different Depth Levels Reveal an Overwhelming Abundance of Low Frequency Mutations. Pathogens 2022; 11:662. [PMID: 35745516 PMCID: PMC9227345 DOI: 10.3390/pathogens11060662] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 04/29/2022] [Revised: 06/02/2022] [Accepted: 06/06/2022] [Indexed: 12/23/2022]
Abstract
Populations of RNA viruses are composed of complex and dynamic mixtures of variant genomes that are termed mutant spectra or mutant clouds. This applies also to SARS-CoV-2, and mutations that are detected at low frequency in an infected individual can be dominant (represented in the consensus sequence) in subsequent variants of interest or variants of concern. Here we briefly review the main conclusions of our work on mutant spectrum characterization of hepatitis C virus (HCV) and SARS-CoV-2 at the nucleotide and amino acid levels and address the following two new questions derived from previous results: (i) how is the SARS-CoV-2 mutant and deletion spectrum composition in diagnostic samples, when examined at progressively lower cut-off mutant frequency values in ultra-deep sequencing; (ii) how the frequency distribution of minority amino acid substitutions in SARS-CoV-2 compares with that of HCV sampled also from infected patients. The main conclusions are the following: (i) the number of different mutations found at low frequency in SARS-CoV-2 mutant spectra increases dramatically (50- to 100-fold) as the cut-off frequency for mutation detection is lowered from 0.5% to 0.1%, and (ii) that, contrary to HCV, SARS-CoV-2 mutant spectra exhibit a deficit of intermediate frequency amino acid substitutions. The possible origin and implications of mutant spectrum differences among RNA viruses are discussed.
Affiliation(s)
- Brenda Martínez-González
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain
- Department of Molecular and Cell Biology, Centro Nacional de Biotecnología (CNB-CSIC), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
- María Eugenia Soria
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain
- Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, 28029 Madrid, Spain
- Lucía Vázquez-Sirvent
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain
- Cristina Ferrer-Orta
- Structural Biology Department, Institut de Biología Molecular de Barcelona CSIC, 08028 Barcelona, Spain
- Rebeca Lobo-Vega
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain
- Pablo Mínguez
- Department of Genetics & Genomics, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain
- Centre for Biomedical Network Research on Rare Diseases (CIBERER), Instituto de Salud Carlos III, 28029 Madrid, Spain
- Bioinformatics Unit, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), 28040 Madrid, Spain
- Lorena de la Fuente
- Department of Genetics & Genomics, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain
- Centre for Biomedical Network Research on Rare Diseases (CIBERER), Instituto de Salud Carlos III, 28029 Madrid, Spain
- Bioinformatics Unit, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), 28040 Madrid, Spain
- Carlos Llorens
- Biotechvana, “Scientific Park”, Universidad de Valencia, 46980 Valencia, Spain
- Beatriz Soriano
- Biotechvana, “Scientific Park”, Universidad de Valencia, 46980 Valencia, Spain
- Ricardo Ramos-Ruíz
- Unidad de Genómica, “Scientific Park of Madrid”, Campus de Cantoblanco, 28049 Madrid, Spain
- Marta Cortón
- Department of Genetics & Genomics, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain
- Centre for Biomedical Network Research on Rare Diseases (CIBERER), Instituto de Salud Carlos III, 28029 Madrid, Spain
| | - Rosario López-Rodríguez
- Department of Genetics & Genomics, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain; (P.M.); (L.d.l.F.); (M.C.); (R.L.-R.); (C.A.)
- Centre for Biomedical Network Research on Rare Diseases (CIBERER), Instituto de Salud Carlos III, 28029 Madrid, Spain
| | - Carlos García-Crespo
- Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain; (C.G.-C.); (P.S.); (A.D.-P.); (I.G.); (A.I.d.Á.)
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, 28029 Madrid, Spain;
| | - Pilar Somovilla
- Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain; (C.G.-C.); (P.S.); (A.D.-P.); (I.G.); (A.I.d.Á.)
- Departamento de Biología Molecular, Universidad Autónoma de Madrid, Campus de Cantoblanco, 28049 Madrid, Spain
| | - Antoni Durán-Pastor
- Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain; (C.G.-C.); (P.S.); (A.D.-P.); (I.G.); (A.I.d.Á.)
| | - Isabel Gallego
- Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain; (C.G.-C.); (P.S.); (A.D.-P.); (I.G.); (A.I.d.Á.)
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, 28029 Madrid, Spain;
| | - Ana Isabel de Ávila
- Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain; (C.G.-C.); (P.S.); (A.D.-P.); (I.G.); (A.I.d.Á.)
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, 28029 Madrid, Spain;
| | - Soledad Delgado
- Departamento de Sistemas Informáticos, Escuela Técnica Superior de Ingeniería de Sistemas Informáticos (ETSISI), Universidad Politécnica de Madrid, 28031 Madrid, Spain;
| | - Federico Morán
- Departamento de Bioquímica y Biología Molecular, Universidad Complutense de Madrid, 28005 Madrid, Spain;
| | - Cecilio López-Galíndez
- Unidad de Virología Molecular, Laboratorio de Referencia e Investigación en Retrovirus, Centro Nacional de Microbiología, Instituto de Salud Carlos III, Majadahonda, 28222 Madrid, Spain;
| | - Jordi Gómez
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, 28029 Madrid, Spain;
- Instituto de Parasitología y Biomedicina ‘López-Neyra’ (CSIC), Parque Tecnológico Ciencias de la Salud, Armilla, 18016 Granada, Spain
| | - Luis Enjuanes
- Department of Molecular and Cell Biology, Centro Nacional de Biotecnología (CNB-CSIC), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain;
| | - Llanos Salar-Vidal
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain; (B.M.-G.); (M.E.S.); (L.V.-S.); (R.L.-V.); (L.S.-V.); (M.E.-M.); (J.E.); (R.F.-R.); (I.G.)
| | - Mario Esteban-Muñoz
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain; (B.M.-G.); (M.E.S.); (L.V.-S.); (R.L.-V.); (L.S.-V.); (M.E.-M.); (J.E.); (R.F.-R.); (I.G.)
| | - Jaime Esteban
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain; (B.M.-G.); (M.E.S.); (L.V.-S.); (R.L.-V.); (L.S.-V.); (M.E.-M.); (J.E.); (R.F.-R.); (I.G.)
| | - Ricardo Fernández-Roblas
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain; (B.M.-G.); (M.E.S.); (L.V.-S.); (R.L.-V.); (L.S.-V.); (M.E.-M.); (J.E.); (R.F.-R.); (I.G.)
| | - Ignacio Gadea
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain; (B.M.-G.); (M.E.S.); (L.V.-S.); (R.L.-V.); (L.S.-V.); (M.E.-M.); (J.E.); (R.F.-R.); (I.G.)
| | - Carmen Ayuso
- Department of Genetics & Genomics, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain; (P.M.); (L.d.l.F.); (M.C.); (R.L.-R.); (C.A.)
- Centre for Biomedical Network Research on Rare Diseases (CIBERER), Instituto de Salud Carlos III, 28029 Madrid, Spain
| | - Javier Ruíz-Hornillos
- Allergy Unit, Hospital Infanta Elena, Valdemoro, 28342 Madrid, Spain;
- Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain
- Faculty of Medicine, Universidad Francisco de Vitoria, 28223 Madrid, Spain
| | - Nuria Verdaguer
- Structural Biology Department, Institut de Biología Molecular de Barcelona CSIC, 08028 Barcelona, Spain; (C.F.-O.); (N.V.)
| | - Esteban Domingo
- Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain; (C.G.-C.); (P.S.); (A.D.-P.); (I.G.); (A.I.d.Á.)
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, 28029 Madrid, Spain;
| | - Celia Perales
- Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Av. Reyes Católicos 2, 28040 Madrid, Spain; (B.M.-G.); (M.E.S.); (L.V.-S.); (R.L.-V.); (L.S.-V.); (M.E.-M.); (J.E.); (R.F.-R.); (I.G.)
- Department of Molecular and Cell Biology, Centro Nacional de Biotecnología (CNB-CSIC), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain;
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Instituto de Salud Carlos III, 28029 Madrid, Spain;
| |
|
9
|
Cai D, Sun Y. Reconstructing viral haplotypes using long reads. Bioinformatics 2022; 38:2127-2134. [PMID: 35157018 DOI: 10.1093/bioinformatics/btac089] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 10/11/2021] [Revised: 01/19/2022] [Accepted: 02/08/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Most RNA viruses lack strict proofreading during replication. Coupled with a high replication rate, some RNA viruses can form a virus population containing a group of genetically related but different haplotypes. Characterizing the haplotype composition in a virus population is thus important for understanding viral evolution. Many attempts have been made to reconstruct viral haplotypes using next-generation sequencing (NGS) reads. However, NGS reads are too short to span distant single-nucleotide variants, making it difficult to reconstruct complete or near-complete haplotypes. Given the fast development of third-generation sequencing technologies, a new opportunity has arisen for reconstructing full-length haplotypes with long reads. RESULTS In this work, we developed a new tool, RVHaplo, to reconstruct haplotypes for known viruses from long reads. We tested it rigorously on both simulated and real viral sequencing data and compared it against other popular haplotype reconstruction tools. The results demonstrated that RVHaplo outperforms the state-of-the-art tools for viral haplotype reconstruction from long reads. In particular, RVHaplo can reconstruct rare (1% abundance) haplotypes that other tools usually miss. AVAILABILITY AND IMPLEMENTATION The source code and the documentation of RVHaplo are available at https://github.com/dhcai21/RVHaplo. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dehan Cai
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
| |
|
10
|
Valieris R, Drummond RD, Defelicibus A, Dias-Neto E, Rosales RA, Tojal da Silva I. A mixture model for determining SARS-Cov-2 variant composition in pooled samples. Bioinformatics 2022; 38:1809-1815. [PMID: 35104309 DOI: 10.1093/bioinformatics/btac047] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 10/04/2021] [Revised: 12/14/2021] [Accepted: 01/26/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Despite the fast development of highly effective vaccines to control the current COVID-19 pandemic, the unequal distribution and availability of these vaccines worldwide and the number of people infected lead to the continuous emergence of Severe Acute Respiratory Syndrome coronavirus 2 (SARS-CoV-2) variants of concern. Therefore, it is likely that real-time genomic surveillance will be continuously needed as a monitoring tool to follow the spread of the disease and the evolution of the virus. In this context, new genomic variants of SARS-CoV-2, including variants refractory to current vaccines, make genomic surveillance programs tools of utmost importance. Nevertheless, the lack of appropriate analytical tools to quickly and effectively assess the viral composition in meta-transcriptomic sequencing data, including environmental surveillance, represents a challenge that may slow the adoption of this approach to mitigate the spread and transmission of viruses. RESULTS We propose a statistical model for the estimation of the relative frequencies of SARS-CoV-2 variants in pooled samples. This model is built by considering a previously defined selection of genomic polymorphisms that characterize SARS-CoV-2 variants. The methods described here support both raw sequencing reads for polymorphism-based marker calling and predefined markers in the variant call format. Results obtained using simulated data show that our method is quite effective in recovering the correct variant proportions. Further, results obtained by considering longitudinal data from wastewater samples of two locations in Switzerland agree well with those describing the epidemiological evolution of COVID-19 variants in clinical samples of these locations. Our results show that the described method can be a valuable tool for tracking the proportions of SARS-CoV-2 variants in complex mixtures such as wastewater and environmental samples.
AVAILABILITY AND IMPLEMENTATION http://github.com/rvalieris/LCS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
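The RESULTS paragraph above describes estimating relative variant frequencies from a predefined set of marker polymorphisms. The following is a minimal sketch of that general idea and is not the LCS implementation: it assumes each read at a marker comes from variant k with probability equal to that variant's proportion, and fits the proportions by expectation-maximization. The function name, the signature matrix, and the read counts are hypothetical and chosen only for illustration.

```python
def estimate_variant_proportions(signatures, alt_counts, ref_counts, n_iter=500):
    """Illustrative EM estimate of variant proportions in a pooled sample.

    signatures[k][j] is the assumed probability that a read from variant k
    shows the alternate allele at marker j (values strictly between 0 and 1).
    alt_counts[j] and ref_counts[j] are observed read counts at marker j.
    """
    n_variants = len(signatures)
    props = [1.0 / n_variants] * n_variants  # start from a uniform mixture
    for _ in range(n_iter):
        expected = [0.0] * n_variants  # expected number of reads per variant
        for j, (alt, ref) in enumerate(zip(alt_counts, ref_counts)):
            # E-step: unnormalised responsibilities for alt and ref reads
            w_alt = [props[k] * signatures[k][j] for k in range(n_variants)]
            w_ref = [props[k] * (1.0 - signatures[k][j]) for k in range(n_variants)]
            s_alt, s_ref = sum(w_alt), sum(w_ref)
            for k in range(n_variants):
                if s_alt > 0:
                    expected[k] += alt * w_alt[k] / s_alt
                if s_ref > 0:
                    expected[k] += ref * w_ref[k] / s_ref
        # M-step: proportions are the normalised expected read counts
        total = sum(expected)
        props = [e / total for e in expected]
    return props


# Hypothetical example: two variants, four markers, 1000 reads per marker.
signatures = [[0.99, 0.99, 0.01, 0.01],
              [0.01, 0.01, 0.99, 0.99]]
alt_counts = [696, 696, 304, 304]
ref_counts = [304, 304, 696, 696]
print(estimate_variant_proportions(signatures, alt_counts, ref_counts))
```

The small error rates in the signature matrix (0.01 and 0.99 instead of 0 and 1) are a design choice of this sketch: they keep every responsibility well defined even when a read disagrees with all variant signatures.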
Affiliation(s)
- Renan Valieris
- Laboratory of Computational Biology and Bioinformatics, CIPE/A.C. Camargo Cancer Center, São Paulo 01508-010, Brazil
| | - Rodrigo D Drummond
- Laboratory of Computational Biology and Bioinformatics, CIPE/A.C. Camargo Cancer Center, São Paulo 01508-010, Brazil
| | - Alexandre Defelicibus
- Laboratory of Computational Biology and Bioinformatics, CIPE/A.C. Camargo Cancer Center, São Paulo 01508-010, Brazil
| | - Emmanuel Dias-Neto
- Laboratory of Medical Genomics, CIPE/A.C. Camargo Cancer Center, São Paulo 01508-010, Brazil
| | - Rafael A Rosales
- Departamento de Computação e Matemática, Universidade de São Paulo, Ribeirão Preto, São Paulo 14040-901, Brazil
| | - Israel Tojal da Silva
- Laboratory of Computational Biology and Bioinformatics, CIPE/A.C. Camargo Cancer Center, São Paulo 01508-010, Brazil
| |
|
11
|
Liao H, Cai D, Sun Y. VirStrain: a strain identification tool for RNA viruses. Genome Biol 2022; 23:38. [PMID: 35101081 PMCID: PMC8801933 DOI: 10.1186/s13059-022-02609-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/21/2021] [Accepted: 01/12/2022] [Indexed: 12/18/2022] Open
Abstract
Viruses change constantly during replication, leading to high intra-species diversity. Although many changes are neutral or deleterious, some can confer on the virus different biological properties such as better adaptability. In addition, viral genotypes often have associated metadata, such as host residence, which can help with inferring viral transmission during pandemics. Thus, subspecies analysis can provide important insights into virus characterization. Here, we present VirStrain, a tool that takes short reads as input and outputs the viral strain composition. We rigorously test VirStrain on multiple simulated and real virus sequencing datasets. VirStrain outperforms the state-of-the-art tools in both sensitivity and accuracy.
Affiliation(s)
- Herui Liao
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
| | - Dehan Cai
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, China.
| |
|
12
|
Abstract
Viral quasispecies are dynamic distributions of nonidentical but closely related mutant and recombinant viral genomes subjected to a continuous process of genetic variation, competition, and selection that may act as a unit of selection. The quasispecies concept owes its theoretical origins to a model for the origin of life as a collection of mutant RNA replicators. Independently, experimental evidence for the quasispecies concept was obtained from sampling of bacteriophage clones, which revealed that the viral populations consisted of many mutant genomes whose frequency varied with time of replication. Similar findings were made in animal and plant RNA viruses. Quasispecies became a theoretical framework to understand viral population dynamics and adaptability. The evidence came at a time when mutations were considered rare events in genetics, a perception that was to change dramatically in subsequent decades. Indeed, viral quasispecies was the conceptual forefront of a remarkable degree of biological diversity, now evident for cell populations and organisms, not only for viruses. Quasispecies dynamics unveiled complexities in the behavior of viral populations, with consequences for disease mechanisms and control strategies. This review addresses the origin of the quasispecies concept, its major implications for both viral evolution and antiviral strategies, and current and future prospects.
Affiliation(s)
- Esteban Domingo
- Department of Interactions with the Environment, Centro de Biología Molecular Severo Ochoa (CBMSO), Consejo Superior de Investigaciones Científicas (CSIC), 28049 Madrid, Spain; Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain
| | - Carlos García-Crespo
- Department of Interactions with the Environment, Centro de Biología Molecular Severo Ochoa (CBMSO), Consejo Superior de Investigaciones Científicas (CSIC), 28049 Madrid, Spain;
| | - Celia Perales
- Department of Interactions with the Environment, Centro de Biología Molecular Severo Ochoa (CBMSO), Consejo Superior de Investigaciones Científicas (CSIC), 28049 Madrid, Spain; Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain; Department of Clinical Microbiology, Instituto de Investigación Sanitaria-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), 28040 Madrid, Spain
| |
|
13
|
Pelizzola M, Behr M, Li H, Munk A, Futschik A. Multiple haplotype reconstruction from allele frequency data. Nat Comput Sci 2021; 1:262-271. [PMID: 38217170 DOI: 10.1038/s43588-021-00056-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 08/14/2020] [Accepted: 03/12/2021] [Indexed: 01/15/2024]
Abstract
Because haplotype information is of widespread interest in biomedical applications, considerable effort has been put into haplotype reconstruction. Here, we propose an efficient method, called haploSep, that is able to accurately infer major haplotypes and their frequencies just from multiple samples of allele frequency data. Even the accuracy of experimentally obtained allele frequencies can be improved by re-estimating them from our reconstructed haplotypes. From a methodological point of view, we model our problem as a multivariate regression problem where both the design matrix and the coefficient matrix are unknown. Compared to other methods, haploSep is very fast, with linear computational complexity in the haplotype length. We illustrate our method on simulated and real data focusing on experimental evolution and microbial data.
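The regression formulation mentioned in this abstract can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the abstract: $Y$ holds the observed allele frequencies at $N$ loci in $M$ samples, $S$ is the unknown binary haplotype structure for $m$ haplotypes, and $\Omega$ holds the unknown haplotype frequencies in each sample.

```latex
% Illustrative notation (assumed, not quoted from the source):
%   Y      : N x M matrix of observed allele frequencies
%   S      : N x m unknown binary matrix of haplotype alleles (design matrix)
%   \Omega : m x M unknown matrix of haplotype frequencies (coefficient matrix)
%   E      : N x M noise term
\begin{aligned}
Y &= S\,\Omega + E, \qquad
S \in \{0,1\}^{N \times m}, \quad
\Omega \in [0,1]^{m \times M}, \quad
\sum_{k=1}^{m} \Omega_{kj} = 1 \ \text{for each sample } j, \\
(\hat{S}, \hat{\Omega}) &= \operatorname*{arg\,min}_{S,\,\Omega}\ \lVert Y - S\,\Omega \rVert_F^{2}.
\end{aligned}
```

Under these assumptions, each entry $Y_{ij}$ is modeled as the total frequency of the haplotypes that carry the allele at locus $i$ in sample $j$, which is why both factors have to be estimated jointly.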
Affiliation(s)
- Marta Pelizzola
- Vetmeduni Vienna, Vienna, Austria
- Vienna Graduate School of Population Genetics, Vienna, Austria
| | - Merle Behr
- University of California, Berkeley, CA, USA
| | - Housen Li
- University of Göttingen, Göttingen, Germany
- Cluster of Excellence 'Multiscale Bioimaging: from Molecular Machines to Networks of Excitable Cells' (MBExC), University of Göttingen, Göttingen, Germany
| | - Axel Munk
- University of Göttingen, Göttingen, Germany
- Cluster of Excellence 'Multiscale Bioimaging: from Molecular Machines to Networks of Excitable Cells' (MBExC), University of Göttingen, Göttingen, Germany
- Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
| | | |
|
14
|
Cao C, He J, Mak L, Perera D, Kwok D, Wang J, Li M, Mourier T, Gavriliuc S, Greenberg M, Morrissy AS, Sycuro LK, Yang G, Jeffares DC, Long Q. Reconstruction of Microbial Haplotypes by Integration of Statistical and Physical Linkage in Scaffolding. Mol Biol Evol 2021; 38:2660-2672. [PMID: 33547786 PMCID: PMC8136496 DOI: 10.1093/molbev/msab037] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/30/2022] Open
Abstract
DNA sequencing technologies provide unprecedented opportunities to analyze within-host evolution of microorganism populations. Often, within-host populations are analyzed via pooled sequencing of the population, which contains multiple individuals or "haplotypes." However, current next-generation sequencing instruments, in conjunction with single-molecule barcoded linked-reads, cannot distinguish long haplotypes directly. Computational reconstruction of haplotypes from pooled sequencing has been attempted in virology, bacterial genomics, metagenomics, and human genetics, using algorithms based on either cross-host genetic sharing or within-host genomic reads. Here, we describe PoolHapX, a flexible computational approach that integrates information from both genetic sharing and genomic sequencing. We demonstrated that PoolHapX outperforms state-of-the-art tools tailored to specific organismal systems, and is robust to within-host evolution. Importantly, together with barcoded linked-reads, PoolHapX can infer whole-chromosome-scale haplotypes from 50 pools each containing 12 different haplotypes. By analyzing real data, we uncovered dynamic variations in the evolutionary processes of within-patient HIV populations previously unobserved in single position-based analysis.
Affiliation(s)
- Chen Cao
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
| | - Jingni He
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Cardiology, Xiangya Hospital, Central South University, Changsha, China
| | - Lauren Mak
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Present address: Tri-Institutional Computational Biology & Medicine Program, Weill Cornell Medicine of Cornell University, New York, NY, USA
| | - Deshan Perera
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
| | - Devin Kwok
- Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
| | - Jia Wang
- Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, USA
| | - Minghao Li
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
| | - Tobias Mourier
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
| | - Stefan Gavriliuc
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
| | - Matthew Greenberg
- Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada
| | - A Sorana Morrissy
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada
| | - Laura K Sycuro
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Microbiology, Immunology, and Infectious Diseases, Snyder Institute for Chronic Diseases, University of Calgary, Calgary, AB, Canada
| | - Guang Yang
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Medical Genetics, University of Calgary, Calgary, AB, Canada
| | - Daniel C Jeffares
- Department of Biology, York Biomedical Research Institute, University of York, York, United Kingdom
| | - Quan Long
- Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB, Canada; Department of Mathematics & Statistics, University of Calgary, Calgary, AB, Canada; Department of Medical Genetics, University of Calgary, Calgary, AB, Canada; Hotchkiss Brain Institute, O’Brien Institute for Public Health, University of Calgary, Calgary, AB, Canada (corresponding author)
| |
|
15
|
Cao C, Greenberg M, Long Q. WgLink: reconstructing whole-genome viral haplotypes using L0+L1-regularization. Bioinformatics 2021; 37:2744-2746. [PMID: 33532820 DOI: 10.1093/bioinformatics/btab076] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/14/2020] [Revised: 12/23/2020] [Accepted: 01/29/2021] [Indexed: 12/24/2022] Open
Abstract
SUMMARY Many tools can reconstruct viral sequences based on next-generation sequencing reads. Although existing tools effectively recover local regions, their accuracy suffers when reconstructing whole viral genomes (strains). Moreover, they consume significant memory when the sequencing coverage is high or when the genome size is large. We present WgLink to meet this challenge. WgLink takes local reconstructions produced by other tools as input and patches the resulting segments together into coherent whole-genome strains. We accomplish this using an L0+L1-regularized regression that synthesizes variant allele frequency data with physical linkage between multiple variants spanning multiple regions simultaneously. WgLink achieves higher accuracy than existing tools on both simulated and real data sets while using significantly less memory (RAM) and fewer CPU hours. AVAILABILITY Source code and binaries are freely available at https://github.com/theLongLab/wglink. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chen Cao
- Department of Biochemistry & Molecular Biology, Alberta Children's Hospital Research Institute, Calgary, AB, T2N 4N1, Canada
| | - Matthew Greenberg
- Department of Mathematics & Statistics, Calgary, AB, T2N 4N1, Canada
| | - Quan Long
- Department of Biochemistry & Molecular Biology, Alberta Children's Hospital Research Institute, Calgary, AB, T2N 4N1, Canada; Department of Mathematics & Statistics, Calgary, AB, T2N 4N1, Canada; Department of Medical Genetics, Hotchkiss Brain Institute, University of Calgary, Calgary, AB, T2N 4N1, Canada
| |
|
16
|
García-Crespo C, Soria ME, Gallego I, de Ávila AI, Martínez-González B, Vázquez-Sirvent L, Gómez J, Briones C, Gregori J, Quer J, Perales C, Domingo E. Dissimilar Conservation Pattern in Hepatitis C Virus Mutant Spectra, Consensus Sequences, and Data Banks. J Clin Med 2020; 9:jcm9113450. [PMID: 33121037 PMCID: PMC7692060 DOI: 10.3390/jcm9113450] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/08/2020] [Revised: 10/15/2020] [Accepted: 10/20/2020] [Indexed: 02/07/2023] Open
Abstract
The influence of quasispecies dynamics on long-term virus diversification in nature is a largely unexplored question. Specifically, whether intra-host nucleotide and amino acid variation in quasispecies fit the variation observed in consensus sequences or data bank alignments is unknown. Genome conservation and dynamics simulations are used for the computational design of universal vaccines, therapeutic antibodies and pan-genomic antiviral agents. The expectation is that selection of escape mutants will be limited when mutations at conserved residues are required. This strategy assumes long-term (epidemiologically relevant) conservation but, critically, does not consider short-term (quasispecies-dictated) residue conservation. We calculated mutant frequencies of individual loci from mutant spectra of hepatitis C virus (HCV) populations passaged in cell culture and from infected patients. Nucleotide or amino acid conservation in consensus sequences of the same populations, or in the Los Alamos HCV data bank did not match residue conservation in mutant spectra. The results relativize the concept of sequence conservation in viral genetics and suggest that residue invariance in data banks is an insufficient basis for the design of universal viral ligands for clinical purposes. Our calculations suggest relaxed mutational restrictions during quasispecies dynamics, which may contribute to higher calculated short-term than long-term viral evolutionary rates.
Affiliation(s)
- Carlos García-Crespo
- Department of Interactions with the environment, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
| | - María Eugenia Soria
- Department of Interactions with the environment, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
- Department of Clinical Microbiology, IIS-Fundación Jiménez Díaz, UAM. Av. Reyes Católicos 2, 28040 Madrid, Spain
| | - Isabel Gallego
- Department of Interactions with the environment, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain
| | - Ana Isabel de Ávila
- Department of Interactions with the environment, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
| | - Brenda Martínez-González
- Department of Interactions with the environment, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
- Department of Clinical Microbiology, IIS-Fundación Jiménez Díaz, UAM. Av. Reyes Católicos 2, 28040 Madrid, Spain
| | - Lucía Vázquez-Sirvent
- Department of Interactions with the environment, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
| | - Jordi Gómez
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain
- Department of Molecular Biology, Instituto de Parasitología y Biomedicina ‘López-Neyra’ (CSIC), Parque Tecnológico Ciencias de la Salud, Armilla, 18016 Granada, Spain
| | - Carlos Briones
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain
- Department of Molecular Evolution, Centro de Astrobiología (CAB, CSIC-INTA), Torrejón de Ardoz, 28850 Madrid, Spain
| | - Josep Gregori
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain
- Liver Unit, Liver Diseases—Viral Hepatitis, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
- Roche Diagnostics, S.L., Sant Cugat del Vallés, 08174 Barcelona, Spain
| | - Josep Quer
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain
- Liver Unit, Liver Diseases—Viral Hepatitis, Vall d’Hebron Institut de Recerca (VHIR), Vall d’Hebron Hospital Universitari, Vall d’Hebron Barcelona Hospital Campus, Passeig Vall d’Hebron 119-129, 08035 Barcelona, Spain
| | - Celia Perales
- Department of Interactions with the environment, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
- Department of Clinical Microbiology, IIS-Fundación Jiménez Díaz, UAM. Av. Reyes Católicos 2, 28040 Madrid, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain
- Correspondence: C.P. and E.D.
| | - Esteban Domingo
- Department of Interactions with the environment, Centro de Biología Molecular “Severo Ochoa” (CSIC-UAM), Consejo Superior de Investigaciones Científicas (CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain
- Correspondence: C.P. and E.D.
|
17
|
Eliseev A, Gibson KM, Avdeyev P, Novik D, Bendall ML, Pérez-Losada M, Alexeev N, Crandall KA. Evaluation of haplotype callers for next-generation sequencing of viruses. Infection, Genetics and Evolution 2020; 82:104277. [PMID: 32151775 PMCID: PMC7293574 DOI: 10.1016/j.meegid.2020.104277] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/04/2019] [Revised: 03/04/2020] [Accepted: 03/06/2020] [Indexed: 01/30/2023]
Abstract
Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes. PredictHaplo outperformed the other haplotype reconstruction tools, with the highest precision and recall and the lowest UniFrac distance values, followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.
Affiliation(s)
- Anton Eliseev
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
| | - Keylie M Gibson
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
| | - Pavel Avdeyev
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Mathematics, George Washington University, Washington, DC, USA
| | - Dmitry Novik
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
| | - Matthew L Bendall
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
| | - Marcos Pérez-Losada
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal
| | - Nikita Alexeev
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
| | - Keith A Crandall
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
|
18
|
Kadoya SS, Urayama SI, Nunoura T, Hirai M, Takaki Y, Kitajima M, Nakagomi T, Nakagomi O, Okabe S, Nishimura O, Sano D. Bottleneck Size-Dependent Changes in the Genetic Diversity and Specific Growth Rate of a Rotavirus A Strain. J Virol 2020; 94:e02083-19. [PMID: 32132235 PMCID: PMC7199400 DOI: 10.1128/jvi.02083-19] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 12/16/2019] [Accepted: 02/21/2020] [Indexed: 12/24/2022] Open
Abstract
RNA viruses form a dynamic distribution of mutant swarms (termed "quasispecies") due to the accumulation of mutations in the viral genome. The genetic diversity of a viral population is affected by several factors, including a bottleneck effect. Human-to-human transmission exemplifies a bottleneck effect, in that only part of a viral population can reach the next susceptible hosts. In the present study, two lineages of the rhesus rotavirus (RRV) strain of rotavirus A were serially passaged five times at a multiplicity of infection (MOI) of 0.1 or 0.001, and three phenotypes (infectious titer, cell binding ability, and specific growth rate) were used to evaluate the impact of a bottleneck effect on the RRV population. The specific growth rate values of lineages passaged under the stronger bottleneck (MOI of 0.001) were higher after five passages. The nucleotide diversity also increased, which indicated that the mutant swarms of the lineages under the stronger bottleneck effect were expanded through the serial passages. The random distribution of synonymous and nonsynonymous substitutions on rotavirus genome segments indicated that almost all mutations were selectively neutral. Simple simulations revealed that the presence of minor mutants could influence the specific growth rate of a population in a mutant frequency-dependent manner. These results indicate a stronger bottleneck effect can create more sequence spaces for minor sequences.
IMPORTANCE In this study, we investigated a bottleneck effect on an RRV population that may drastically affect the viral population structure. RRV populations were serially passaged under two levels of a bottleneck effect, which exemplified human-to-human transmission.
As a result, the genetic diversity and specific growth rate of RRV populations increased under the stronger bottleneck effect, which implied that a bottleneck created a new space in a population for minor mutants originally existing in a hidden layer, which includes minor mutations that cannot be distinguished from a sequencing error. The results of this study suggest that the genetic drift caused by a bottleneck in human-to-human transmission explains the random appearance of new genetic lineages causing viral outbreaks, which can be expected according to molecular epidemiology using next-generation sequencing in which the viral genetic diversity within a viral population is investigated.
Affiliation(s)
- Syun-Suke Kadoya
- Department of Civil and Environmental Engineering, Graduate School of Engineering, Tohoku University, Aoba-ku, Sendai, Miyagi, Japan
| | - Syun-Ichi Urayama
- Graduate School of Life and Environmental Sciences, University of Tsukuba, Tsukuba, Ibaraki, Japan
- Research Center for Bioscience and Nanoscience (CeBN), Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Kanagawa, Japan
| | - Takuro Nunoura
- Research Center for Bioscience and Nanoscience (CeBN), Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Kanagawa, Japan
| | - Miho Hirai
- Super-cutting-edge Grand and Advanced Research (SUGAR) Program, Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Kanagawa, Japan
| | - Yoshihiro Takaki
- Super-cutting-edge Grand and Advanced Research (SUGAR) Program, Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Kanagawa, Japan
| | - Masaaki Kitajima
- Division of Environmental Engineering, Faculty of Engineering, Hokkaido University, Sapporo, Hokkaido, Japan
| | - Toyoko Nakagomi
- Department of Molecular Microbiology and Immunology, Nagasaki University, Nagasaki, Japan
| | - Osamu Nakagomi
- Department of Molecular Microbiology and Immunology, Nagasaki University, Nagasaki, Japan
| | - Satoshi Okabe
- Division of Environmental Engineering, Faculty of Engineering, Hokkaido University, Sapporo, Hokkaido, Japan
| | - Osamu Nishimura
- Department of Civil and Environmental Engineering, Graduate School of Engineering, Tohoku University, Aoba-ku, Sendai, Miyagi, Japan
| | - Daisuke Sano
- Department of Civil and Environmental Engineering, Graduate School of Engineering, Tohoku University, Aoba-ku, Sendai, Miyagi, Japan
- Department of Environmental Studies, Tohoku University, Aoba-ku, Sendai, Miyagi, Japan
|
19
|
Chen J, Shang J, Wang J, Sun Y. A binning tool to reconstruct viral haplotypes from assembled contigs. BMC Bioinformatics 2019; 20:544. [PMID: 31684876 PMCID: PMC6829986 DOI: 10.1186/s12859-019-3138-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/01/2019] [Accepted: 10/09/2019] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Infections by RNA viruses such as influenza and HIV still pose a serious threat to human health despite extensive research on viral diseases. One challenge for producing effective prevention and treatment strategies is high intra-species genetic diversity. As different strains may have different biological properties, characterizing the genetic diversity is thus important to vaccine and drug design. Next-generation sequencing technology enables comprehensive characterization of both known and novel strains and has been widely adopted for sequencing viral populations. However, genome-scale reconstruction of haplotypes is still a challenging problem. In particular, haplotype assembly programs often produce contigs rather than full genomes. As a mutation in one gene can mask the phenotypic effects of a mutation at another locus, clustering these contigs into genome-scale haplotypes is still needed. RESULTS We developed a contig binning tool, VirBin, which clusters contigs into different groups so that each group represents a haplotype. Commonly used features based on sequence composition and contig coverage cannot effectively distinguish viral haplotypes because of their high sequence similarity and heterogeneous sequencing coverage for RNA viruses. VirBin applied prototype-based clustering to cluster regions that are more likely to contain mutations specific to a haplotype. The tool was tested on multiple simulated sequencing datasets with different haplotype abundance distributions and contig sizes, and also on mock quasispecies sequencing data. The benchmark results with other contig binning tools demonstrated the superior sensitivity and precision of VirBin in contig binning for viral haplotype reconstruction. CONCLUSIONS In this work, we presented VirBin, a new contig binning tool for distinguishing contigs from different viral haplotypes with high sequence similarity. It competes favorably with other tools on viral contig binning. The source code is available at: https://github.com/chjiao/VirBin.
Affiliation(s)
- Jiao Chen
- Computer Science and Engineering, Michigan State University, East Lansing, 48824, USA
| | - Jiayu Shang
- Electrical Engineering, City University of Hong Kong, Hong Kong, China
| | - Jianrong Wang
- Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, 48824, USA
| | - Yanni Sun
- Electrical Engineering, City University of Hong Kong, Hong Kong, China
|