51
|
Jayasundara D, Saeed I, Chang BC, Tang SL, Halgamuge SK. Accurate reconstruction of viral quasispecies spectra through improved estimation of strain richness. BMC Bioinformatics 2015; 16 Suppl 18:S3. [PMID: 26678073 PMCID: PMC4682401 DOI: 10.1186/1471-2105-16-s18-s3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/13/2023] Open
Abstract
BACKGROUND Estimating the number of different species (richness) in a mixed microbial population has been a main focus in metagenomic research. Existing methods of species richness estimation rely on the assumption that the reads in each assembled contig correspond to only one of the microbial genomes in the population. This assumption and the underlying probabilistic formulations of existing methods are not useful for quasispecies populations where the strains are highly genetically related. RESULTS On benchmark data sets, our estimation method provided accurate richness estimates (< 0.2 median estimation error) and improved the precision of ViQuaS by 2%-13% and its F-score by 1%-9% without compromising the recall rates. We also demonstrate that our estimation method can be used to improve the precision and F-score of ShoRAH by 0%-7% and 0%-5%, respectively. CONCLUSIONS The proposed probabilistic estimation method can be used to estimate the richness of viral populations with a quasispecies behavior and to improve the accuracy of the quasispecies spectra reconstructed by the existing methods ViQuaS and ShoRAH in the presence of a moderate level of technical sequencing errors. AVAILABILITY http://sourceforge.net/projects/viquas/.
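The precision, recall and F-score figures quoted in this abstract can be illustrated with a minimal sketch. It assumes that a reconstructed strain counts as correct only when it matches a true strain exactly; the abstract does not state the matching criterion, and the function name `spectrum_scores` and the toy sequences are hypothetical.

```python
def spectrum_scores(true_haplotypes, reconstructed_haplotypes):
    """Precision, recall and F-score of a reconstructed spectrum.

    Assumption: a reconstruction is a true positive only if it is
    identical to one of the true haplotype sequences.
    """
    truth = set(true_haplotypes)
    called = set(reconstructed_haplotypes)
    true_positives = len(truth & called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        # Harmonic mean of precision and recall
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data: three true strains, four reconstructed strains (two spurious)
truth = ["ACGT", "ACGA", "TCGT"]
called = ["ACGT", "ACGA", "ACCC", "GGGT"]
precision, recall, f_score = spectrum_scores(truth, called)
print(precision, recall, f_score)
```

In this toy case, dropping the two spurious reconstructions would raise precision while leaving recall unchanged, which is the kind of gain a better richness estimate is reported to give.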
Affiliation(s)
- Duleepa Jayasundara
- Optimisation and Pattern Recognition Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, VIC 3010, Parkville, Australia
| | - I Saeed
- Optimisation and Pattern Recognition Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, VIC 3010, Parkville, Australia
| | - BC Chang
- Yourgene Bioscience, No. 376-5, Fuxing Rd., Shu-Lin District, New Taipei City, Taiwan
| | - Sen-Lin Tang
- Biodiversity Research Center, Academia Sinica, Taipei 11529, Nan-Kang, Taiwan
| | - Saman K Halgamuge
- Optimisation and Pattern Recognition Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, VIC 3010, Parkville, Australia
| |
|
52
|
Abstract
Real-time PCR is the traditional face of nucleic acid detection in the diagnostic microbiology laboratory and is now generally regarded as robust enough to be widely adopted. Methods based on nucleic acid detection of this type are bringing increased accuracy to diagnosis in areas where culture is difficult and/or expensive, and these methods are often effective partners to other rapid molecular diagnostic tools such as matrix-assisted laser desorption ionisation-time of flight mass spectrometry (MALDI-TOF MS). This change in practice has particularly affected the recognition of viruses and fastidious or antibiotic-exposed bacteria, but has also been shown to be effective in the recognition of troublesome or specialised phenotypes such as antiviral resistance and transmissible antibiotic resistance in the Enterobacteriaceae. Quantitation and high-intensity sequencing (of multiple whole genomes) have brought new opportunities as well as new challenges to the microbiology community. Diagnostic microbiologists currently in training might be expected to deal less with the culture-based techniques of the last half-century than with the high-volume data and complex analyses of the next.
|
53
|
Wu SH, Rodrigo AG. Estimation of evolutionary parameters using short, random and partial sequences from mixed samples of anonymous individuals. BMC Bioinformatics 2015; 16:357. [PMID: 26536860 PMCID: PMC4634753 DOI: 10.1186/s12859-015-0810-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2015] [Accepted: 10/30/2015] [Indexed: 11/17/2022] Open
Abstract
Background Over the last decade, next generation sequencing (NGS) has become widely available, and is now the sequencing technology of choice for most researchers. Nonetheless, NGS presents a challenge for evolutionary biologists who wish to estimate evolutionary genetic parameters from a mixed sample of unlabelled or untagged individuals, especially when the reconstruction of full-length haplotypes can be unreliable. We propose two novel approaches, least squares estimation (LS) and Approximate Bayesian Computation Markov chain Monte Carlo estimation (ABC-MCMC), to infer evolutionary genetic parameters from a collection of short-read sequences obtained from a mixed sample of anonymous DNA, using only the frequencies of nucleotides at each site and without reconstructing either the full-length alignment or the phylogeny. Results We used simulations to evaluate the performance of these algorithms, and our results demonstrate that LS performs poorly because bootstrap 95% Confidence Intervals (CIs) tend to under- or over-estimate the true values of the parameters. In contrast, the 95% Highest Posterior Density (HPD) intervals recovered from ABC-MCMC enclosed the true parameter values at a rate approximately equivalent to that obtained using BEAST, a program that implements a Bayesian MCMC estimation of evolutionary parameters using full-length sequences. Because there is a loss of information with the use of sitewise nucleotide frequencies alone, the ABC-MCMC 95% HPDs are larger than those obtained by BEAST. Conclusion We propose two novel algorithms to estimate evolutionary genetic parameters based on the proportion of each nucleotide. The LS method cannot be recommended as a standalone method for evolutionary parameter estimation. On the other hand, parameters recovered by ABC-MCMC are comparable to those obtained using BEAST, but with larger 95% HPDs. One major advantage of ABC-MCMC is that computational time scales linearly with the number of short-read sequences, and is independent of the number of full-length sequences in the original data. This allows us to perform the analysis on NGS datasets with large numbers of short read fragments. The source code for ABC-MCMC is available at https://github.com/stevenhwu/SF-ABC.
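The sitewise nucleotide frequencies that both estimators take as input can be sketched as follows. This is only an illustration of the summary statistic, not code from the SF-ABC repository. The read representation (a 0-based start position plus a sequence on a shared coordinate system) and the function name `sitewise_frequencies` are assumptions.

```python
from collections import Counter


def sitewise_frequencies(reads, genome_length):
    """Per-site nucleotide frequencies from placed short reads.

    reads: iterable of (start, sequence) pairs; start is a 0-based
    position on a shared coordinate system (an assumed input format).
    """
    counts = [Counter() for _ in range(genome_length)]
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            # Ignore bases that fall outside the region or are ambiguous
            if 0 <= position < genome_length and base in "ACGT":
                counts[position][base] += 1

    frequencies = []
    for site in counts:
        depth = sum(site.values())
        frequencies.append(
            {base: (site[base] / depth if depth else 0.0) for base in "ACGT"}
        )
    return frequencies


# Toy data: four short reads covering a 4-site region
reads = [(0, "ACG"), (1, "CGT"), (1, "TGT"), (2, "GT")]
freqs = sitewise_frequencies(reads, 4)
print(freqs[1])
```

A single pass over the reads is enough to build these counts, which is consistent with the abstract's remark that run time scales with the number of short reads.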
Affiliation(s)
- Steven H Wu
- Biodesign Institute, Arizona State University, Tempe, AZ, 85287, USA; Department of Biology, Duke University, Box 90338, Durham, NC, 27708, USA.
| | - Allen G Rodrigo
- Department of Biology, Duke University, Box 90338, Durham, NC, 27708, USA; The National Evolutionary Synthesis Center, Durham, NC, 27705, USA.
| |
|
54
|
High-resolution genetic profile of viral genomes: why it matters. Curr Opin Virol 2015; 14:62-70. [DOI: 10.1016/j.coviro.2015.08.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 05/19/2015] [Revised: 08/07/2015] [Accepted: 08/07/2015] [Indexed: 12/12/2022]
|
55
|
Iyer S, Casey E, Bouzek H, Kim M, Deng W, Larsen BB, Zhao H, Bumgarner RE, Rolland M, Mullins JI. Comparison of Major and Minor Viral SNPs Identified through Single Template Sequencing and Pyrosequencing in Acute HIV-1 Infection. PLoS One 2015; 10:e0135903. [PMID: 26317928 PMCID: PMC4552882 DOI: 10.1371/journal.pone.0135903] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 10/15/2014] [Accepted: 07/27/2015] [Indexed: 01/03/2023] Open
Abstract
Massively parallel sequencing (MPS) technologies, such as 454-pyrosequencing, allow for the identification of variants in sequence populations at lower levels than consensus sequencing and most single-template Sanger sequencing experiments. We sought to determine if the greater depth of population sampling attainable using MPS technology would allow detection of minor variants in HIV founder virus populations very early in infection in instances where Sanger sequencing detects only a single variant. We compared single nucleotide polymorphisms (SNPs) during acute HIV-1 infection from 32 subjects using both single template Sanger and 454-pyrosequencing. Pyrosequences from a median of 2400 viral templates per subject, encompassing 40% of the HIV-1 genome, were compared to a median of five individually amplified near full-length viral genomes sequenced using Sanger technology. There was no difference in the consensus nucleotide sequences over the 3.6 kb compared in 84% of the subjects infected with single founders and 33% of subjects infected with multiple founder variants; among the subjects with disagreements, mismatches were found in less than 1% of the sites evaluated (of a total of nearly 117,000 sites across all subjects). The majority of the SNPs observed only in pyrosequences were present at less than 2% of the subject’s viral sequence population. These results demonstrate the utility of the Sanger approach for study of early HIV infection, provide guidance regarding the design, utility and limitations of population sequencing from variable template sources, and emphasize parameters for improving the interpretation of massively parallel sequencing data to address important questions regarding target sequence evolution.
Affiliation(s)
- Shyamala Iyer
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Eleanor Casey
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Heather Bouzek
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Moon Kim
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Wenjie Deng
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Brendan B. Larsen
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Hong Zhao
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Roger E. Bumgarner
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Morgane Rolland
- US Military HIV Research Program, WRAIR, Silver Spring, MD, 20910, United States of America
- Henry Jackson Foundation for the Advancement of Military Medicine, Inc., Bethesda, MD, 20817, United States of America
| | - James I. Mullins
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
- Department of Medicine, University of Washington, Seattle, WA, 98195, United States of America
- Department of Laboratory Medicine, Seattle, WA, 98195, United States of America
| |
|
56
|
Pulido-Tamayo S, Sánchez-Rodríguez A, Swings T, Van den Bergh B, Dubey A, Steenackers H, Michiels J, Fostier J, Marchal K. Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations. Nucleic Acids Res 2015; 43:e105. [PMID: 25990729 PMCID: PMC4652744 DOI: 10.1093/nar/gkv478] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 02/20/2015] [Accepted: 04/29/2015] [Indexed: 11/23/2022] Open
Abstract
Clonal populations accumulate mutations over time, resulting in different haplotypes. Deep sequencing of such a population in principle provides information to reconstruct these haplotypes and the frequency at which the haplotypes occur. However, this reconstruction is technically not trivial, especially not in clonal systems with a relatively low mutation frequency. The low number of segregating sites in those systems adds ambiguity to the haplotype phasing and thus obviates the reconstruction of genome-wide haplotypes based on sequence overlap information. Therefore, we present EVORhA, a haplotype reconstruction method that complements phasing information in the non-empty read overlap with the frequency estimations of inferred local haplotypes. As was shown with simulated data, as soon as read lengths and/or mutation rates become restrictive for state-of-the-art methods, the use of this additional frequency information allows EVORhA to still reliably reconstruct genome-wide haplotypes. On real data, we show the applicability of the method in reconstructing the population composition of evolved bacterial populations and in decomposing mixed bacterial infections from clinical samples.
Affiliation(s)
- Sergio Pulido-Tamayo
- Department of Information Technology, Ghent University, iMinds, 9050 Gent, Belgium; Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
| | - Aminael Sánchez-Rodríguez
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium; Departamento de Ciencias Naturales, Universidad Técnica Particular de Loja, San Cayetano Alto S/N, EC1101608 Loja, Ecuador
| | - Toon Swings
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
| | - Bram Van den Bergh
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
| | - Akanksha Dubey
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
| | - Hans Steenackers
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
| | - Jan Michiels
- Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
| | - Jan Fostier
- Department of Information Technology, Ghent University, iMinds, 9050 Gent, Belgium
| | - Kathleen Marchal
- Department of Information Technology, Ghent University, iMinds, 9050 Gent, Belgium; Department of Microbial and Molecular Systems, Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
| |
|
57
|
Ogishi M, Yotsuyanagi H, Tsutsumi T, Gatanaga H, Ode H, Sugiura W, Moriya K, Oka S, Kimura S, Koike K. Deconvoluting the composition of low-frequency hepatitis C viral quasispecies: comparison of genotypes and NS3 resistance-associated variants between HCV/HIV coinfected hemophiliacs and HCV monoinfected patients in Japan. PLoS One 2015; 10:e0119145. [PMID: 25748426 PMCID: PMC4351984 DOI: 10.1371/journal.pone.0119145] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 10/12/2014] [Accepted: 01/09/2015] [Indexed: 12/16/2022] Open
Abstract
Pre-existing low-frequency resistance-associated variants (RAVs) may jeopardize successful sustained virological responses (SVR) to HCV treatment with direct-acting antivirals (DAAs). However, the potential impact of low-frequency (∼0.1%) mutations, concatenated mutations (haplotypes), and their association with genotypes (Gts) on the treatment outcome has not yet been elucidated, most probably owing to the difficulty in detecting pre-existing minor haplotypes with sufficient length and accuracy. Herein, we characterize a methodological framework based on Illumina MiSeq next-generation sequencing (NGS) coupled with bioinformatics of quasispecies reconstruction (QSR) to realize highly accurate variant calling and genotype-haplotype detection. The core-to-NS3 protease coding sequences in 10 HCV monoinfected patients, 5 of whom had a history of blood transfusion, and 11 HCV/HIV coinfected patients with hemophilia, were studied. Simulation experiments showed that, for minor variants constituting more than 1%, our framework achieved a positive predictive value (PPV) of 100% and sensitivities of 91.7–100% for genotyping and 80.6% for RAV screening. Genotyping analysis indicated the prevalence of dominant Gt1a infection in coinfected patients (6/11 vs 0/10, p = 0.01). For clinical samples, minor genotype overlapping infection was prevalent in HCV/HIV coinfected hemophiliacs (10/11) and patients who experienced whole-blood transfusion (4/5) but none in patients without exposure to blood (0/5). As for RAV screening, the Q80K/R and S122K/R variants were particularly prevalent among minor RAVs observed, detected in 12/21 and 6/21 cases, respectively. Q80K was detected only in coinfected patients, whereas Q80R was predominantly detected in monoinfected patients (1/11 vs 7/10, p < 0.01). Multivariate interdependence analysis revealed the previously unrecognized prevalence of Gt1b-Q80K, in HCV/HIV coinfected hemophiliacs [Odds ratio = 13.4 (3.48–51.9), p < 0.01]. 
Our study revealed the distinct characteristics of viral quasispecies between the subgroups specified above and the feasibility of NGS and QSR-based genetic deconvolution of pre-existing minor Gts, RAVs, and their interrelationships.
Affiliation(s)
- Masato Ogishi
- Department of Internal Medicine, Graduate School of Medicine, University of Tokyo, Bunkyo, Tokyo, Japan
| | - Hiroshi Yotsuyanagi
- Department of Internal Medicine, Graduate School of Medicine, University of Tokyo, Bunkyo, Tokyo, Japan
| | - Takeya Tsutsumi
- Department of Internal Medicine, Graduate School of Medicine, University of Tokyo, Bunkyo, Tokyo, Japan
| | - Hiroyuki Gatanaga
- AIDS Clinical Center, National Center for Global Health and Medicine, Shinjuku, Tokyo, Japan
| | - Hirotaka Ode
- Department of Infectious Diseases and Immunology, Clinical Research Center, Nagoya Medical Center, Nagoya, Japan
| | - Wataru Sugiura
- Department of Infectious Diseases and Immunology, Clinical Research Center, Nagoya Medical Center, Nagoya, Japan
| | - Kyoji Moriya
- Department of Internal Medicine, Graduate School of Medicine, University of Tokyo, Bunkyo, Tokyo, Japan
| | - Shinichi Oka
- AIDS Clinical Center, National Center for Global Health and Medicine, Shinjuku, Tokyo, Japan
| | - Satoshi Kimura
- Director, Tokyo Teishin Hospital, Tokyo, Japan; President, Tokyo Health Care University, Tokyo, Japan
| | - Kazuhiko Koike
- Department of Internal Medicine, Graduate School of Medicine, University of Tokyo, Bunkyo, Tokyo, Japan
| |
|
58
|
Verbist B, Clement L, Reumers J, Thys K, Vapirev A, Talloen W, Wetzels Y, Meys J, Aerssens J, Bijnens L, Thas O. ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering. BMC Bioinformatics 2015; 16:59. [PMID: 25887734 PMCID: PMC4369097 DOI: 10.1186/s12859-015-0458-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 07/01/2014] [Accepted: 12/16/2014] [Indexed: 11/10/2022] Open
Abstract
Background Deep-sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores which are derived from a quadruplet of intensities, one channel for each nucleotide type for Illumina sequencing. The highest intensity of the four channels determines the base that is called. Mismatch bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that explores quality scores and second best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets) which enables immediate biological interpretation of the variants with respect to their antiviral drug responses. Results Using mixtures of HCV plasmids we show that our method accurately estimates frequencies down to 0.5%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP-callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has a superb sensitivity and specificity for variants with frequencies above 0.4%. Unlike the competitors, ViVaMBC reports a higher number of false-positive findings with frequencies below 0.4% which might partially originate from picking up artificial variants introduced by errors in the sample and library preparation step. Conclusions ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second best base calls appeared very promising in our data exploration phase, their utility was limited. 
They provided a slight increase in sensitivity, which however does not warrant the additional computational cost of running the offline base caller. Apparently a lot of information is already contained in the quality scores enabling the model based clustering procedure to adjust the majority of the sequencing errors. Overall the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low frequency variant detection. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0458-7) contains supplementary material, which is available to authorized users.
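The link between quality scores and error probabilities that this abstract relies on follows the standard Phred definition, Q = -10 log10(p). The sketch below shows only that conversion and is not ViVaMBC's clustering model. The Phred+33 ASCII offset is an assumption about the FASTQ encoding, and the function names are hypothetical.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer Phred scores.

    Assumption: qualities are encoded with an ASCII offset of 33.
    """
    return [ord(character) - 33 for character in quality_string]


def phred_to_error_probability(quality):
    """Probability that a base call is wrong, from Q = -10 * log10(p)."""
    return 10 ** (-quality / 10)


# Toy quality string: 'I' is Q40, '5' is Q20, '+' is Q10 under Phred+33
qualities = decode_phred33("I5+")
error_probabilities = [phred_to_error_probability(q) for q in qualities]
print(qualities, error_probabilities)
```

A base called at Q20 therefore carries a 1% chance of being an error, which is above the 0.5% variant frequency the abstract reports estimating, so some modelling of these probabilities is needed at that level.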
Affiliation(s)
- Bie Verbist
- Department of Mathematical Modeling, Statistics and Bioinformatics, Ghent University, Coupure Links 653, Gent, 9000, Belgium.
| | - Lieven Clement
- Department of Applied Mathematics, Informatics and Statistics, Ghent University, Krijgslaan 281 S9, Gent, 9000, Belgium.
| | - Joke Reumers
- Janssen R&D, Janssen Pharmaceutical Companies of J&J, Turnhoutseweg 30, Beerse, 2340, Belgium.
| | - Kim Thys
- Janssen R&D, Janssen Pharmaceutical Companies of J&J, Turnhoutseweg 30, Beerse, 2340, Belgium.
| | - Alexander Vapirev
- Janssen R&D, Janssen Pharmaceutical Companies of J&J, Turnhoutseweg 30, Beerse, 2340, Belgium; ExaScience Life Lab, Kapeldreef 75, Leuven, 3001, Belgium.
| | - Willem Talloen
- Janssen R&D, Janssen Pharmaceutical Companies of J&J, Turnhoutseweg 30, Beerse, 2340, Belgium.
| | - Yves Wetzels
- Janssen R&D, Janssen Pharmaceutical Companies of J&J, Turnhoutseweg 30, Beerse, 2340, Belgium.
| | - Joris Meys
- Department of Mathematical Modeling, Statistics and Bioinformatics, Ghent University, Coupure Links 653, Gent, 9000, Belgium.
| | - Jeroen Aerssens
- Janssen R&D, Janssen Pharmaceutical Companies of J&J, Turnhoutseweg 30, Beerse, 2340, Belgium.
| | - Luc Bijnens
- Janssen R&D, Janssen Pharmaceutical Companies of J&J, Turnhoutseweg 30, Beerse, 2340, Belgium.
| | - Olivier Thas
- Department of Mathematical Modeling, Statistics and Bioinformatics, Ghent University, Coupure Links 653, Gent, 9000, Belgium; University of Wollongong, National Institute for Applied Statistics Research Australia (NIASRA), School of Mathematics and Applied Statistics, NSW, 2522, Australia.
| |
|
59
|
Garcia V, Regoes RR. The Effect of Interference on the CD8(+) T Cell Escape Rates in HIV. Front Immunol 2015; 5:661. [PMID: 25628620 PMCID: PMC4292734 DOI: 10.3389/fimmu.2014.00661] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 10/08/2014] [Accepted: 12/09/2014] [Indexed: 12/15/2022] Open
Abstract
In early human immunodeficiency virus (HIV) infection, the virus population escapes from multiple CD8+ cell responses. The later an escape mutation emerges, the slower it outgrows its competition, i.e., the escape rate is lower. This pattern could indicate that the strength of the CD8+ cell responses is waning, or that later viral escape mutants carry a larger fitness cost. In this paper, we investigate whether the pattern of decreasing escape rates could also be caused by genetic interference among different escape strains. To this end, we developed a mathematical multi-epitope model of HIV dynamics, which incorporates stochastic effects, recombination, and mutation. We used cumulative linkage disequilibrium measures to quantify the amount of interference. We found that nearly synchronous, similarly strong immune responses in two-locus systems enhance the generation of genetic interference. This effect, combined with a scheme of densely spaced sampling times at the beginning of infection and sparse sampling times later, leads to decreasing successive escape rate estimates, even when there were no selection differences among alleles. These predictions are supported by empirical data from one HIV-infected patient. Thus, interference could explain why later escapes are slower. Considering escape mutations in isolation, neglecting their genetic linkage, conceals the underlying haplotype dynamics and can affect the estimation of the selective pressure exerted by CD8+ cells. In systems in which multiple escape mutations appear, the occurrence of interference dynamics should be assessed by measuring the linkage between different escape mutations.
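Two quantities mentioned in this abstract can be made concrete with a short sketch. It assumes the commonly used logistic model of escape, in which the escape rate is the change in the log-odds of the mutant frequency per unit time, and it uses the classical two-locus coefficient D = p_AB - p_A p_B in place of the cumulative linkage disequilibrium measures used in the paper. Function names and numbers are hypothetical.

```python
import math


def escape_rate(freq_early, freq_late, t_early, t_late):
    """Escape rate from two frequency measurements under logistic growth.

    Assumption: the escape mutant frequency follows a logistic curve,
    so the rate is the slope of the log-odds over time.
    """
    def logit(frequency):
        return math.log(frequency / (1.0 - frequency))

    return (logit(freq_late) - logit(freq_early)) / (t_late - t_early)


def linkage_disequilibrium(haplotype_freqs):
    """Classical D = p_AB - p_A * p_B for two biallelic loci.

    haplotype_freqs: dict with keys 'ab', 'Ab', 'aB', 'AB', where
    capital letters denote the escape allele at each locus.
    """
    p_ab_both = haplotype_freqs["AB"]
    p_a = haplotype_freqs["AB"] + haplotype_freqs["Ab"]
    p_b = haplotype_freqs["AB"] + haplotype_freqs["aB"]
    return p_ab_both - p_a * p_b


# Toy data: mutant rises from 10% at day 10 to 90% at day 30
rate = escape_rate(0.1, 0.9, 10, 30)

# Toy data: two escape mutations that never occur on the same genome
d_value = linkage_disequilibrium({"ab": 0.4, "Ab": 0.3, "aB": 0.3, "AB": 0.0})
print(rate, d_value)
```

A negative D, as in the toy haplotype frequencies, means the two escape mutations sit on competing genomes, the situation in which single-mutation escape rates can mislead.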
Affiliation(s)
- Victor Garcia
- Institute of Integrative Biology, Department of Environmental Systems Science, ETH Zürich, Zurich, Switzerland
| | - Roland Robert Regoes
- Institute of Integrative Biology, Department of Environmental Systems Science, ETH Zürich, Zurich, Switzerland
| |
|
60
|
Seifert D, Beerenwinkel N. Estimating Fitness of Viral Quasispecies from Next-Generation Sequencing Data. Curr Top Microbiol Immunol 2015; 392:181-200. [PMID: 26318139 DOI: 10.1007/82_2015_462] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 02/06/2023]
Abstract
The quasispecies model is ubiquitous in the study of viruses. While having led to a number of insights that have stood the test of time, the quasispecies model has mostly been discussed in a theoretical fashion with little support from data. With next-generation sequencing (NGS), this situation is changing and a wealth of data can now be produced in a time- and cost-efficient manner. NGS can, after removal of technical errors, yield an exceedingly detailed picture of the viral population structure. The widespread availability of cross-sectional data can be used to study fitness landscapes of viral populations in the quasispecies model. This chapter highlights methods that estimate the strength of selection in selective sweeps, assess marginal fitness effects of quasispecies, and finally infer the fitness landscape of a viral quasispecies, all on the basis of NGS data.
|
61
|
Bioinformatics tools for the investigation of viral evolution and molecular epidemiology. Infect Genet Evol 2014; 28:349-50. [PMID: 25471675 DOI: 10.1016/j.meegid.2014.11.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
|
62
|
Jayasundara D, Saeed I, Maheswararajah S, Chang B, Tang SL, Halgamuge SK. ViQuaS: an improved reconstruction pipeline for viral quasispecies spectra generated by next-generation sequencing. Bioinformatics 2014; 31:886-96. [DOI: 10.1093/bioinformatics/btu754] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Indexed: 12/16/2022] Open
|
63
|
Sequencing pools of individuals — mining genome-wide polymorphism data without big funding. Nat Rev Genet 2014; 15:749-63. [DOI: 10.1038/nrg3803] [Citation(s) in RCA: 512] [Impact Index Per Article: 46.5] [Indexed: 12/16/2022]
|
64
|
Pandit A, de Boer RJ. Reliable reconstruction of HIV-1 whole genome haplotypes reveals clonal interference and genetic hitchhiking among immune escape variants. Retrovirology 2014; 11:56. [PMID: 24996694 PMCID: PMC4227095 DOI: 10.1186/1742-4690-11-56] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Received: 02/12/2014] [Accepted: 06/24/2014] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Following transmission, HIV-1 evolves into a diverse population, and next generation sequencing enables us to detect variants occurring at low frequencies. Studying viral evolution at the level of whole genomes was hitherto not possible because next generation sequencing delivers relatively short reads. RESULTS We here provide a proof of principle that whole HIV-1 genomes can be reliably reconstructed from short reads, and use this to study the selection of immune escape mutations at the level of whole genome haplotypes. Using realistically simulated HIV-1 populations, we demonstrate that reconstruction of complete genome haplotypes is feasible with high fidelity. We do not reconstruct all genetically distinct genomes, but each reconstructed haplotype represents one or more of the quasispecies in the HIV-1 population. We then reconstruct 30 whole genome haplotypes from published short sequence reads sampled longitudinally from a single HIV-1 infected patient. We confirm the reliability of the reconstruction by validating our predicted haplotype genes with single genome amplification sequences, and by comparing haplotype frequencies with observed epitope escape frequencies. CONCLUSIONS Phylogenetic analysis shows that the HIV-1 population undergoes selection driven evolution, with successive replacement of the viral population by novel dominant strains. We demonstrate that immune escape mutants evolve in a dependent manner with various mutations hitchhiking along with others. As a consequence of this clonal interference, selection coefficients have to be estimated for complete haplotypes and not for individual immune escapes.
Affiliation(s)
- Aridaman Pandit
- Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
| | - Rob J de Boer
- Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
| |
|
65
|
Shao W, Kearney MF, Boltz VF, Spindler JE, Mellors JW, Maldarelli F, Coffin JM. PAPNC, a novel method to calculate nucleotide diversity from large scale next generation sequencing data. J Virol Methods 2014; 203:73-80. [PMID: 24681054 PMCID: PMC4104926 DOI: 10.1016/j.jviromet.2014.03.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 01/10/2014] [Revised: 03/10/2014] [Accepted: 03/11/2014] [Indexed: 02/06/2023]
Abstract
Estimating viral diversity in infected patients can provide insight into pathogen evolution and emergence of drug resistance. With the widespread adoption of deep sequencing, it is important to develop tools to accurately calculate population diversity from very large datasets. Current methods for estimating diversity that are based on multiple alignments are not practical to apply to such data. In this study, the authors report a novel method (Pairwise Alignment Positional Nucleotide Counting, PAPNC) for estimating population diversity from 454 sequence data. The diversity measurements determined using this method were comparable to those calculated by average pairwise difference (APD) of multiply aligned sequences using MEGA5. Diversities were estimated for 9 patient plasma HIV samples sequenced with Titanium 454 technology and by single-genome sequencing (SGS). Diversities calculated from deep sequencing using PAPNC ranged from 0.002 to 0.021, while APD measurements calculated from SGS data ranged approximately from 0.001 to 0.018, with the difference being attributable to PCR error (contributing background diversity of 0.0016 in a control sample). Comparison of APDs estimated from 100 sets of sequences drawn at random from 454-generated data and from corresponding SGS data showed very close correlation between the two methods, with an R² of 0.96, and differing on average by about 1% (after correction for PCR error). The authors have developed a novel method that is well suited to calculating genetic diversity for large-scale datasets from next generation sequencing. It can be implemented easily as a function in available variation calling programs like SAMtools or haplotype reconstruction software for nucleotide genetic diversity calculation. A Perl script implementing this method is available upon request.
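The general idea of computing diversity from per-position nucleotide counts, rather than from all pairwise comparisons, can be illustrated with a short Python sketch. This is not the authors' Perl script: the function names, the treatment of `-` and `N` as missing data, and the requirement that reads are already aligned and of equal length are all assumptions made for illustration.

```python
from collections import Counter


def positional_diversity(reads):
    """Average per-site pairwise difference, computed from base counts.

    `reads` is a list of equal-length, already-aligned strings.
    Characters other than A/C/G/T are treated as missing data
    (an assumption of this sketch).
    """
    if not reads:
        raise ValueError("no reads given")
    total, sites = 0.0, 0
    for pos in range(len(reads[0])):
        counts = Counter(r[pos] for r in reads if r[pos] in "ACGT")
        n = sum(counts.values())
        if n < 2:
            continue  # at least two observations are needed to form a pair
        # Ordered pairs of reads that carry the same base at this site.
        same_pairs = sum(c * (c - 1) for c in counts.values())
        total += 1.0 - same_pairs / (n * (n - 1))
        sites += 1
    return total / sites if sites else 0.0


def average_pairwise_difference(reads):
    """Brute-force APD (p-distance) over all read pairs, for comparison."""
    diffs, pairs = 0.0, 0
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            mismatches = sum(a != b for a, b in zip(reads[i], reads[j]))
            diffs += mismatches / len(reads[i])
            pairs += 1
    return diffs / pairs
```

When every read covers every site, the two functions agree, but the positional version scales linearly with the number of reads instead of quadratically, which is the practical motivation for counting by position on very large datasets.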
Affiliation(s)
- Wei Shao
- Advanced Biomedical Computing Center, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, Frederick, MD, United States
- Mary F Kearney
- HIV Drug Resistance Program, NCI, Frederick, MD, United States
- Valerie F Boltz
- HIV Drug Resistance Program, NCI, Frederick, MD, United States
- John W Mellors
- Division of Infectious Diseases, University of Pittsburgh, Pittsburgh, PA, United States
- John M Coffin
- Department of Molecular Biology and Microbiology, Tufts University, Boston, MA, United States
|
66
|
Mangul S, Wu NC, Mancuso N, Zelikovsky A, Sun R, Eskin E. Accurate viral population assembly from ultra-deep sequencing data. Bioinformatics 2014; 30:i329-37. [PMID: 24932001 PMCID: PMC4058922 DOI: 10.1093/bioinformatics/btu295] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.1] [Indexed: 12/17/2022]
Abstract
MOTIVATION Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors. RESULTS In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation-maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads. AVAILABILITY Our tool VGA is freely available at http://genetics.cs.ucla.edu/vga/
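The abstract mentions expectation-maximization for abundance estimation without giving details. A minimal generic sketch of that kind of estimator, assuming each read is simply marked as compatible or incompatible with each assembled variant, might look as follows. The function name `em_abundances`, its parameters, and the simplified read model (no variant lengths, coverage, or error rates) are illustrative assumptions and are not taken from VGA.

```python
def em_abundances(compat, n_variants, n_iter=200, tol=1e-10):
    """Estimate variant frequencies by EM.

    compat[r] is the set of variant indices that read r is compatible
    with. Each read is assumed to match at least one variant.
    """
    freqs = [1.0 / n_variants] * n_variants
    for _ in range(n_iter):
        # E-step: split each read among its compatible variants in
        # proportion to the current frequency estimates.
        expected = [0.0] * n_variants
        for variants in compat:
            weight = sum(freqs[v] for v in variants)
            if weight == 0.0:
                continue  # read matches nothing with nonzero frequency
            for v in variants:
                expected[v] += freqs[v] / weight
        # M-step: renormalise the expected read counts.
        total = sum(expected)
        new_freqs = [e / total for e in expected]
        converged = max(abs(a - b) for a, b in zip(freqs, new_freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs
```

For example, with six reads unique to variant 0, two unique to variant 1, and two reads compatible with both, the estimate settles at frequencies of 0.75 and 0.25, because the ambiguous reads carry no information about the ratio.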
Affiliation(s)
- Serghei Mangul
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
- Nicholas C Wu
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
- Nicholas Mancuso
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
- Alex Zelikovsky
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
- Ren Sun
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
- Eleazar Eskin
- Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
|
67
|
HIV-1 quasispecies delineation by tag linkage deep sequencing. PLoS One 2014; 9:e97505. [PMID: 24842159 PMCID: PMC4026136 DOI: 10.1371/journal.pone.0097505] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 01/20/2014] [Accepted: 04/17/2014] [Indexed: 12/16/2022]
Abstract
Trade-offs between throughput, read length, and error rates in high-throughput sequencing limit certain applications such as monitoring viral quasispecies. Here, we describe a molecular-based tag linkage method that allows assemblage of short sequence reads into long DNA fragments. It enables haplotype phasing with high accuracy and sensitivity to interrogate individual viral sequences in a quasispecies. This approach is demonstrated to deduce ∼2000 unique 1.3 kb viral sequences from HIV-1 quasispecies in vivo and after passaging ex vivo with a detection limit of ∼0.005% to ∼0.001%. Reproducibility of the method is validated quantitatively and qualitatively by a technical replicate. This approach can improve monitoring of the genetic architecture and evolution dynamics in any quasispecies population.
|
68
|
Gregori J, Salicrú M, Domingo E, Sanchez A, Esteban JI, Rodríguez-Frías F, Quer J. Inference with viral quasispecies diversity indices: clonal and NGS approaches. Bioinformatics 2014; 30:1104-1111. [PMID: 24389655 DOI: 10.1093/bioinformatics/btt768] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.3] [Received: 09/05/2013] [Accepted: 12/25/2013] [Indexed: 02/07/2023]
Abstract
Given the inherent dynamics of a viral quasispecies, we are often interested in the comparison of diversity indices of sequential samples of a patient, or in the comparison of diversity indices of virus in groups of patients in a treated versus control design. It is then important to make sure that the diversity measures from each sample may be compared with no bias and within a consistent statistical framework. In the present report, we review some indices often used as measures for viral quasispecies complexity and provide means for statistical inference, applying procedures taken from the ecology field. In particular, we examine the Shannon entropy and the mutation frequency, and we discuss the appropriateness of different normalization methods of the Shannon entropy found in the literature. By taking amplicon ultra-deep pyrosequencing (UDPS) raw data as a surrogate of a real hepatitis C virus viral population, we study through in-silico sampling the statistical properties of these indices under two methods of viral quasispecies sampling, classical cloning followed by Sanger sequencing (CCSS) and next-generation sequencing (NGS) such as UDPS. We propose solutions specific to each of the two sampling methods (CCSS and NGS) to guarantee statistically conforming conclusions as free of bias as possible. CONTACT josep.gregori@gmail.com SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
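As a rough illustration of the two indices named in the abstract, the following Python sketch computes the Shannon entropy of a haplotype distribution, two normalizations of it that are commonly seen in the literature (by the log of the number of haplotypes, or by the log of the number of reads), and a mutation frequency relative to the most abundant haplotype. These definitions and function names are assumptions made for illustration and may differ in detail from those evaluated in the paper.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (natural log) of haplotype read counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def normalized_entropy(counts, by="haplotypes"):
    """Entropy divided by ln(number of haplotypes) or ln(number of reads)."""
    if by == "haplotypes":
        k = sum(1 for c in counts if c > 0)
    else:
        k = sum(counts)
    return shannon_entropy(counts) / math.log(k) if k > 1 else 0.0


def mutation_frequency(haplotypes):
    """Mutations per sequenced nucleotide relative to the dominant haplotype.

    `haplotypes` maps an aligned, equal-length sequence to its read count.
    """
    master = max(haplotypes, key=haplotypes.get)
    total_reads = sum(haplotypes.values())
    mutations = sum(
        count * sum(a != b for a, b in zip(seq, master))
        for seq, count in haplotypes.items()
    )
    return mutations / (total_reads * len(master))
```

The choice of normalization matters when samples differ in depth: dividing by the log of the read count ties the index to sample size, which is one reason comparisons between cloning and NGS samples need care.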
Affiliation(s)
- Josep Gregori
- Liver Unit, Internal Medicine Lab Malalties Hepàtiques, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035 Barcelona, Spain, Roche Diagnostics SL, 08174, Sant Cugat del Vallès, Spain, Statistics Department, Biology Faculty, Barcelona University, 08028, Barcelona, Spain, CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Campus de Cantoblanco, 28049, Madrid, Spain, Bioinformatics and Statistics Unit, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035, Barcelona, Spain, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain and Biochemistry Unit. Virology Unit/Microbiology Department, HUVH, 08035 Barcelona, Spain
- Miquel Salicrú
- Liver Unit, Internal Medicine Lab Malalties Hepàtiques, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035 Barcelona, Spain, Roche Diagnostics SL, 08174, Sant Cugat del Vallès, Spain, Statistics Department, Biology Faculty, Barcelona University, 08028, Barcelona, Spain, CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Campus de Cantoblanco, 28049, Madrid, Spain, Bioinformatics and Statistics Unit, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035, Barcelona, Spain, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain and Biochemistry Unit. Virology Unit/Microbiology Department, HUVH, 08035 Barcelona, Spain
- Esteban Domingo
- Liver Unit, Internal Medicine Lab Malalties Hepàtiques, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035 Barcelona, Spain, Roche Diagnostics SL, 08174, Sant Cugat del Vallès, Spain, Statistics Department, Biology Faculty, Barcelona University, 08028, Barcelona, Spain, CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Campus de Cantoblanco, 28049, Madrid, Spain, Bioinformatics and Statistics Unit, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035, Barcelona, Spain, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain and Biochemistry Unit. Virology Unit/Microbiology Department, HUVH, 08035 Barcelona, Spain
- Alex Sanchez
- Liver Unit, Internal Medicine Lab Malalties Hepàtiques, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035 Barcelona, Spain, Roche Diagnostics SL, 08174, Sant Cugat del Vallès, Spain, Statistics Department, Biology Faculty, Barcelona University, 08028, Barcelona, Spain, CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Campus de Cantoblanco, 28049, Madrid, Spain, Bioinformatics and Statistics Unit, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035, Barcelona, Spain, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain and Biochemistry Unit. Virology Unit/Microbiology Department, HUVH, 08035 Barcelona, Spain
- Juan I Esteban
- Liver Unit, Internal Medicine Lab Malalties Hepàtiques, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035 Barcelona, Spain, Roche Diagnostics SL, 08174, Sant Cugat del Vallès, Spain, Statistics Department, Biology Faculty, Barcelona University, 08028, Barcelona, Spain, CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Campus de Cantoblanco, 28049, Madrid, Spain, Bioinformatics and Statistics Unit, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035, Barcelona, Spain, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain and Biochemistry Unit. Virology Unit/Microbiology Department, HUVH, 08035 Barcelona, Spain
- Francisco Rodríguez-Frías
- Liver Unit, Internal Medicine Lab Malalties Hepàtiques, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035 Barcelona, Spain, Roche Diagnostics SL, 08174, Sant Cugat del Vallès, Spain, Statistics Department, Biology Faculty, Barcelona University, 08028, Barcelona, Spain, CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Campus de Cantoblanco, 28049, Madrid, Spain, Bioinformatics and Statistics Unit, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035, Barcelona, Spain, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain and Biochemistry Unit. Virology Unit/Microbiology Department, HUVH, 08035 Barcelona, Spain
- Josep Quer
- Liver Unit, Internal Medicine Lab Malalties Hepàtiques, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035 Barcelona, Spain, Roche Diagnostics SL, 08174, Sant Cugat del Vallès, Spain, Statistics Department, Biology Faculty, Barcelona University, 08028, Barcelona, Spain, CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, 28029 Madrid, Spain, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Campus de Cantoblanco, 28049, Madrid, Spain, Bioinformatics and Statistics Unit, Vall d'Hebron Institut Recerca (VHIR-HUVH), 08035, Barcelona, Spain, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain and Biochemistry Unit. Virology Unit/Microbiology Department, HUVH, 08035 Barcelona, Spain
|
69
|
Töpfer A, Marschall T, Bull RA, Luciani F, Schönhuth A, Beerenwinkel N. Viral quasispecies assembly via maximal clique enumeration. PLoS Comput Biol 2014; 10:e1003515. [PMID: 24675810 PMCID: PMC3967922 DOI: 10.1371/journal.pcbi.1003515] [Citation(s) in RCA: 76] [Impact Index Per Article: 6.9] [Received: 10/28/2013] [Accepted: 01/31/2014] [Indexed: 11/25/2022]
Abstract
Virus populations can display high genetic diversity within individual hosts. The intra-host collection of viral haplotypes, called viral quasispecies, is an important determinant of virulence, pathogenesis, and treatment outcome. We present HaploClique, a computational approach to reconstruct the structure of a viral quasispecies from next-generation sequencing data as obtained from bulk sequencing of mixed virus samples. We develop a statistical model for paired-end reads accounting for mutations, insertions, and deletions. Using an iterative maximal clique enumeration approach, read pairs are assembled into haplotypes of increasing length, eventually enabling global haplotype assembly. The performance of our quasispecies assembly method is assessed on simulated data for varying population characteristics and sequencing technology parameters. Owing to its paired-end handling, HaploClique compares favorably to state-of-the-art haplotype inference methods. It can reconstruct error-free full-length haplotypes from low coverage samples and detect large insertions and deletions at low frequencies. We applied HaploClique to sequencing data derived from a clinical hepatitis C virus population of an infected patient and discovered a novel deletion of length 357±167 bp that was validated by two independent long-read sequencing experiments. HaploClique is available at https://github.com/armintoepfer/haploclique. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2-5.
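To make the notion of maximal clique enumeration over reads concrete, here is a toy Python sketch. It builds a graph in which two reads are connected when they overlap and agree exactly on the overlap, then lists maximal cliques with the basic Bron-Kerbosch recursion. HaploClique itself uses a statistical model for paired-end reads with mutations and indels and an iterative scheme, so the exact-match edge rule, the `min_overlap` parameter, and the function names here are simplifying assumptions rather than the published method.

```python
def compatible(a, b, min_overlap=2):
    """a, b are (start, sequence) tuples on a shared coordinate system.

    Returns True if the reads overlap by at least `min_overlap` bases
    and are identical on the overlap (a toy stand-in for a statistical
    compatibility test).
    """
    (sa, qa), (sb, qb) = a, b
    lo = max(sa, sb)
    hi = min(sa + len(qa), sb + len(qb))
    if hi - lo < min_overlap:
        return False
    return qa[lo - sa:hi - sa] == qb[lo - sb:hi - sb]


def maximal_cliques(reads, min_overlap=2):
    """Enumerate maximal cliques of the read compatibility graph."""
    n = len(reads)
    adj = {
        i: {j for j in range(n)
            if j != i and compatible(reads[i], reads[j], min_overlap)}
        for i in range(n)
    }
    cliques = []

    def bron_kerbosch(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            bron_kerbosch(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    bron_kerbosch(set(), set(range(n)), set())
    return sorted(cliques)
```

Each clique groups reads that are mutually consistent and could therefore be merged into a longer fragment of a single haplotype. Repeating the step on the merged fragments is what allows haplotypes to grow in length.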
Affiliation(s)
- Armin Töpfer
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Rowena A. Bull
- Inflammation and Infection Research Centre, School of Medical Sciences, UNSW, Sydney, Australia
- Fabio Luciani
- Inflammation and Infection Research Centre, School of Medical Sciences, UNSW, Sydney, Australia
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
|
70
|
Epstein-Barr virus latent membrane protein 1 genetic variability in peripheral blood B cells and oropharyngeal fluids. J Virol 2014; 88:3744-55. [PMID: 24429365 DOI: 10.1128/jvi.03378-13] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 12/30/2022]
Abstract
We report the diversity of latent membrane protein 1 (LMP1) gene founder sequences and the level of Epstein-Barr virus (EBV) genome variability over time and across anatomic compartments by using virus genomes amplified directly from oropharyngeal wash specimens and peripheral blood B cells during acute infection and convalescence. The intrahost nucleotide variability of the founder virus was 0.02% across the region sequenced, and diversity increased significantly over time in the oropharyngeal compartment (P = 0.004). The LMP1 region showing the greatest level of variability in both compartments, and over time, was concentrated within the functional carboxyl-terminal activating regions 2 and 3 (CTAR2 and CTAR3). Interestingly, a deletion in a proline-rich repeat region (amino acids 274 to 289) of EBV commonly reported in EBV sequenced from cancer specimens was not observed in acute infectious mononucleosis (AIM) patients. Taken together, these data highlight the diversity in circulating EBV genomes and its potential importance in disease pathogenesis and vaccine design. IMPORTANCE This study is among the first to leverage an improved high-throughput deep-sequencing methodology to investigate directly from patient samples the degree of diversity in Epstein-Barr virus (EBV) populations and the extent to which viral genome diversity develops over time in the infected host. Significant variability of circulating EBV latent membrane protein 1 (LMP1) gene sequences was observed between cellular and oral wash samples, and this variability increased over time in oral wash samples. The significance of EBV genetic diversity in transmission and disease pathogenesis is discussed.
|
71
|
Prabhakaran S, Rey M, Zagordi O, Beerenwinkel N, Roth V. HIV Haplotype Inference Using a Propagating Dirichlet Process Mixture Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:182-191. [PMID: 26355517 DOI: 10.1109/tcbb.2013.145] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.9] [Indexed: 06/05/2023]
Abstract
This paper presents a new computational technique for the identification of HIV haplotypes. HIV tends to generate many potentially drug-resistant mutants within the HIV-infected patient and being able to identify these different mutants is important for efficient drug administration. With the view of identifying the mutants, we aim at analyzing short deep sequencing data called reads. From a statistical perspective, the analysis of such data can be regarded as a nonstandard clustering problem due to missing pairwise similarity measures between non-overlapping reads. To overcome this problem we propagate a Dirichlet Process Mixture Model by sequentially updating the prior information from successive local analyses. The model is verified using both simulated and real sequencing data.
|
72
|
Gregori J, Esteban JI, Cubero M, Garcia-Cehic D, Perales C, Casillas R, Alvarez-Tejado M, Rodríguez-Frías F, Guardia J, Domingo E, Quer J. Ultra-deep pyrosequencing (UDPS) data treatment to study amplicon HCV minor variants. PLoS One 2013; 8:e83361. [PMID: 24391758 PMCID: PMC3877031 DOI: 10.1371/journal.pone.0083361] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Received: 09/26/2013] [Accepted: 11/08/2013] [Indexed: 02/07/2023]
Abstract
We have investigated the reliability and reproducibility of HCV viral quasispecies quantification by ultra-deep pyrosequencing (UDPS) methods. Our study was divided into two parts. First, by UDPS sequencing of clone-mix samples we established the global noise level of UDPS and fine-tuned a data treatment workflow previously optimized for HBV sequence analysis. Second, we studied the reproducibility of the methodology by comparing 5 amplicons from two patient samples on three massive sequencing platforms (FLX+, FLX and Junior) after applying the error filters developed from the clonal/control study. After noise filtering the UDPS results, the three replicates showed the same 12 polymorphic sites above 0.7%, with a mean CV of 4.86%. Two polymorphic sites below 0.6% were identified by two replicates and one replicate, respectively. A total of 25, 23 and 26 haplotypes were detected by GS-Junior, GS-FLX and GS-FLX+. The observed CVs for the normalized Shannon entropy (Sn), the mutation frequency (Mf), and the nucleotide diversity (Pi) were 1.46%, 3.96% and 3.78%. The mean absolute differences in the two patients (5 amplicons each), in the GS-FLX and GS-FLX+, were 1.46%, 3.96% and 3.78% for Sn, Mf and Pi. No false polymorphic site was observed above 0.5%. Our results indicate that UDPS is an optimal alternative to molecular cloning for quantitative study of HCV viral quasispecies populations, both in complexity and composition. We propose a UDPS data treatment workflow for amplicons from RNA viral quasispecies which, at a sequencing depth of at least 10,000 reads per strand, yields (i) sequences and frequencies of consensus haplotypes above 0.5% abundance with no erroneous mutations, with high confidence; (ii) resistant mutants as minor variants at the level of 1%, with high confidence that variants are not missed; and (iii) highly confident measures of quasispecies complexity.
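The reproducibility figures above are reported as coefficients of variation (CV) across replicates. A minimal sketch of how a per-site CV and a mean CV could be computed is given below. The use of the sample standard deviation, the data layout, and the function names are assumptions for illustration, since the abstract does not state how the CVs were calculated.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * statistics.stdev(values) / mean


def mean_cv(site_frequencies):
    """Mean CV over sites.

    `site_frequencies` maps a polymorphic site to the list of variant
    frequencies observed for it, one value per replicate or platform.
    """
    return statistics.mean(
        [coefficient_of_variation(f) for f in site_frequencies.values()]
    )
```

For instance, a site observed at frequencies 0.9, 1.0 and 1.1 across three replicates has a CV of 10% under this definition.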
Affiliation(s)
- Josep Gregori
- Liver Unit, Internal Medicine, Lab. Malalties Hepàtiques, Vall d'Hebron Institut Recerca-Hospital Universitari Vall d'Hebron (VHIR-HUVH), Barcelona, Spain
- Roche Diagnostics SL, Sant Cugat del Vallès, Spain
- Juan I. Esteban
- Liver Unit, Internal Medicine, Lab. Malalties Hepàtiques, Vall d'Hebron Institut Recerca-Hospital Universitari Vall d'Hebron (VHIR-HUVH), Barcelona, Spain
- CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, Madrid, Spain
- Universitat Autònoma de Barcelona, Bellaterra, Spain
- María Cubero
- Liver Unit, Internal Medicine, Lab. Malalties Hepàtiques, Vall d'Hebron Institut Recerca-Hospital Universitari Vall d'Hebron (VHIR-HUVH), Barcelona, Spain
- CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, Madrid, Spain
- Damir Garcia-Cehic
- Liver Unit, Internal Medicine, Lab. Malalties Hepàtiques, Vall d'Hebron Institut Recerca-Hospital Universitari Vall d'Hebron (VHIR-HUVH), Barcelona, Spain
- CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, Madrid, Spain
- Celia Perales
- CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, Madrid, Spain
- Centro de Biología Molecular Severo Ochoa (CBM), UAM, Madrid, Spain
- Rosario Casillas
- Liver Unit, Internal Medicine, Lab. Malalties Hepàtiques, Vall d'Hebron Institut Recerca-Hospital Universitari Vall d'Hebron (VHIR-HUVH), Barcelona, Spain
- Biochemistry Unit, HUVH, Barcelona, Spain
- Francisco Rodríguez-Frías
- CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, Madrid, Spain
- Universitat Autònoma de Barcelona, Bellaterra, Spain
- Biochemistry Unit, HUVH, Barcelona, Spain
- Jaume Guardia
- Liver Unit, Internal Medicine, Lab. Malalties Hepàtiques, Vall d'Hebron Institut Recerca-Hospital Universitari Vall d'Hebron (VHIR-HUVH), Barcelona, Spain
- CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, Madrid, Spain
- Universitat Autònoma de Barcelona, Bellaterra, Spain
- Esteban Domingo
- CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, Madrid, Spain
- Centro de Biología Molecular Severo Ochoa (CBM), UAM, Madrid, Spain
- Josep Quer
- Liver Unit, Internal Medicine, Lab. Malalties Hepàtiques, Vall d'Hebron Institut Recerca-Hospital Universitari Vall d'Hebron (VHIR-HUVH), Barcelona, Spain
- CIBER de Enfermedades Hepáticas y Digestivas (CIBERehd) del Instituto de Salud Carlos III, Madrid, Spain
- Universitat Autònoma de Barcelona, Bellaterra, Spain
|
73
|
Poh WT, Xia E, Chin-Inmanu K, Wong LP, Cheng AY, Malasit P, Suriyaphol P, Teo YY, Ong RTH. Viral quasispecies inference from 454 pyrosequencing. BMC Bioinformatics 2013; 14:355. [PMID: 24308284 PMCID: PMC4234478 DOI: 10.1186/1471-2105-14-355] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 05/19/2013] [Accepted: 11/15/2013] [Indexed: 02/05/2023]
Abstract
Background Many potentially life-threatening infectious viruses are highly mutable in nature. Characterizing the fittest variants within a quasispecies from infected patients is expected to allow unprecedented opportunities to investigate the relationship between quasispecies diversity and disease epidemiology. The advent of next-generation sequencing technologies has allowed the study of virus diversity with high-throughput sequencing, although these methods come with higher rates of errors which can artificially increase diversity. Results Here we introduce a novel computational approach that incorporates base quality scores from next-generation sequencers for reconstructing viral genome sequences that simultaneously infers the number of variants within a quasispecies that are present. Comparisons on simulated and clinical data on dengue virus suggest that the novel approach provides a more accurate inference of the underlying number of variants within the quasispecies, which is vital for clinical efforts in mapping the within-host viral diversity. Sequence alignments generated by our approach are also found to exhibit lower rates of error. Conclusions The ability to infer the viral quasispecies colony that is present within a human host provides the potential for a more accurate classification of the viral phenotype. Understanding the genomics of viruses will be relevant not just to studying how to control or even eradicate these viral infectious diseases, but also in learning about the innate protection in the human host against the viruses.
Affiliation(s)
- Yik-Ying Teo
- Saw Swee Hock School of Public Health, National University of Singapore, Singapore, Singapore.
|
74
|
Aita T, Ichihashi N, Yomo T. Probabilistic model based error correction in a set of various mutant sequences analyzed by next-generation sequencing. Comput Biol Chem 2013; 47:221-30. [PMID: 24184706 DOI: 10.1016/j.compbiolchem.2013.09.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/30/2013] [Revised: 09/13/2013] [Accepted: 09/27/2013] [Indexed: 01/14/2023]
Abstract
To analyze the evolutionary dynamics of a mutant population in an evolutionary experiment, it is necessary to sequence a vast number of mutants by high-throughput (next-generation) sequencing technologies, which enable rapid and parallel analysis of multikilobase sequences. However, the observed sequences include many errors of base call. Therefore, if next-generation sequencing is applied to analysis of a heterogeneous population of various mutant sequences, it is necessary to discriminate between true bases as point mutations and errors of base call in the observed sequences, and to subject the sequences to error-correction processes. To address this issue, we have developed a novel method of error correction based on the Potts model and a maximum a posteriori probability (MAP) estimate of its parameters corresponding to the "true sequences". Our method of error correction utilizes (1) the "quality scores" which are assigned to individual bases in the observed sequences and (2) the neighborhood relationship among the observed sequences mapped in sequence space. The computer experiments of error correction of artificially generated sequences supported the effectiveness of our method, showing that 50-90% of errors were removed. Interestingly, this method is analogous to a probabilistic model based method of image restoration developed in the field of information engineering.
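The role of quality scores in a maximum a posteriori (MAP) estimate can be shown with a much simpler per-site sketch. This is not the Potts-model method of the abstract: it ignores the neighborhood relationship among sequences and assumes independent base-call errors, a uniform prior, and the standard Phred relation between a quality score Q and an error probability, P = 10^(-Q/10). The function names are hypothetical.

```python
import math

BASES = "ACGT"


def phred_to_error_prob(q):
    """Standard Phred relation: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def map_base(observations, prior=None):
    """MAP base call at one site from (base, phred_quality) observations.

    Assumptions: errors are independent between reads, and a miscall is
    equally likely to produce any of the three other bases.
    """
    prior = prior or {b: 0.25 for b in BASES}
    log_post = {}
    for candidate in BASES:
        lp = math.log(prior[candidate])
        for base, q in observations:
            # Clamp so that Q = 0 behaves like a random base, not log(0).
            e = min(max(phred_to_error_prob(q), 1e-10), 0.75)
            lp += math.log(1 - e) if base == candidate else math.log(e / 3)
        log_post[candidate] = lp
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    z = sum(math.exp(v - top) for v in log_post.values())
    posterior = {b: math.exp(v - top) / z for b, v in log_post.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior
```

With two high-quality `A` calls and one low-quality `C` call at a site, the posterior mass concentrates on `A`, which is the basic effect a quality-aware corrector exploits.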
Affiliation(s)
- Takuyo Aita
- Exploratory Research for Advanced Technology, Japan Science and Technology Agency, Yamadaoka 1-5, Suita, Osaka, Japan
|
75
|
Yang X, Charlebois P, Macalalad A, Henn MR, Zody MC. V-Phaser 2: variant inference for viral populations. BMC Genomics 2013; 14:674. [PMID: 24088188 PMCID: PMC3907024 DOI: 10.1186/1471-2164-14-674] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.8] [Received: 05/03/2013] [Accepted: 09/26/2013] [Indexed: 11/14/2022]
Abstract
Background Massively parallel sequencing offers the possibility of revolutionizing the study of viral populations by providing ultra deep sequencing (tens to hundreds of thousand fold coverage) of complete viral genomes. However, differentiation of true low frequency variants from sequencing errors remains challenging. Results We developed a software package, V-Phaser 2, for inferring intrahost diversity within viral populations. This program adds three major new methodologies to the state of the art: a technique to efficiently utilize paired end read data for calling phased variants, a new strategy to represent and infer length polymorphisms, and an in line filter for erroneous calls arising from systematic sequencing artifacts. We have also heavily optimized memory and run time performance. This combination of algorithmic and technical advances allows V-Phaser 2 to fully utilize extremely deep paired end sequencing data (such as generated by Illumina sequencers) to accurately infer low frequency intrahost variants in viral populations in reasonable time on a standard desktop computer. V-Phaser 2 was validated and compared to both QuRe and the original V-Phaser on three datasets obtained from two viral populations: a mixture of eight known strains of West Nile Virus (WNV) sequenced on both 454 Titanium and Illumina MiSeq and a mixture of twenty-four known strains of WNV sequenced only on 454 Titanium. V-Phaser 2 outperformed the other two programs in both sensitivity and specificity while using more than five fold less time and memory. Conclusions We developed V-Phaser 2, a publicly available software tool (V-Phaser 2 can be accessed via: http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/v-phaser-2 and is freely available for academic use) that enables the efficient analysis of ultra-deep sequencing data produced by common next generation sequencing platforms for viral populations.
Affiliation(s)
- Xiao Yang
- Broad Institute of MIT & Harvard, 7 Cambridge Center, Cambridge, MA 02142 USA.
|
76
|
Prosperi MCF, Yin L, Nolan DJ, Lowe AD, Goodenow MM, Salemi M. Empirical validation of viral quasispecies assembly algorithms: state-of-the-art and challenges. Sci Rep 2013; 3:2837. [PMID: 24089188 PMCID: PMC3789152 DOI: 10.1038/srep02837] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Received: 04/03/2013] [Accepted: 09/13/2013] [Indexed: 11/22/2022]
Abstract
Next generation sequencing (NGS) is superseding Sanger technology for analysing intra-host viral populations, in terms of genome length and resolution. We introduce two new empirical validation data sets and test the available viral population assembly software. Two intra-host viral population 'quasispecies' samples (type-1 human immunodeficiency and hepatitis C virus) were Sanger-sequenced, and plasmid clone mixtures at controlled proportions were shotgun-sequenced using Roche's 454 sequencing platform. The performance of different assemblers was compared in terms of phylogenetic clustering and recombination with the Sanger clones. Phylogenetic clustering showed that all assemblers captured a proportion of the most divergent lineages, but none were able to provide a high precision/recall tradeoff. Estimated variant frequencies mildly correlated with the original. Given the limitations of currently available algorithms identified by our empirical validation, the development and exploitation of additional data sets is needed, in order to establish an efficient framework for viral population reconstruction using NGS.
Affiliation(s)
- Mattia C. F. Prosperi
- University of Manchester, Faculty of Medical and Human Sciences, Northwest Institute of Bio-Health Informatics, Centre for Health Informatics, Institute of Population Health, Manchester, UK
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Li Yin
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Florida Center for AIDS Research, Gainesville, Florida, USA
- David J. Nolan
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Amanda D. Lowe
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Florida Center for AIDS Research, Gainesville, Florida, USA
- Maureen M. Goodenow
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Florida Center for AIDS Research, Gainesville, Florida, USA
- Marco Salemi
- University of Florida, College of Medicine, Department of Pathology, Immunology and Laboratory Medicine, Gainesville, Florida, USA
- Florida Center for AIDS Research, Gainesville, Florida, USA
- Emerging Pathogens Institute, Gainesville, Florida, USA
|
77
|
Improved detection of rare HIV-1 variants using 454 pyrosequencing. PLoS One 2013; 8:e76502. [PMID: 24098517 PMCID: PMC3788733 DOI: 10.1371/journal.pone.0076502] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 03/04/2013] [Accepted: 08/27/2013] [Indexed: 01/21/2023]
Abstract
454 pyrosequencing, a massively parallel sequencing (MPS) technology, is often used to study HIV genetic variation. However, the substantial mismatch error rate of the PCR required to prepare HIV-containing samples for pyrosequencing has limited the detection of rare variants within viral populations to those present above ~1%. To improve detection of rare variants, we varied PCR enzymes and conditions to identify those that combined high sensitivity with a low error rate. Substitution errors were found to vary up to 3-fold between the different enzymes tested. The sensitivity of each enzyme, which impacts the number of templates amplified for pyrosequencing, was shown to vary, although not consistently across genes and different samples. We also describe an amplicon-based method to improve the consistency of read coverage over stretches of the HIV-1 genome. Twenty-two primers were designed to amplify 11 overlapping amplicons in the HIV-1 clade B gag-pol and env gp120 coding regions to encompass 4.7 kb of the viral genome per sample at sensitivities as low as 0.01-0.2%.
|
78
|
Iyer S, Bouzek H, Deng W, Larsen B, Casey E, Mullins JI. Quality score based identification and correction of pyrosequencing errors. PLoS One 2013; 8:e73015. [PMID: 24039850 PMCID: PMC3764156 DOI: 10.1371/journal.pone.0073015] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 12/19/2012] [Accepted: 07/22/2013] [Indexed: 12/26/2022]
Abstract
Massively-parallel DNA sequencing using the 454/pyrosequencing platform allows in-depth probing of diverse sequence populations, such as within an HIV-1 infected individual. Analysis of this sequence data, however, remains challenging due to the shorter read lengths relative to that obtained by Sanger sequencing as well as errors introduced during DNA template amplification and during pyrosequencing. The ability to distinguish real variation from pyrosequencing errors with high sensitivity and specificity is crucial to interpreting sequence data. We introduce a new algorithm, CorQ (Correction through Quality), which utilizes the inherent base quality in a sequence-specific context to correct for homopolymer and non-homopolymer insertion and deletion (indel) errors. CorQ also takes uneven read mapping into account for correcting pyrosequencing miscall errors and it identifies and corrects carry forward errors. We tested the ability of CorQ to correctly call SNPs on a set of pyrosequences derived from ten viral genomes from an HIV-1 infected individual, as well as on six simulated pyrosequencing datasets generated using non-zero error rates to emulate errors introduced by PCR. When combined with the AmpliconNoise error correction method developed to remove ambiguities in signal intensities, we attained a 97% reduction in indel errors, a 98% reduction in carry forward errors, and >97% specificity of SNP detection. When compared to four other error correction methods, AmpliconNoise+CorQ performed at equal or higher SNP identification specificity, but the sensitivity of SNP detection was consistently higher (>98%) than other methods tested. This combined procedure will therefore permit examination of complex genetic populations with improved accuracy.
Affiliation(s)
- Shyamala Iyer
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- Heather Bouzek
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- Wenjie Deng
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- Brendan Larsen
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- Eleanor Casey
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- James I. Mullins
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
|
79
|
Deng W, Maust BS, Westfall DH, Chen L, Zhao H, Larsen BB, Iyer S, Liu Y, Mullins JI. Indel and Carryforward Correction (ICC): a new analysis approach for processing 454 pyrosequencing data. Bioinformatics 2013; 29:2402-9. [PMID: 23900188 DOI: 10.1093/bioinformatics/btt434] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 12/14/2022]
Abstract
MOTIVATION Pyrosequencing technology provides an important new approach to more extensively characterize diverse sequence populations and detect low frequency variants. However, the promise of this technology has been difficult to realize, as careful correction of sequencing errors is crucial to distinguish rare variants (∼1%) in an infected host with high sensitivity and specificity. RESULTS We developed a new approach, referred to as Indel and Carryforward Correction (ICC), to cluster sequences without substitutions and locally correct only indel and carryforward sequencing errors within clusters to ensure that no rare variants are lost. ICC performs sequence clustering in the order of (i) homopolymer indel patterns only, (ii) indel patterns only and (iii) carryforward errors only, without the requirement of a distance cutoff value. Overall, ICC removed 93-95% of sequencing errors found in control datasets. On pyrosequencing data from a PCR fragment derived from 15 HIV-1 plasmid clones mixed at various frequencies as low as 0.1%, ICC achieved the highest sensitivity and similar specificity compared with other commonly used error correction and variant calling algorithms. AVAILABILITY AND IMPLEMENTATION Source code is freely available for download at http://indra.mullins.microbiol.washington.edu/ICC. It is implemented in Perl and supported on Linux, Mac OS X and MS Windows.
Affiliation(s)
- Wenjie Deng
- Department of Microbiology, University of Washington School of Medicine, Seattle, WA 98195, USA
|
80
|
Huang A, Kantor R, DeLong A, Schreier L, Istrail S. QColors: an algorithm for conservative viral quasispecies reconstruction from short and non-contiguous next generation sequencing reads. In Silico Biol 2013; 11:193-201. [PMID: 23202421 PMCID: PMC5530257 DOI: 10.3233/isb-2012-0454] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 01/15/2023]
Abstract
Next generation sequencing technologies have recently been applied to characterize mutational spectra of the heterogeneous population of viral genotypes (known as a quasispecies) within HIV-infected patients. Such information is clinically relevant because minority genetic subpopulations of HIV within patients enable viral escape from selection pressures such as the immune response and antiretroviral therapy. However, methods for quasispecies sequence reconstruction from next generation sequencing reads are not yet widely used and remain an emerging area of research. Furthermore, the majority of research methodology in HIV has focused on 454 sequencing, while many next-generation sequencing platforms used in practice are limited to shorter read lengths relative to 454 sequencing. Little work has been done in determining how best to address the read length limitations of other platforms. The approach described here incorporates graph representations of both read differences and read overlap to conservatively determine the regions of the sequence with sufficient variability to separate quasispecies sequences. Within these tractable regions of quasispecies inference, we use constraint programming to solve for an optimal quasispecies subsequence determination via vertex coloring of the conflict graph, a representation which also lends itself to data with non-contiguous reads such as paired-end sequencing. We demonstrate the utility of the method by applying it to simulations based on actual intra-patient clonal HIV-1 sequencing data.
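The idea of coloring a read conflict graph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: reads are given as hypothetical `(start, sequence)` pairs, two reads conflict when they overlap and disagree at any shared position, and a greedy heuristic replaces the constraint-programming solver described in the abstract, so the number of colors found is an upper bound rather than a guaranteed optimum.

```python
def reads_conflict(r1, r2):
    """True if two (start, sequence) reads overlap and disagree in the overlap."""
    (s1, seq1), (s2, seq2) = r1, r2
    start = max(s1, s2)
    end = min(s1 + len(seq1), s2 + len(seq2))
    return any(seq1[p - s1] != seq2[p - s2] for p in range(start, end))


def build_conflict_graph(reads):
    """Adjacency sets: an edge joins reads that cannot come from the same haplotype."""
    adj = {i: set() for i in range(len(reads))}
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            if reads_conflict(reads[i], reads[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def greedy_coloring(adj):
    """Greedy vertex coloring, highest-degree vertices first (a heuristic)."""
    colors = {}
    for v in sorted(adj, key=lambda k: -len(adj[k])):
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


# Four toy reads drawn from two haplotypes that differ at position 2.
reads = [(0, "ACGT"), (2, "GTAA"), (0, "ACCT"), (2, "CTAA")]
graph = build_conflict_graph(reads)
coloring = greedy_coloring(graph)
```

Reads that share a color are mutually compatible and can be merged into one candidate haplotype; non-contiguous reads fit the same scheme because a pair with no overlap simply has no edge.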
Affiliation(s)
- Austin Huang
- Division of Infectious Disease, Computer Science Department, Brown University, Box 1910, Providence, RI 02912, USA.
|
81
|
Eyre DW, Cule ML, Griffiths D, Crook DW, Peto TEA, Walker AS, Wilson DJ. Detection of mixed infection from bacterial whole genome sequence data allows assessment of its role in Clostridium difficile transmission. PLoS Comput Biol 2013; 9:e1003059. [PMID: 23658511 PMCID: PMC3642043 DOI: 10.1371/journal.pcbi.1003059] [Citation(s) in RCA: 62] [Impact Index Per Article: 5.2] [Received: 12/31/2012] [Accepted: 03/28/2013] [Indexed: 01/31/2023]
Abstract
Bacterial whole genome sequencing offers the prospect of rapid and high precision investigation of infectious disease outbreaks. Close genetic relationships between microorganisms isolated from different infected cases suggest transmission is a strong possibility, whereas transmission between cases with genetically distinct bacterial isolates can be excluded. However, undetected mixed infections (infection with ≥2 unrelated strains of the same species where only one is sequenced) potentially impair exclusion of transmission with certainty, and may therefore limit the utility of this technique. We investigated the problem by developing a computationally efficient method for detecting mixed infection without the need for resource-intensive independent sequencing of multiple bacterial colonies. Given the relatively low density of single nucleotide polymorphisms within bacterial sequence data, direct reconstruction of mixed infection haplotypes from current short-read sequence data is not consistently possible. We therefore use a two-step maximum likelihood-based approach, assuming each sample contains up to two infecting strains. We jointly estimate the proportion of the infection arising from the dominant and minor strains, and the sequence divergence between these strains. In cases where mixed infection is confirmed, the dominant and minor haplotypes are then matched to a database of previously sequenced local isolates. We demonstrate the performance of our algorithm with in silico and in vitro mixed infection experiments, and apply it to transmission of an important healthcare-associated pathogen, Clostridium difficile. Using hospital ward movement data in a previously described stochastic transmission model, 15 pairs of cases enriched for likely transmission events associated with mixed infection were selected. Our method identified four previously undetected mixed infections, and a previously undetected transmission event, but no direct transmission between the pairs of cases under investigation. These results demonstrate that mixed infections can be detected without additional sequencing effort, and this will be important in assessing the extent of cryptic transmission in our hospitals.
Traditionally, outbreaks of infectious diseases are investigated by considering contact between cases and their exposure to possible sources of infection. This can be enhanced by using the genetic fingerprint of bacteria to rule out transmission between cases infected with unrelated strains. However, in some cases patients are infected with more than one strain of the same species of bacteria. This is known as mixed infection. With current methods, usually only one strain of bacteria is analysed, so transmission might be ruled out wrongly if there is a mixed infection. We developed a method that exploits new high-resolution genetic fingerprinting in bacteria to detect patients who are infected with multiple strains of the same bacterial species. We investigated the important healthcare-associated infection Clostridium difficile, revealing previously undetected mixed infections, and identifying a previously undetected transmission event. By interrogating a database of bacterial strains, our method deduced the mixed strain types, which we showed were not compatible with direct transmission among the patients under investigation. Our method can improve the sensitivity of outbreak investigation across different types of bacteria, which will ultimately help to reduce transmission in hospitals and the community.
Affiliation(s)
- David W Eyre
- Nuffield Department of Clinical Medicine, University of Oxford, John Radcliffe Hospital, Oxford, United Kingdom.
|
82
|
Töpfer A, Höper D, Blome S, Beer M, Beerenwinkel N, Ruggli N, Leifer I. Sequencing approach to analyze the role of quasispecies for classical swine fever. Virology 2013; 438:14-9. [PMID: 23415390 DOI: 10.1016/j.virol.2012.11.020] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Received: 10/08/2012] [Accepted: 11/28/2012] [Indexed: 10/27/2022]
Abstract
Classical swine fever virus (CSFV) is a positive-sense RNA virus with a high degree of genetic variability among isolates. High diversity is also found in virulence, with strains covering the complete spectrum from avirulent to highly virulent. The underlying genetic determinants are far from being understood. Since RNA polymerases of RNA viruses lack any proof-reading activity, different genome variants, called haplotypes, arise during replication. A set of haplotypes is referred to as a viral quasispecies. Genetic variability can be a fitness advantage by facilitating more effective escape from the host immune response. In order to investigate the correlation of quasispecies composition and virulence in vivo, we analyzed next-generation sequencing data of CSFV isolates of varying virulence. Viral samples from pigs infected with the highly virulent isolates "Koslov" and "Brescia" showed higher quasispecies diversity and more nucleotide variability, compared to samples of pigs infected with low and moderately virulent isolates.
Affiliation(s)
- Armin Töpfer
- Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, CH-4058 Basel, Switzerland.
|
83
|
Hick P, Gore K, Whittington R. Molecular epidemiology of betanodavirus—Sequence analysis strategies and quasispecies influence outbreak source attribution. Virology 2013; 436:15-23. [DOI: 10.1016/j.virol.2012.10.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 09/03/2012] [Revised: 09/28/2012] [Accepted: 10/05/2012] [Indexed: 11/24/2022]
|
84
|
Abstract
Advances in sequencing technologies and increased access to sequencing services have led to renewed interest in sequence and genome assembly. Concurrently, new applications for sequencing have emerged, including gene expression analysis, discovery of genomic variants and metagenomics, and each of these has different needs and challenges in terms of assembly. We survey the theoretical foundations that underlie modern assembly and highlight the options and practical trade-offs that need to be considered, focusing on how individual features address the needs of specific applications. We also review key software and the interplay between experimental design and efficacy of assembly.
Affiliation(s)
- Niranjan Nagarajan
- Computational and Systems Biology, Genome Institute of Singapore, 138672 Singapore
|
85
|
Schirmer M, Sloan WT, Quince C. Benchmarking of viral haplotype reconstruction programmes: an overview of the capacities and limitations of currently available programmes. Brief Bioinform 2012; 15:431-42. [PMID: 23257116 DOI: 10.1093/bib/bbs081] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Indexed: 12/16/2022]
Abstract
Viral haplotype reconstruction from a set of observed reads is one of the most challenging problems in bioinformatics today. Next-generation sequencing technologies enable us to detect single-nucleotide polymorphisms (SNPs) of haplotypes, even if the haplotypes appear at low frequencies. However, there are two major problems. First, we need to distinguish real SNPs from sequencing errors. Second, we need to determine which SNPs occur on the same haplotype, which cannot be inferred from the reads if the distance between SNPs on a haplotype exceeds the read length. We conducted an independent benchmarking study that directly compares the currently available viral haplotype reconstruction programmes. We also present nine in silico data sets that we generated to reflect biologically plausible populations. For these data sets, we simulated 454 and Illumina reads and applied the programmes to test their capacity to reconstruct whole genomes and individual genes. We developed a novel statistical framework to demonstrate the strengths and limitations of the programmes. Our benchmarking demonstrated that all the programmes we tested performed poorly when sequence divergence was low and failed to recover haplotype populations with rare haplotypes.
Affiliation(s)
- Melanie Schirmer
- University of Glasgow, Rankine Building, Oakfield Avenue, Glasgow G12 8LT, UK. Tel.: +44-141-330-6311.
|
86
|
Zagordi O, Däumer M, Beisel C, Beerenwinkel N. Read length versus depth of coverage for viral quasispecies reconstruction. PLoS One 2012; 7:e47046. [PMID: 23056573 PMCID: PMC3463535 DOI: 10.1371/journal.pone.0047046] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.6] [Received: 06/29/2012] [Accepted: 09/07/2012] [Indexed: 02/07/2023]
Abstract
Recent advancements of sequencing technology have opened up unprecedented opportunities in many application areas. Virus samples can now be sequenced efficiently with very deep coverage to infer the genetic diversity of the underlying virus populations. Several sequencing platforms with different underlying technologies and performance characteristics are available for viral diversity studies. Here, we investigate how the differences between two common platforms provided by 454/Roche and Illumina affect viral diversity estimation and the reconstruction of viral haplotypes. Using a mixture of ten HIV clones sequenced with both platforms and additional simulation experiments, we assessed the trade-off between sequencing coverage, read length, and error rate. For fixed costs, short Illumina reads can be generated at higher coverage and allow for detecting variants at lower frequencies. They can also be sufficient to assess the diversity of the sample if sequences are dissimilar enough, but, in general, assembly of full-length haplotypes is feasible only with the longer 454/Roche reads. The quantitative comparison highlights the advantages and disadvantages of both platforms and provides guidance for the design of viral diversity studies.
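The link between coverage and the lowest detectable variant frequency can be made concrete with a small binomial calculation. The sketch below is an idealization and not the analysis from the paper: it assumes reads sample the population independently, ignores sequencing error entirely, and treats a variant as detected once at least `min_reads` reads carry it. The function names and the 95% target are illustrative choices.

```python
from math import comb


def detection_probability(coverage, freq, min_reads=1):
    """P(at least min_reads of `coverage` reads carry a variant at frequency freq)."""
    p_fewer = sum(
        comb(coverage, k) * freq**k * (1 - freq) ** (coverage - k)
        for k in range(min_reads)
    )
    return 1 - p_fewer


def min_coverage(freq, min_reads=1, target=0.95):
    """Smallest coverage at which the detection probability reaches `target`."""
    n = min_reads
    while detection_probability(n, freq, min_reads) < target:
        n += 1
    return n
```

Under these assumptions, seeing a 1% variant at least once with 95% probability needs a coverage of 299 reads, and the requirement grows quickly once several supporting reads are demanded to guard against errors.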
Affiliation(s)
- Osvaldo Zagordi
- Institute of Medical Virology, University of Zurich, Zurich, Switzerland
- Martin Däumer
- Institute of Immunology and Genetics, Kaiserslautern, Germany
- Christian Beisel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
|
87
|
Beerenwinkel N, Günthard HF, Roth V, Metzner KJ. Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data. Front Microbiol 2012; 3:329. [PMID: 22973268 PMCID: PMC3438994 DOI: 10.3389/fmicb.2012.00329] [Citation(s) in RCA: 171] [Impact Index Per Article: 13.2] [Received: 06/15/2012] [Accepted: 08/24/2012] [Indexed: 12/17/2022] Open
Abstract
Many viruses, including the clinically relevant RNA viruses HIV (human immunodeficiency virus) and HCV (hepatitis C virus), exist in large populations and display high genetic heterogeneity within and between infected hosts. Assessing intra-patient viral genetic diversity is essential for understanding the evolutionary dynamics of viruses, for designing effective vaccines, and for the success of antiviral therapy. Next-generation sequencing (NGS) technologies allow the rapid and cost-effective acquisition of thousands to millions of short DNA sequences from a single sample. However, this approach entails several challenges in experimental design and computational data analysis. Here, we review the entire process of inferring viral diversity from sample collection to computing measures of genetic diversity. We discuss sample preparation, including reverse transcription and amplification, and the effect of experimental conditions on diversity estimates due to in vitro base substitutions, insertions, deletions, and recombination. The use of different NGS platforms and their sequencing error profiles are compared in the context of various applications of diversity estimation, ranging from the detection of single nucleotide variants (SNVs) to the reconstruction of whole-genome haplotypes. We describe the statistical and computational challenges arising from these technical artifacts, and we review existing approaches, including available software, for their solution. Finally, we discuss open problems, and highlight successful biomedical applications and potential future clinical use of NGS to estimate viral diversity.
Affiliation(s)
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- Swiss Institute of Bioinformatics, Basel, Switzerland
- Huldrych F. Günthard
- Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, University of Zurich, Zurich, Switzerland
- Volker Roth
- Department of Mathematics and Computer Science, University of Basel, Basel, Switzerland
- Karin J. Metzner
- Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, University of Zurich, Zurich, Switzerland
|
88
|
Macalalad AR, Zody MC, Charlebois P, Lennon NJ, Newman RM, Malboeuf CM, Ryan EM, Boutwell CL, Power KA, Brackney DE, Pesko KN, Levin JZ, Ebel GD, Allen TM, Birren BW, Henn MR. Highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data. PLoS Comput Biol 2012; 8:e1002417. [PMID: 22438797 PMCID: PMC3305335 DOI: 10.1371/journal.pcbi.1002417] [Citation(s) in RCA: 102] [Impact Index Per Article: 7.8] [Received: 11/11/2011] [Accepted: 01/20/2012] [Indexed: 11/19/2022] Open
Abstract
Viruses diversify over time within hosts, often undercutting the effectiveness of host defenses and therapeutic interventions. To design successful vaccines and therapeutics, it is critical to better understand viral diversification, including comprehensively characterizing the genetic variants in viral intra-host populations and modeling changes from transmission through the course of infection. Massively parallel sequencing technologies can overcome the cost constraints of older sequencing methods and obtain the high sequence coverage needed to detect rare genetic variants (<1%) within an infected host, and to assay variants without prior knowledge. Critical to interpreting deep sequence data sets is the ability to distinguish biological variants from process errors with high sensitivity and specificity. To address this challenge, we describe V-Phaser, an algorithm able to recognize rare biological variants in mixed populations. V-Phaser uses covariation (i.e. phasing) between observed variants to increase sensitivity and an expectation maximization algorithm that iteratively recalibrates base quality scores to increase specificity. Overall, V-Phaser achieved >97% sensitivity and >97% specificity on control read sets. On data derived from a patient after four years of HIV-1 infection, V-Phaser detected 2,015 variants across the ∼10 kb genome, including 603 rare variants (<1% frequency) detected only using phase information. V-Phaser identified variants at frequencies down to 0.2%, comparable to the detection threshold of allele-specific PCR, a method that requires prior knowledge of the variants. The high sensitivity and specificity of V-Phaser enable identifying and tracking changes in low frequency variants in mixed populations such as RNA viruses.
New sequencing technologies provide unprecedented resolution to study pathogen populations, such as the single stranded RNA viruses HIV, dengue (DENV), and West Nile (WNV), and how they evolve within infected individuals in response to immune, therapeutic, and vaccine pressures. While these new technologies provide high volumes of data, these data contain process errors. To detect biological variants, especially those occurring at low frequencies in the population, these technologies require a method to differentiate biological variants from process errors with high sensitivity and specificity. To address this challenge, we introduce the V-Phaser algorithm, which distinguishes the covariation of biological variants from that of process errors. We validate the method by measuring how frequently it correctly identifies variants and errors on actual read sets with known variation. Further, using data derived from a patient following four years of HIV-1 infection, we show that V-Phaser can detect biological variants at frequencies comparable to approaches that require prior knowledge. V-Phaser is available for download at: http://www.broadinstitute.org/scientific-community/software.
Affiliation(s)
- Alexander R. Macalalad
- Broad Institute of MIT & Harvard, Cambridge, Massachusetts, United States of America
- Department of Biostatistics, Harvard University, Boston, Massachusetts, United States of America
- Michael C. Zody
- Broad Institute of MIT & Harvard, Cambridge, Massachusetts, United States of America
- Patrick Charlebois
- Broad Institute of MIT & Harvard, Cambridge, Massachusetts, United States of America
- Niall J. Lennon
- Broad Institute of MIT & Harvard, Cambridge, Massachusetts, United States of America
- Ruchi M. Newman
- Broad Institute of MIT & Harvard, Cambridge, Massachusetts, United States of America
- Christine M. Malboeuf
- Broad Institute of MIT & Harvard, Cambridge, Massachusetts, United States of America
- Elizabeth M. Ryan
- Broad Institute of MIT & Harvard, Cambridge, Massachusetts, United States of America
- Christian L. Boutwell
- Ragon Institute of MGH, MIT and Harvard, Boston, Massachusetts, United States of America
- Karen A. Power
- Ragon Institute of MGH, MIT and Harvard, Boston, Massachusetts, United States of America
- Doug E. Brackney
- Department of Pathology, University of New Mexico School of Medicine, Albuquerque, New Mexico, United States of America
- Kendra N. Pesko
- Department of Pathology, University of New Mexico School of Medicine, Albuquerque, New Mexico, United States of America
- Joshua Z. Levin
- Broad Institute of MIT & Harvard, Cambridge, Massachusetts, United States of America
- Gregory D. Ebel
- Department of Pathology, University of New Mexico School of Medicine, Albuquerque, New Mexico, United States of America
- Todd M. Allen
- Ragon Institute of MGH, MIT and Harvard, Boston, Massachusetts, United States of America
- Bruce W. Birren
- Broad Institute of MIT & Harvard, Cambridge, Massachusetts, United States of America
- Matthew R. Henn
- Broad Institute of MIT & Harvard, Cambridge, Massachusetts, United States of America
|