51
Byrd AL, Perez-Rogers JF, Manimaran S, Castro-Nallar E, Toma I, McCaffrey T, Siegel M, Benson G, Crandall KA, Johnson WE. Clinical PathoScope: rapid alignment and filtration for accurate pathogen identification in clinical samples using unassembled sequencing data. BMC Bioinformatics 2014; 15:262. [PMID: 25091138 PMCID: PMC4131054 DOI: 10.1186/1471-2105-15-262] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Received: 11/07/2013] [Accepted: 07/31/2014] [Indexed: 11/17/2022]
Abstract
Background: The use of sequencing technologies to investigate the microbiome of a sample can positively impact patient healthcare by providing therapeutic targets for personalized disease treatment. However, these samples contain genomic sequences from various sources that complicate the identification of pathogens. Results: Here we present Clinical PathoScope, a pipeline to rapidly and accurately remove host contamination, isolate microbial reads, and identify potential disease-causing pathogens. We have accomplished three essential tasks in the development of Clinical PathoScope. First, we developed an optimized framework for pathogen identification using a computational subtraction methodology in concordance with read trimming and ambiguous read reassignment. Second, we have demonstrated the ability of our approach to identify multiple pathogens in a single clinical sample, accurately identify pathogens at the subspecies level, and determine the nearest phylogenetic neighbor of novel or highly mutated pathogens using real clinical sequencing data. Finally, we have shown that Clinical PathoScope outperforms previously published pathogen identification methods with regard to computational speed, sensitivity, and specificity. Conclusions: Clinical PathoScope is the only pathogen identification method currently available that can identify multiple pathogens from mixed samples and distinguish between very closely related species and strains in samples with very few reads per pathogen. Furthermore, Clinical PathoScope does not rely on genome assembly and thus can more rapidly complete the analysis of a clinical sample when compared with current assembly-based methods. Clinical PathoScope is freely available at: http://sourceforge.net/projects/pathoscope/.
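The computational subtraction step named in this abstract can be illustrated with a minimal sketch. It is not the Clinical PathoScope implementation: the function `subtract_host` and its inputs (a set of read IDs that aligned to the host, and a map from read IDs to the target genomes they aligned to) are hypothetical stand-ins for whatever an aligner would report.

```python
from typing import Dict, List, Set


def subtract_host(
    read_ids: List[str],
    host_hits: Set[str],
    target_hits: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Keep reads that align to at least one target genome and to no host genome.

    All three inputs are hypothetical: in practice they would be derived
    from alignment output, which this sketch does not parse.
    """
    kept: Dict[str, List[str]] = {}
    for rid in read_ids:
        if rid in host_hits:
            # Host-derived (or host-like) read: subtract it.
            continue
        genomes = target_hits.get(rid, [])
        if genomes:
            # Non-host read with one or more candidate pathogen genomes.
            kept[rid] = genomes
    return kept
```

Reads that survive with more than one candidate genome are the ambiguous reads that a later reassignment step would have to resolve.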
Affiliation(s)
- Keith A Crandall
- Department of Bioinformatics, Boston University, Boston, MA, USA.
52
Abstract
High-throughput DNA sequencing has revolutionized the study of cancer genomics with numerous discoveries that are relevant to cancer diagnosis and treatment. The latest sequencing and analysis methods have successfully identified somatic alterations, including single-nucleotide variants, insertions and deletions, copy-number aberrations, structural variants and gene fusions. Additional computational techniques have proved useful for defining the mutations, genes and molecular networks that drive diverse cancer phenotypes and that determine clonal architectures in tumour samples. Collectively, these tools have advanced the study of genomic, transcriptomic and epigenomic alterations in cancer, and their association to clinical properties. Here, we review cancer genomics software and the insights that have been gained from their application.
53
Naccache SN, Federman S, Veeraraghavan N, Zaharia M, Lee D, Samayoa E, Bouquet J, Greninger AL, Luk KC, Enge B, Wadford DA, Messenger SL, Genrich GL, Pellegrino K, Grard G, Leroy E, Schneider BS, Fair JN, Martínez MA, Isa P, Crump JA, DeRisi JL, Sittler T, Hackett J, Miller S, Chiu CY. A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome Res 2014; 24:1180-92. [PMID: 24899342 PMCID: PMC4079973 DOI: 10.1101/gr.171934.113] [Citation(s) in RCA: 334] [Impact Index Per Article: 30.4] [Indexed: 12/20/2022]
Abstract
Unbiased next-generation sequencing (NGS) approaches enable comprehensive pathogen detection in the clinical microbiology laboratory and have numerous applications for public health surveillance, outbreak investigation, and the diagnosis of infectious diseases. However, practical deployment of the technology is hindered by the bioinformatics challenge of analyzing results accurately and in a clinically relevant timeframe. Here we describe SURPI (“sequence-based ultrarapid pathogen identification”), a computational pipeline for pathogen identification from complex metagenomic NGS data generated from clinical samples, and demonstrate use of the pipeline in the analysis of 237 clinical samples comprising more than 1.1 billion sequences. Deployable on both cloud-based and standalone servers, SURPI leverages two state-of-the-art aligners for accelerated analyses, SNAP and RAPSearch, which are as accurate as existing bioinformatics tools but orders of magnitude faster in performance. In fast mode, SURPI detects viruses and bacteria by scanning data sets of 7–500 million reads in 11 min to 5 h, while in comprehensive mode, all known microorganisms are identified, followed by de novo assembly and protein homology searches for divergent viruses in 50 min to 16 h. SURPI has also directly contributed to real-time microbial diagnosis in acutely ill patients, underscoring its potential key role in the development of unbiased NGS-based clinical assays in infectious diseases that demand rapid turnaround times.
Affiliation(s)
- Samia N Naccache
- Department of Laboratory Medicine, UCSF, San Francisco, California 94107, USA; UCSF-Abbott Viral Diagnostics and Discovery Center, San Francisco, California 94107, USA
- Scot Federman
- Department of Laboratory Medicine, UCSF, San Francisco, California 94107, USA; UCSF-Abbott Viral Diagnostics and Discovery Center, San Francisco, California 94107, USA
- Narayanan Veeraraghavan
- Department of Laboratory Medicine, UCSF, San Francisco, California 94107, USA; UCSF-Abbott Viral Diagnostics and Discovery Center, San Francisco, California 94107, USA
- Matei Zaharia
- Department of Computer Science, University of California, Berkeley, California 94720, USA
- Deanna Lee
- Department of Laboratory Medicine, UCSF, San Francisco, California 94107, USA; UCSF-Abbott Viral Diagnostics and Discovery Center, San Francisco, California 94107, USA
- Erik Samayoa
- Department of Laboratory Medicine, UCSF, San Francisco, California 94107, USA; UCSF-Abbott Viral Diagnostics and Discovery Center, San Francisco, California 94107, USA
- Jerome Bouquet
- Department of Laboratory Medicine, UCSF, San Francisco, California 94107, USA; UCSF-Abbott Viral Diagnostics and Discovery Center, San Francisco, California 94107, USA
- Ka-Cheung Luk
- Abbott Diagnostics, Abbott Park, Illinois 60064, USA
- Barryett Enge
- Viral and Rickettsial Disease Laboratory, California Department of Public Health, Richmond, California 94804, USA
- Debra A Wadford
- Viral and Rickettsial Disease Laboratory, California Department of Public Health, Richmond, California 94804, USA
- Sharon L Messenger
- Viral and Rickettsial Disease Laboratory, California Department of Public Health, Richmond, California 94804, USA
- Gillian L Genrich
- Department of Laboratory Medicine, UCSF, San Francisco, California 94107, USA
- Kristen Pellegrino
- Department of Family and Community Medicine, UCSF, San Francisco, California 94143, USA
- Gilda Grard
- Viral Emergent Diseases Unit, Centre International de Recherches Médicales de Franceville, Franceville, BP 769, Gabon
- Eric Leroy
- Viral Emergent Diseases Unit, Centre International de Recherches Médicales de Franceville, Franceville, BP 769, Gabon
- Joseph N Fair
- Metabiota, Inc., San Francisco, California 94104, USA
- Miguel A Martínez
- Departamento de Genética del Desarrollo y Fisiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, 62260, Mexico
- Pavel Isa
- Departamento de Genética del Desarrollo y Fisiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, 62260, Mexico
- John A Crump
- Division of Infectious Diseases and International Health and the Duke Global Health Institute, Duke University Medical Center, Durham, North Carolina 27708, USA; Kilimanjaro Christian Medical Centre, Moshi, Kilimanjaro, 7393, Tanzania; Centre for International Health, University of Otago, Dunedin, 9054, New Zealand
- Joseph L DeRisi
- Department of Biochemistry, UCSF, San Francisco, California 94107, USA
- Taylor Sittler
- Department of Laboratory Medicine, UCSF, San Francisco, California 94107, USA
- John Hackett
- Abbott Diagnostics, Abbott Park, Illinois 60064, USA
- Steve Miller
- Department of Laboratory Medicine, UCSF, San Francisco, California 94107, USA; UCSF-Abbott Viral Diagnostics and Discovery Center, San Francisco, California 94107, USA
- Charles Y Chiu
- Department of Laboratory Medicine, UCSF, San Francisco, California 94107, USA; UCSF-Abbott Viral Diagnostics and Discovery Center, San Francisco, California 94107, USA; Department of Medicine, Division of Infectious Diseases, UCSF, San Francisco, California 94143, USA
54
Caboche S, Audebert C, Hot D. High-Throughput Sequencing, a Versatile Weapon to Support Genome-Based Diagnosis in Infectious Diseases: Applications to Clinical Bacteriology. Pathogens 2014; 3:258-79. [PMID: 25437800 PMCID: PMC4243446 DOI: 10.3390/pathogens3020258] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 11/20/2013] [Revised: 02/28/2014] [Accepted: 03/20/2014] [Indexed: 12/19/2022]
Abstract
The recent progresses of high-throughput sequencing (HTS) technologies enable easy and cost-reduced access to whole genome sequencing (WGS) or re-sequencing. HTS associated with adapted, automatic and fast bioinformatics solutions for sequencing applications promises an accurate and timely identification and characterization of pathogenic agents. Many studies have demonstrated that data obtained from HTS analysis have allowed genome-based diagnosis, which has been consistent with phenotypic observations. These proofs of concept are probably the first steps toward the future of clinical microbiology. From concept to routine use, many parameters need to be considered to promote HTS as a powerful tool to help physicians and clinicians in microbiological investigations. This review highlights the milestones to be completed toward this purpose.
Affiliation(s)
- Ségolène Caboche
- FRE 3642 Molecular and Cellular Medecine, CNRS, Institut Pasteur de Lille and University Lille Nord de France, Lille 59019, France.
- David Hot
- FRE 3642 Molecular and Cellular Medecine, CNRS, Institut Pasteur de Lille and University Lille Nord de France, Lille 59019, France.
55
Hong C, Manimaran S, Shen Y, Perez-Rogers JF, Byrd AL, Castro-Nallar E, Crandall KA, Johnson WE. PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples. Microbiome 2014; 2:33. [PMID: 25225611 PMCID: PMC4164323 DOI: 10.1186/2049-2618-2-33] [Citation(s) in RCA: 165] [Impact Index Per Article: 15.0] [Received: 03/14/2014] [Accepted: 07/23/2014] [Indexed: 05/20/2023]
Abstract
BACKGROUND: Recent innovations in sequencing technologies have provided researchers with the ability to rapidly characterize the microbial content of an environmental or clinical sample with unprecedented resolution. These approaches are producing a wealth of information that is providing novel insights into the microbial ecology of the environment and human health. However, these sequencing-based approaches produce large and complex datasets that require efficient and sensitive computational analysis workflows. Many recent tools for analyzing metagenomic-sequencing data have emerged; however, these approaches often suffer from issues of specificity and efficiency, and typically do not include a complete metagenomic analysis framework. RESULTS: We present PathoScope 2.0, a complete bioinformatics framework for rapidly and accurately quantifying the proportions of reads from individual microbial strains present in metagenomic sequencing data from environmental or clinical samples. The pipeline performs all necessary computational analysis steps, including reference genome library extraction and indexing, read quality control and alignment, strain identification, and summarization and annotation of results. We rigorously evaluated PathoScope 2.0 using simulated data and data from the 2011 outbreak of Shiga-toxigenic Escherichia coli O104:H4. CONCLUSIONS: The results show that PathoScope 2.0 is a complete, highly sensitive, and efficient approach for metagenomic analysis that outperforms alternative approaches in scope, speed, and accuracy. The PathoScope 2.0 pipeline software is freely available for download at: http://sourceforge.net/projects/pathoscope/.
Affiliation(s)
- Changjin Hong
- Computational Biomedicine, Boston University School of Medicine, 72 E Concord St. E645, Boston, MA 02118, USA
- Solaiappan Manimaran
- Computational Biomedicine, Boston University School of Medicine, 72 E Concord St. E645, Boston, MA 02118, USA
- Ying Shen
- Computational Biomedicine, Boston University School of Medicine, 72 E Concord St. E645, Boston, MA 02118, USA
- Joseph F Perez-Rogers
- Computational Biomedicine, Boston University School of Medicine, 72 E Concord St. E645, Boston, MA 02118, USA
- Bioinformatics Program, Boston University, Boston, MA 02125, USA
- Allyson L Byrd
- Bioinformatics Program, Boston University, Boston, MA 02125, USA
- Eduardo Castro-Nallar
- Computational Biology Institute, George Washington University, Ashburn, VA 20147, USA
- Keith A Crandall
- Computational Biology Institute, George Washington University, Ashburn, VA 20147, USA
- William Evan Johnson
- Computational Biomedicine, Boston University School of Medicine, 72 E Concord St. E645, Boston, MA 02118, USA
- Bioinformatics Program, Boston University, Boston, MA 02125, USA
56
Deorowicz S, Grabowski S. Data compression for sequencing data. Algorithms Mol Biol 2013; 8:25. [PMID: 24252160 PMCID: PMC3868316 DOI: 10.1186/1748-7188-8-25] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.0] [Received: 04/25/2013] [Accepted: 09/25/2013] [Indexed: 12/12/2022]
Abstract
Post-Sanger sequencing methods produce tons of data, and there is a general agreement that the challenge to store and process them must be addressed with data compression. In this review we first answer the question “why compression” in a quantitative manner. Then we also answer the questions “what” and “how”, by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we go back to the question “why compression” and give other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.
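One of the simplest compression ideas for nucleotide data is fixed-width packing: a four-letter alphabet needs only 2 bits per base instead of the 8 bits of an ASCII character. The sketch below illustrates that idea only; it is not taken from the review, the names `pack` and `unpack` are hypothetical, and it assumes input restricted to the uppercase letters A, C, G and T (no N or quality values).

```python
# 2-bit code per base; assumes the sequence contains only A, C, G, T.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack four bases into each byte, most significant bits first."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking order stays the same.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # The original length must be stored separately to drop padding bases.
    return "".join(bases[:length])
```

The packed form is a quarter of the ASCII size before any statistical or reference-based compression is applied, which is why the original length has to be kept alongside the bytes.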
57
Borozan I, Watt SN, Ferretti V. Evaluation of alignment algorithms for discovery and identification of pathogens using RNA-Seq. PLoS One 2013; 8:e76935. [PMID: 24204709 PMCID: PMC3813700 DOI: 10.1371/journal.pone.0076935] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 05/28/2013] [Accepted: 09/04/2013] [Indexed: 01/02/2023]
Abstract
Next-generation sequencing technologies provide an unparallelled opportunity for the characterization and discovery of known and novel viruses. Because viruses are known to have the highest mutation rates when compared to eukaryotic and bacterial organisms, we assess the extent to which eleven well-known alignment algorithms (BLAST, BLAT, BWA, BWA-SW, BWA-MEM, BFAST, Bowtie2, Novoalign, GSNAP, SHRiMP2 and STAR) can be used for characterizing mutated and non-mutated viral sequences--including those that exhibit RNA splicing--in transcriptome samples. To evaluate aligners objectively we developed a realistic RNA-Seq simulation and evaluation framework (RiSER) and propose a new combined score to rank aligners for viral characterization in terms of their precision, sensitivity and alignment accuracy. We used RiSER to simulate both human and viral read sequences and suggest the best set of aligners for viral sequence characterization in human transcriptome samples. Our results show that significant and substantial differences exist between aligners and that a digital-subtraction-based viral identification framework can and should use different aligners for different parts of the process. We determine the extent to which mutated viral sequences can be effectively characterized and show that more sensitive aligners such as BLAST, BFAST, SHRiMP2, BWA-SW and GSNAP can accurately characterize substantially divergent viral sequences with up to 15% overall sequence mutation rate. We believe that the results presented here will be useful to researchers choosing aligners for viral sequence characterization using next-generation sequencing data.
Affiliation(s)
- Ivan Borozan
- Informatics and Bio-computing, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Stuart N. Watt
- Informatics and Bio-computing, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Vincent Ferretti
- Informatics and Bio-computing, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
58
Sensitive detection of viral transcripts in human tumor transcriptomes. PLoS Comput Biol 2013; 9:e1003228. [PMID: 24098097 PMCID: PMC3789765 DOI: 10.1371/journal.pcbi.1003228] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 12/18/2012] [Accepted: 06/04/2013] [Indexed: 02/07/2023]
Abstract
In excess of % of human cancer incidents have a viral cofactor. Epidemiological studies of idiopathic human cancers indicate that additional tumor viruses remain to be discovered. Recent advances in sequencing technology have enabled systematic screenings of human tumor transcriptomes for viral transcripts. However, technical problems such as low abundances of viral transcripts in large volumes of sequencing data, viral sequence divergence, and homology between viral and human factors significantly confound identification of tumor viruses. We have developed a novel computational approach for detecting viral transcripts in human cancers that takes the aforementioned confounding factors into account and is applicable to a wide variety of viruses and tumors. We apply the approach to conducting the first systematic search for viruses in neuroblastoma, the most common cancer in infancy. The diverse clinical progression of this disease as well as related epidemiological and virological findings are highly suggestive of a pathogenic cofactor. However, a viral etiology of neuroblastoma is currently contested. We mapped transcriptomes of neuroblastoma as well as positive and negative controls to the human and all known viral genomes in order to detect both known and unknown viruses. Analysis of controls, comparisons with related methods, and statistical estimates demonstrate the high sensitivity of our approach. Detailed investigation of putative viral transcripts within neuroblastoma samples did not provide evidence for the existence of any known human viruses. Likewise, de-novo assembly and analysis of chimeric transcripts did not result in expression signatures associated with novel human pathogens. While confounding factors such as sample dilution or viral clearance in progressed tumors may mask viral cofactors in the data, in principle, this is rendered less likely by the high sensitivity of our approach and the number of biological replicates analyzed. 
Therefore, our results suggest that frequent viral cofactors of metastatic neuroblastoma are unlikely. Many human cancers are caused by infections with tumor viruses and identification of these pathogens is considered a critical contribution to cancer prevention. Deep sequencing enables us to systematically investigate viral nucleotide signatures in order to either verify or exclude the existence of viruses in idiopathic human cancers. We have developed Virana, a novel computational approach for identifying tumor viruses in human cancers that is applicable to a wide variety of tumors and viruses. Virana firstly addresses several important biological confounding factors that may hinder successful detection of these pathogens. We applied our approach in the first systematic search for cancer-causing viruses in metastatic neuroblastoma, the most common form of cancer in infancy. Although the heterogeneous clinical progression of this disease as well as epidemiological and virological findings are suggestive of a pathogenic cofactor, the viral etiology of neuroblastoma is currently contested. We conducted an analysis of experimental controls, comparisons with related approaches, as well as statistical analyses in order to validate our method. In spite of the high sensitivity of our approach, analyses of neuroblastoma transcriptomes did not provide evidence for the existence of any known or unknown human viruses. Our results therefore suggest that frequent viral cofactors of metastatic neuroblastoma are unlikely.
59
DeBoever C, Reid EG, Smith EN, Wang X, Dumaop W, Harismendy O, Carson D, Richman D, Masliah E, Frazer KA. Whole transcriptome sequencing enables discovery and analysis of viruses in archived primary central nervous system lymphomas. PLoS One 2013; 8:e73956. [PMID: 24023918 PMCID: PMC3762708 DOI: 10.1371/journal.pone.0073956] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 05/22/2013] [Accepted: 07/24/2013] [Indexed: 11/23/2022]
Abstract
Primary central nervous system lymphomas (PCNSL) have a dramatically increased prevalence among persons living with AIDS and are known to be associated with human Epstein Barr virus (EBV) infection. Previous work suggests that in some cases, co-infection with other viruses may be important for PCNSL pathogenesis. Viral transcription in tumor samples can be measured using next generation transcriptome sequencing. We demonstrate the ability of transcriptome sequencing to identify viruses, characterize viral expression, and identify viral variants by sequencing four archived AIDS-related PCNSL tissue samples and analyzing raw sequencing reads. EBV was detected in all four PCNSL samples and cytomegalovirus (CMV), JC polyomavirus (JCV), and HIV were also discovered, consistent with clinical diagnoses. CMV was found to express three long non-coding RNAs recently reported as expressed during active infection. Single nucleotide variants were observed in each of the viruses observed and three indels were found in CMV. No viruses were found in several control tumor types including 32 diffuse large B-cell lymphoma samples. This study demonstrates the ability of next generation transcriptome sequencing to accurately identify viruses, including DNA viruses, in solid human cancer tissue samples.
Affiliation(s)
- Christopher DeBoever
- Moores Cancer Center, University of California San Diego, La Jolla, California, United States of America
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, California, United States of America
- Erin G. Reid
- Moores Cancer Center, University of California San Diego, La Jolla, California, United States of America
- Erin N. Smith
- Moores Cancer Center, University of California San Diego, La Jolla, California, United States of America
- Department of Pediatrics and Rady Children’s Hospital, University of California San Diego, La Jolla, California, United States of America
- Xiaoyun Wang
- Moores Cancer Center, University of California San Diego, La Jolla, California, United States of America
- Department of Pediatrics and Rady Children’s Hospital, University of California San Diego, La Jolla, California, United States of America
- Wilmar Dumaop
- Department of Pathology, University of California San Diego, La Jolla, California, United States of America
- Olivier Harismendy
- Moores Cancer Center, University of California San Diego, La Jolla, California, United States of America
- Department of Pediatrics and Rady Children’s Hospital, University of California San Diego, La Jolla, California, United States of America
- Clinical and Translational Research Institute, University of California San Diego, La Jolla, California, United States of America
- Dennis Carson
- Moores Cancer Center, University of California San Diego, La Jolla, California, United States of America
- Douglas Richman
- VA San Diego Healthcare System and Center for AIDS Research, University of California San Diego, La Jolla, California, United States of America
- Eliezer Masliah
- Department of Neurosciences, University of California San Diego, La Jolla, California, United States of America
- Kelly A. Frazer
- Moores Cancer Center, University of California San Diego, La Jolla, California, United States of America
- Department of Pediatrics and Rady Children’s Hospital, University of California San Diego, La Jolla, California, United States of America
- Clinical and Translational Research Institute, University of California San Diego, La Jolla, California, United States of America
- Institute for Genomic Medicine, University of California San Diego, La Jolla, California, United States of America
60
Francis OE, Bendall M, Manimaran S, Hong C, Clement NL, Castro-Nallar E, Snell Q, Schaalje GB, Clement MJ, Crandall KA, Johnson WE. Pathoscope: species identification and strain attribution with unassembled sequencing data. Genome Res 2013; 23:1721-9. [PMID: 23843222 PMCID: PMC3787268 DOI: 10.1101/gr.150151.112] [Citation(s) in RCA: 100] [Impact Index Per Article: 8.3] [Indexed: 11/24/2022]
Abstract
Emerging next-generation sequencing technologies have revolutionized the collection of genomic data for applications in bioforensics, biosurveillance, and for use in clinical settings. However, to make the most of these new data, new methodology needs to be developed that can accommodate large volumes of genetic data in a computationally efficient manner. We present a statistical framework to analyze raw next-generation sequence reads from purified or mixed environmental or targeted infected tissue samples for rapid species identification and strain attribution against a robust database of known biological agents. Our method, Pathoscope, capitalizes on a Bayesian statistical framework that accommodates information on sequence quality, mapping quality, and provides posterior probabilities of matches to a known database of target genomes. Importantly, our approach also incorporates the possibility that multiple species can be present in the sample and considers cases when the sample species/strain is not in the reference database. Furthermore, our approach can accurately discriminate between very closely related strains of the same species with very little coverage of the genome and without the need for multiple alignment steps, extensive homology searches, or genome assembly--which are time-consuming and labor-intensive steps. We demonstrate the utility of our approach on genomic data from purified and in silico "environmental" samples from known bacterial agents impacting human health for accuracy assessment and comparison with other approaches.
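The idea of weighing reads that match several candidate genomes can be illustrated with a generic mixture-model expectation-maximization loop. This is a simplified sketch and not the Pathoscope model: it has no priors, no sequence-quality terms and no handling of genomes missing from the database. The function `em_reassign` and its input (one dictionary of hypothetical alignment scores per read) are assumptions made for illustration, and every read is assumed to have at least one positive score.

```python
from typing import Dict, List


def em_reassign(
    read_scores: List[Dict[str, float]], iterations: int = 50
) -> Dict[str, float]:
    """Estimate genome proportions from reads with ambiguous alignments.

    read_scores holds one dict per read mapping genome name to a positive
    alignment score (hypothetical input, not a real aligner format).
    """
    genomes = sorted({g for scores in read_scores for g in scores})
    # Start from uniform proportions.
    pi = {g: 1.0 / len(genomes) for g in genomes}
    for _ in range(iterations):
        counts = {g: 0.0 for g in genomes}
        # E-step: split each read across its candidates in proportion to
        # (current genome proportion) x (alignment score).
        for scores in read_scores:
            weighted = {g: pi[g] * s for g, s in scores.items()}
            total = sum(weighted.values())
            for g, w in weighted.items():
                counts[g] += w / total
        # M-step: the new proportion is the expected share of reads.
        pi = {g: counts[g] / len(read_scores) for g in genomes}
    return pi
```

With three reads unique to one genome and one read shared equally between two, the shared read is pulled toward the genome that already has unique support, which is the behaviour that lets closely related strains be separated at low coverage.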
Affiliation(s)
- Owen E Francis
- Department of Statistics, Brigham Young University, Provo, Utah 84602, USA
61
Abstract
Pathogen discovery is critically important to infectious diseases and public health. Nearly all new outbreaks are caused by the emergence of novel viruses. Genomic tools for pathogen discovery include consensus PCR, microarrays, and deep sequencing. Downstream studies are often necessary to link a candidate novel virus to a disease.
Viral pathogen discovery is of critical importance to clinical microbiology, infectious diseases, and public health. Genomic approaches for pathogen discovery, including consensus polymerase chain reaction (PCR), microarrays, and unbiased next-generation sequencing (NGS), have the capacity to comprehensively identify novel microbes present in clinical samples. Although numerous challenges remain to be addressed, including the bioinformatics analysis and interpretation of large datasets, these technologies have been successful in rapidly identifying emerging outbreak threats, screening vaccines and other biological products for microbial contamination, and discovering novel viruses associated with both acute and chronic illnesses. Downstream studies such as genome assembly, epidemiologic screening, and a culture system or animal model of infection are necessary to establish an association of a candidate pathogen with disease.
62
Wang Q, Jia P, Zhao Z. VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data. PLoS One 2013; 8:e64465. [PMID: 23717618 PMCID: PMC3663743 DOI: 10.1371/journal.pone.0064465] [Citation(s) in RCA: 105] [Impact Index Per Article: 8.8] [Received: 03/12/2013] [Accepted: 04/08/2013] [Indexed: 11/23/2022]
Abstract
Next generation sequencing (NGS) technologies allow us to explore virus interactions with host genomes that lead to carcinogenesis or other diseases; however, this effort is largely hindered by the dearth of efficient computational tools. Here, we present a new tool, VirusFinder, for the identification of viruses and their integration sites in host genomes using NGS data, including whole transcriptome sequencing (RNA-Seq), whole genome sequencing (WGS), and targeted sequencing data. VirusFinder’s unique features include the characterization of insertion loci of virus of arbitrary type in the host genome and high accuracy and computational efficiency as a result of its well-designed pipeline. The source code as well as additional data of VirusFinder is publicly available at http://bioinfo.mc.vanderbilt.edu/VirusFinder/.
Affiliation(s)
- Qingguo Wang
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
- Peilin Jia
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
- Center for Quantitative Sciences, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
- Zhongming Zhao
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
- Department of Psychiatry, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
- Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
- Center for Quantitative Sciences, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
|
63
|
Fimereli D, Detours V, Konopka T. TriageTools: tools for partitioning and prioritizing analysis of high-throughput sequencing data. Nucleic Acids Res 2013; 41:e86. [PMID: 23408855 PMCID: PMC3627586 DOI: 10.1093/nar/gkt094] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 12/22/2022] Open
Abstract
High-throughput sequencing is becoming a popular research tool but carries with it considerable costs in terms of computation time, data storage and bandwidth. Meanwhile, some research applications focusing on individual genes or pathways do not necessitate processing of a full sequencing dataset. Thus, it is desirable to partition a large dataset into smaller, manageable, but relevant pieces. We present a toolkit for partitioning raw sequencing data that includes a method for extracting reads that are likely to map onto pre-defined regions of interest. We show the method can be used to extract information about genes of interest from DNA or RNA sequencing samples in a fraction of the time and disk space required to process and store a full dataset. We report speedup factors between 2.6 and 96, depending on settings and samples used. The software is available at http://www.sourceforge.net/projects/triagetools/.
Affiliation(s)
- Danai Fimereli
- IRIBHM, Université Libre de Bruxelles, 808 Route de Lennick, 1070 Brussels, Belgium
|
64
|
Naeem R, Rashid M, Pain A. READSCAN: a fast and scalable pathogen discovery program with accurate genome relative abundance estimation. Bioinformatics 2012. [PMID: 23193222 PMCID: PMC3562070 DOI: 10.1093/bioinformatics/bts684] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.2] [Indexed: 12/19/2022]
Abstract
Summary: READSCAN is a highly scalable parallel program to identify non-host sequences (of potential pathogen origin) and estimate their genome relative abundance in high-throughput sequence datasets. READSCAN accurately classified human and viral sequences on a 20.1 million reads simulated dataset in <27 min using a small Beowulf compute cluster with 16 nodes (Supplementary Material). Availability: http://cbrc.kaust.edu.sa/readscan Contact: arnab.pain@kaust.edu.sa or raeece.naeem@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Raeece Naeem
- Pathogen Genomics Laboratory, Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal-23955-6900, Kingdom of Saudi Arabia.
|
65
|
Transcriptome sequencing in Sezary syndrome identifies Sezary cell and mycosis fungoides-associated lncRNAs and novel transcripts. Blood 2012; 120:3288-97. [PMID: 22936659 DOI: 10.1182/blood-2012-04-423061] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.5] [Indexed: 01/05/2023] Open
Abstract
Sézary syndrome (SS) is an aggressive cutaneous T-cell lymphoma (CTCL) of unknown etiology in which malignant cells circulate in the peripheral blood. To identify viral elements, gene fusions, and gene expression patterns associated with this lymphoma, flow cytometry was used to obtain matched pure populations of malignant Sézary cells (SCs) versus nonmalignant CD4(+) T cells from 3 patients for whole transcriptome, paired-end sequencing with an average depth of 112 million reads per sample. Pathway analysis of differentially expressed genes identified mis-regulation of PI3K/Akt, TGFβ, and NF-κB pathways as well as T-cell receptor signaling. Bioinformatic analysis did not detect either nonhuman transcripts to support a viral etiology of SS or recurrently expressed gene fusions, but it did identify 21 SC-associated annotated long noncoding RNAs (lncRNAs). Transcriptome assembly by multiple algorithms identified 13 differentially expressed unannotated transcripts termed Sézary cell-associated transcripts (SeCATs) that include 12 predicted lncRNAs and a novel transcript with coding potential. High-throughput sequencing targeting the 3' end of polyadenylated transcripts in archived tumors from 24 additional patients with tumor-stage CTCL confirmed the differential expression of SC-associated lncRNAs and SeCATs in CTCL. Our findings characterize the SS transcriptome and support recent reports that implicate lncRNA dysregulation in human malignancies.
|
66
|
Borozan I, Wilson S, Blanchette P, Laflamme P, Watt SN, Krzyzanowski PM, Sircoulomb F, Rottapel R, Branton PE, Ferretti V. CaPSID: a bioinformatics platform for computational pathogen sequence identification in human genomes and transcriptomes. BMC Bioinformatics 2012; 13:206. [PMID: 22901030 PMCID: PMC3464663 DOI: 10.1186/1471-2105-13-206] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Received: 02/01/2012] [Accepted: 07/18/2012] [Indexed: 01/05/2023] Open
Abstract
Background It is now well established that nearly 20% of human cancers are caused by infectious agents, and the list of human oncogenic pathogens will grow in the future for a variety of cancer types. Whole tumor transcriptome and genome sequencing by next-generation sequencing technologies presents an unparalleled opportunity for pathogen detection and discovery in human tissues but requires development of new genome-wide bioinformatics tools. Results Here we present CaPSID (Computational Pathogen Sequence IDentification), a comprehensive bioinformatics platform for identifying, querying and visualizing both exogenous and endogenous pathogen nucleotide sequences in tumor genomes and transcriptomes. CaPSID includes a scalable, high performance database for data storage and a web application that integrates the genome browser JBrowse. CaPSID also provides useful metrics for sequence analysis of pre-aligned BAM files, such as gene and genome coverage, and is optimized to run efficiently on multiprocessor computers with low memory usage. Conclusions To demonstrate the usefulness and efficiency of CaPSID, we carried out a comprehensive analysis of both a simulated dataset and transcriptome samples from ovarian cancer. CaPSID correctly identified all of the human and pathogen sequences in the simulated dataset, while in the ovarian dataset CaPSID’s predictions were successfully validated in vitro.
Affiliation(s)
- Ivan Borozan
- Ontario Institute for Cancer Research, MaRS Centre, Toronto, Ontario, Canada
|