1
|
Choudhari J, Choubey J, Verma M, Chatterjee T, Sahariah B. Metagenomics: the boon for microbial world knowledge and current challenges. Bioinformatics 2022. [DOI: 10.1016/b978-0-323-89775-4.00022-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
2
|
Zhang F, Kang HM. FASTQuick: rapid and comprehensive quality assessment of raw sequence reads. Gigascience 2021; 10:giab004. [PMID: 33511994 PMCID: PMC7844880 DOI: 10.1093/gigascience/giab004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2020] [Revised: 11/23/2020] [Accepted: 01/14/2021] [Indexed: 11/26/2022]
Abstract
BACKGROUND Rapid and thorough quality assessment of sequenced genomes on an ultra-high-throughput scale is crucial for successful large-scale genomic studies. Comprehensive quality assessment typically requires full genome alignment, which costs a substantial amount of computational resources and turnaround time. Existing tools are either computationally expensive owing to full alignment or lacking essential quality metrics by skipping read alignment. FINDINGS We developed a set of rapid and accurate methods to produce comprehensive quality metrics directly from a subset of raw sequence reads (from whole-genome or whole-exome sequencing) without full alignment. Our methods offer orders of magnitude faster turnaround time than existing full alignment-based methods while providing comprehensive and sophisticated quality metrics, including estimates of genetic ancestry and cross-sample contamination. CONCLUSIONS By rapidly and comprehensively performing the quality assessment, our tool will help investigators detect potential issues in ultra-high-throughput sequence reads in real time within a low computational cost at the early stages of the analyses, ensuring high-quality downstream results and preventing unexpected loss in time, money, and invaluable specimens.
Affiliation(s)
- Fan Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, 100 Washington Ave, Ann Arbor, MI 48109, USA
- Hyun Min Kang
  - Department of Biostatistics, University of Michigan School of Public Health, 1415 Washington Heights, Ann Arbor, MI 48109, USA
|
3
|
Liu X, Yan Z, Wu C, Yang Y, Li X, Zhang G. FastProNGS: fast preprocessing of next-generation sequencing reads. BMC Bioinformatics 2019; 20:345. [PMID: 31208325 PMCID: PMC6580563 DOI: 10.1186/s12859-019-2936-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 09/17/2018] [Accepted: 06/06/2019] [Indexed: 11/10/2022]
Abstract
Background Next-generation sequencing technology is developing rapidly and the vast amount of data that is generated needs to be preprocessed for downstream analyses. However, until now, software that can efficiently make all the quality assessments and filtration of raw data is still lacking. Results We developed FastProNGS to integrate the quality control process with automatic adapter removal. Parallel processing was implemented to speed up the process by allocating multiple threads. Compared with similar up-to-date preprocessing tools, FastProNGS is by far the fastest. Read information before and after filtration can be output in plain-text, JSON, or HTML formats with user-friendly visualization. Conclusions FastProNGS is a rapid, standardized, and user-friendly tool for preprocessing next-generation sequencing data within minutes. It is an all-in-one software that is convenient for bulk data analysis. It is also very flexible and can implement different functions using different user-set parameter combinations.
Affiliation(s)
- Xiaoshuang Liu
  - Megagenomics Corporation, Beijing, China; Ping An Health Technology, Beijing, China
- Zhenhe Yan
  - Megagenomics Corporation, Beijing, China; NLP, R&D Suning, Beijing, China
- Chao Wu
  - National Renewable Energy Laboratory CO, Colorado, USA
- Yang Yang
  - Megagenomics Corporation, Beijing, China
- Xiaomin Li
  - Megagenomics Corporation, Beijing, China
|
4
|
SeqSQC: A Bioconductor Package for Evaluating the Sample Quality of Next-generation Sequencing Data. Genomics Proteomics & Bioinformatics 2019; 17:211-218. [PMID: 30959223 PMCID: PMC6620264 DOI: 10.1016/j.gpb.2018.07.006] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/31/2018] [Revised: 06/06/2018] [Accepted: 07/27/2018] [Indexed: 11/25/2022]
Abstract
As next-generation sequencing (NGS) technology has become widely used to identify genetic causal variants for various diseases and traits, a number of packages for checking NGS data quality have sprung up in public domains. In addition to the quality of sequencing data, sample quality issues, such as gender mismatch, abnormal inbreeding coefficient, cryptic relatedness, and population outliers, can also have fundamental impact on downstream analysis. However, there is a lack of tools specialized in identifying problematic samples from NGS data, often due to the limitation of sample size and variant counts. We developed SeqSQC, a Bioconductor package, to automate and accelerate sample cleaning in NGS data of any scale. SeqSQC is designed for efficient data storage and access, and equipped with interactive plots for intuitive data visualization to expedite the identification of problematic samples. SeqSQC is available at http://bioconductor.org/packages/SeqSQC.
|
5
|
Al Kawam A, Khatri S, Datta A. A Survey of Software and Hardware Approaches to Performing Read Alignment in Next Generation Sequencing. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1202-1213. [PMID: 27362989 DOI: 10.1109/tcbb.2016.2586070] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 06/06/2023]
Abstract
Computational genomics is an emerging field that is enabling us to reveal the origins of life and the genetic basis of diseases such as cancer. Next Generation Sequencing (NGS) technologies have unleashed a wealth of genomic information by producing immense amounts of raw data. Before any functional analysis can be applied to this data, read alignment is applied to find the genomic coordinates of the produced sequences. Alignment algorithms have evolved rapidly with the advancement in sequencing technology, striving to achieve biological accuracy at the expense of increasing space and time complexities. Hardware approaches have been proposed to accelerate the computational bottlenecks created by the alignment process. Although several hardware approaches have achieved remarkable speedups, most have overlooked important biological features, which have hampered their widespread adoption by the genomics community. In this paper, we provide a brief biological introduction to genomics and NGS. We discuss the most popular next generation read alignment tools and algorithms. Furthermore, we provide a comprehensive survey of the hardware implementations used to accelerate these algorithms.
|
6
|
Worthey EA. Analysis and Annotation of Whole-Genome or Whole-Exome Sequencing Derived Variants for Clinical Diagnosis. Current Protocols in Human Genetics 2017; 95:9.24.1-9.24.28. [PMID: 29044471 DOI: 10.1002/cphg.49] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 12/17/2022]
Abstract
Over the last 10 years, next-generation sequencing (NGS) has transformed genomic research through substantial advances in technology and reduction in the cost of sequencing, and also in the systems required for analysis of these large volumes of data. This technology is now being used as a standard molecular diagnostic test in some clinical settings. The advances in sequencing have come so rapidly that the major bottleneck in identification of causal variants is no longer the sequencing or analysis (given access to appropriate tools), but rather clinical interpretation. Interpretation of genetic findings in a complex and ever changing clinical setting is scarcely a new challenge, but the task is increasingly complex in clinical genome-wide sequencing given the dramatic increase in dataset size and complexity. This increase requires application of appropriate interpretation tools, as well as development and application of appropriate methodologies and standard procedures. This unit provides an overview of these items. Specific challenges related to implementation of genome-wide sequencing in a clinical setting are discussed. © 2017 by John Wiley & Sons, Inc.
|
7
|
Deng CH, Plummer KM, Jones DAB, Mesarich CH, Shiller J, Taranto AP, Robinson AJ, Kastner P, Hall NE, Templeton MD, Bowen JK. Comparative analysis of the predicted secretomes of Rosaceae scab pathogens Venturia inaequalis and V. pirina reveals expanded effector families and putative determinants of host range. BMC Genomics 2017; 18:339. [PMID: 28464870 PMCID: PMC5412055 DOI: 10.1186/s12864-017-3699-1] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 12/09/2016] [Accepted: 04/11/2017] [Indexed: 02/06/2023]
Abstract
BACKGROUND Fungal plant pathogens belonging to the genus Venturia cause damaging scab diseases of members of the Rosaceae. In terms of economic impact, the most important of these are V. inaequalis, which infects apple, and V. pirina, which is a pathogen of European pear. Given that Venturia fungi colonise the sub-cuticular space without penetrating plant cells, it is assumed that effectors that contribute to virulence and determination of host range will be secreted into this plant-pathogen interface. Thus the predicted secretomes of a range of isolates of Venturia with distinct host-ranges were interrogated to reveal putative proteins involved in virulence and pathogenicity. RESULTS Genomes of Venturia pirina (one European pear scab isolate) and Venturia inaequalis (three apple scab, and one loquat scab, isolates) were sequenced and the predicted secretomes of each isolate identified. RNA-Seq was conducted on the apple-specific V. inaequalis isolate Vi1 (in vitro and infected apple leaves) to highlight virulence and pathogenicity components of the secretome. Genes encoding over 600 small secreted proteins (candidate effectors) were identified, most of which are novel to Venturia, with expansion of putative effector families a feature of the genus. Numerous genes with similarity to Leptosphaeria maculans AvrLm6 and the Verticillium spp. Ave1 were identified. Candidates for avirulence effectors with cognate resistance genes involved in race-cultivar specificity were identified, as were putative proteins involved in host-species determination. Candidate effectors were found, on average, to be in regions of relatively low gene-density and in closer proximity to repeats (e.g. transposable elements), compared with core eukaryotic genes. CONCLUSIONS Comparative secretomics has revealed candidate effectors from Venturia fungal plant pathogens that attack pome fruit. 
Effectors that are putative determinants of host range were identified; both those that may be involved in race-cultivar and host-species specificity. Since many of the effector candidates are in close proximity to repetitive sequences this may point to a possible mechanism for the effector gene family expansion observed and a route to diversification via transposition and repeat-induced point mutation.
Affiliation(s)
- Cecilia H. Deng
  - The New Zealand Institute for Plant & Food Research Limited (PFR), Auckland, New Zealand
- Kim M. Plummer
  - Animal, Plant & Soil Sciences Department, AgriBio Centre for AgriBioscience, La Trobe University, Melbourne, Victoria, Australia
  - Plant Biosecurity Cooperative Research Centre, Bruce, ACT, Australia
- Darcy A. B. Jones
  - Animal, Plant & Soil Sciences Department, AgriBio Centre for AgriBioscience, La Trobe University, Melbourne, Victoria, Australia
  - Present Address: The Centre for Crop and Disease Management, Curtin University, Bentley, Australia
- Carl H. Mesarich
  - The New Zealand Institute for Plant & Food Research Limited (PFR), Auckland, New Zealand
  - The School of Biological Sciences, University of Auckland, Auckland, New Zealand
  - Present Address: Institute of Agriculture & Environment, Massey University, Palmerston North, New Zealand
- Jason Shiller
  - Animal, Plant & Soil Sciences Department, AgriBio Centre for AgriBioscience, La Trobe University, Melbourne, Victoria, Australia
  - Present Address: INRA-Angers, Beaucouzé, Cedex, France
- Adam P. Taranto
  - Animal, Plant & Soil Sciences Department, AgriBio Centre for AgriBioscience, La Trobe University, Melbourne, Victoria, Australia
  - Plant Sciences Division, Research School of Biology, The Australian National University, Canberra, Australia
- Andrew J. Robinson
  - Animal, Plant & Soil Sciences Department, AgriBio Centre for AgriBioscience, La Trobe University, Melbourne, Victoria, Australia
  - Life Sciences Computation Centre, Victorian Life Sciences Computation Initiative (VLSCI), Victoria, Australia
- Patrick Kastner
  - Animal, Plant & Soil Sciences Department, AgriBio Centre for AgriBioscience, La Trobe University, Melbourne, Victoria, Australia
- Nathan E. Hall
  - Animal, Plant & Soil Sciences Department, AgriBio Centre for AgriBioscience, La Trobe University, Melbourne, Victoria, Australia
  - Life Sciences Computation Centre, Victorian Life Sciences Computation Initiative (VLSCI), Victoria, Australia
- Matthew D. Templeton
  - The New Zealand Institute for Plant & Food Research Limited (PFR), Auckland, New Zealand
  - The School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Joanna K. Bowen
  - The New Zealand Institute for Plant & Food Research Limited (PFR), Auckland, New Zealand
|
8
|
Merino GA, Murua YA, Fresno C, Sendoya JM, Golubicki M, Iseas S, Coraglio M, Podhajcer OL, Llera AS, Fernández EA. TarSeqQC: Quality control on targeted sequencing experiments in R. Hum Mutat 2017; 38:494-502. [DOI: 10.1002/humu.23204] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 09/28/2016] [Revised: 02/06/2017] [Accepted: 02/19/2017] [Indexed: 01/15/2023]
Affiliation(s)
- Gabriela A. Merino
  - Ua Area Cs. Agr. Ing. Bio. Y S, Conicet; Universidad Católica de Córdoba; Córdoba Argentina
  - Facultad de Ciencias Exactas; Físicas y Naturales; Universidad Nacional de Córdoba; Córdoba Argentina
- Yanina A. Murua
  - Fundación Instituto Leloir and Instituto de Investigaciones Bioquímicas de Buenos Aires-CONICET; Buenos Aires Argentina
- Cristóbal Fresno
  - Ua Area Cs. Agr. Ing. Bio. Y S, Conicet; Universidad Católica de Córdoba; Córdoba Argentina
- Juan M. Sendoya
  - Fundación Instituto Leloir and Instituto de Investigaciones Bioquímicas de Buenos Aires-CONICET; Buenos Aires Argentina
- Mariano Golubicki
  - Intergrupo Argentino para el Tratamiento de los Tumores Gastrointestinales; Buenos Aires Argentina
- Soledad Iseas
  - Hospital de Gastroenterología “Dr. Carlos Bonorino Udaondo”; Buenos Aires Argentina
- Mariana Coraglio
  - Hospital de Gastroenterología “Dr. Carlos Bonorino Udaondo”; Buenos Aires Argentina
- Osvaldo L. Podhajcer
  - Fundación Instituto Leloir and Instituto de Investigaciones Bioquímicas de Buenos Aires-CONICET; Buenos Aires Argentina
- Andrea S. Llera
  - Fundación Instituto Leloir and Instituto de Investigaciones Bioquímicas de Buenos Aires-CONICET; Buenos Aires Argentina
- Elmer A. Fernández
  - Ua Area Cs. Agr. Ing. Bio. Y S, Conicet; Universidad Católica de Córdoba; Córdoba Argentina
  - Facultad de Ciencias Exactas; Físicas y Naturales; Universidad Nacional de Córdoba; Córdoba Argentina
|
9
|
Yan W, Dai J, Xu Z, Shi D, Chen D, Xu X, Song K, Yao Y, Li L, Ikegawa S, Teng H, Jiang Q. Novel WISP3 mutations causing progressive pseudorheumatoid dysplasia in two Chinese families. Hum Genome Var 2016; 3:16041. [PMID: 28018607 PMCID: PMC5143363 DOI: 10.1038/hgv.2016.41] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 06/08/2016] [Revised: 09/30/2016] [Accepted: 09/30/2016] [Indexed: 02/06/2023]
Abstract
Progressive pseudorheumatoid dysplasia (PPD) is a rare disease caused by mutations in the gene for Wnt1-inducible signaling pathway protein 3 (WISP3). Here, we report the clinical and radiographic manifestations of two Chinese PPD patients. We performed whole-exome sequencing for one patient and sequenced the WISP3 for the other. Three WISP3 mutations (c.396T>G, c.721T>G and c.679dup) were identified; the two missense mutations were novel. Our study expanded the WISP3 mutation spectrum.
Affiliation(s)
- Wenjin Yan
  - Department of Sports Medicine and Adult Reconstructive Surgery, Drum Tower Hospital, School of Medicine, Nanjing University, Jiangsu, China
- Jin Dai
  - Department of Sports Medicine and Adult Reconstructive Surgery, Drum Tower Hospital, School of Medicine, Nanjing University, Jiangsu, China
- Zhihong Xu
  - Department of Sports Medicine and Adult Reconstructive Surgery, Drum Tower Hospital, School of Medicine, Nanjing University, Jiangsu, China
- Dongquan Shi
  - Department of Sports Medicine and Adult Reconstructive Surgery, Drum Tower Hospital, School of Medicine, Nanjing University, Jiangsu, China
- Dongyang Chen
  - Department of Sports Medicine and Adult Reconstructive Surgery, Drum Tower Hospital, School of Medicine, Nanjing University, Jiangsu, China
- Xingquan Xu
  - Department of Sports Medicine and Adult Reconstructive Surgery, Drum Tower Hospital, School of Medicine, Nanjing University, Jiangsu, China
- Kai Song
  - Department of Sports Medicine and Adult Reconstructive Surgery, Drum Tower Hospital, School of Medicine, Nanjing University, Jiangsu, China
- Yao Yao
  - Department of Sports Medicine and Adult Reconstructive Surgery, Drum Tower Hospital, School of Medicine, Nanjing University, Jiangsu, China
- Lan Li
  - Department of Sports Medicine and Adult Reconstructive Surgery, Drum Tower Hospital, School of Medicine, Nanjing University, Jiangsu, China
- Shiro Ikegawa
  - Laboratory for Bone and Joint Diseases, Center for Integrative Medical Sciences, Tokyo, Japan
- Huajian Teng
  - Laboratory for Bone and Joint Disease, Model Animal Research Center (MARC), Nanjing University, Jiangsu, China
- Qing Jiang
  - Department of Sports Medicine and Adult Reconstructive Surgery, Drum Tower Hospital, School of Medicine, Nanjing University, Jiangsu, China
|
10
|
Parikh HI, Koparde VN, Bradley SP, Buck GA, Sheth NU. MeFiT: merging and filtering tool for Illumina paired-end reads for 16S rRNA amplicon sequencing. BMC Bioinformatics 2016; 17:491. [PMID: 27905885 PMCID: PMC5134250 DOI: 10.1186/s12859-016-1358-1] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.9] [Received: 06/11/2016] [Accepted: 11/19/2016] [Indexed: 12/18/2022]
Abstract
Background Recent advances in next-generation sequencing have revolutionized genomic research. 16S rRNA amplicon sequencing using paired-end sequencing on the MiSeq platform from Illumina, Inc., is being used to characterize the composition and dynamics of extremely complex/diverse microbial communities. For this analysis on the Illumina platform, merging and quality filtering of paired-end reads are essential first steps in data analysis to ensure the accuracy and reliability of downstream analysis. Results We have developed the Merging and Filtering Tool (MeFiT) to combine these pre-processing steps into one simple, intuitive pipeline. MeFiT invokes CASPER (context-aware scheme for paired-end reads) for merging paired-end reads and provides users the option to quality filter the reads using the traditional average Q-score metric or using a maximum expected error cut-off threshold. Conclusions MeFiT provides an open-source solution that permits users to merge and filter paired-end Illumina reads. The tool has been implemented in Python and the source code is freely available at https://github.com/nisheth/MeFiT.
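The two filtering metrics named in the MeFiT abstract above can be illustrated with a minimal sketch. This is not MeFiT's code: the function names, the thresholds, and the assumption of Phred+33 quality encoding are all hypothetical choices made for illustration. It uses the standard definitions, where a Phred score Q corresponds to an error probability of 10^(-Q/10) and the expected number of errors in a read is the sum of those probabilities.

```python
# Illustrative sketch only; names and thresholds are assumptions, not MeFiT's API.

def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def mean_q(quality_string):
    """Arithmetic mean of the per-base Phred scores."""
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores)


def expected_errors(quality_string):
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores(quality_string))


def passes_mean_q(quality_string, min_mean_q):
    """Keep the read if its average Q-score reaches the threshold."""
    return mean_q(quality_string) >= min_mean_q


def passes_max_ee(quality_string, max_ee):
    """Keep the read if its expected error count stays within the cut-off."""
    return expected_errors(quality_string) <= max_ee


if __name__ == "__main__":
    # Eight bases at Q40 ('I') followed by two bases at Q2 ('#').
    read_quality = "I" * 8 + "##"
    print("mean Q:", mean_q(read_quality))
    print("expected errors:", expected_errors(read_quality))
    print("passes mean Q >= 30:", passes_mean_q(read_quality, 30))
    print("passes max EE <= 1.0:", passes_max_ee(read_quality, 1.0))
```

In this example the read has a mean Q-score of 32.4, so it passes an average-Q threshold of 30, yet its expected error count is about 1.26, so it fails a maximum expected error cut-off of 1.0. A few very low-quality bases barely move the average but dominate the expected error sum, which is why the two criteria can disagree.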
Affiliation(s)
- Hardik I Parikh
  - Department of Microbiology and Immunology, Virginia Commonwealth University, Richmond, Virginia, USA
- Vishal N Koparde
  - Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, Virginia, USA
- Steven P Bradley
  - Department of Microbiology and Immunology, Virginia Commonwealth University, Richmond, Virginia, USA
- Gregory A Buck
  - Department of Microbiology and Immunology, Virginia Commonwealth University, Richmond, Virginia, USA
  - Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, Virginia, USA
- Nihar U Sheth
  - Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, Virginia, USA
|
11
|
Jaing CJ, McLoughlin KS, Thissen JB, Zemla A, Gardner SN, Vergez LM, Bourguet F, Mabery S, Fofanov VY, Koshinsky H, Jackson PJ. Identification of Genome-Wide Mutations in Ciprofloxacin-Resistant F. tularensis LVS Using Whole Genome Tiling Arrays and Next Generation Sequencing. PLoS One 2016; 11:e0163458. [PMID: 27668749 PMCID: PMC5036845 DOI: 10.1371/journal.pone.0163458] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 07/05/2016] [Accepted: 09/08/2016] [Indexed: 11/19/2022]
Abstract
Francisella tularensis is classified as a Class A bioterrorism agent by the U.S. government due to its high virulence and the ease with which it can be spread as an aerosol. It is a facultative intracellular pathogen and the causative agent of tularemia. Ciprofloxacin (Cipro) is a broad spectrum antibiotic effective against Gram-positive and Gram-negative bacteria. Increased Cipro resistance in pathogenic microbes is of serious concern when considering options for medical treatment of bacterial infections. Identification of genes and loci that are associated with Ciprofloxacin resistance will help advance the understanding of resistance mechanisms and may, in the future, provide better treatment options for patients. It may also provide information for development of assays that can rapidly identify Cipro-resistant isolates of this pathogen. In this study, we selected a large number of F. tularensis live vaccine strain (LVS) isolates that survived in progressively higher Ciprofloxacin concentrations, screened the isolates using a whole genome F. tularensis LVS tiling microarray and Illumina sequencing, and identified both known and novel mutations associated with resistance. Genes containing mutations encode DNA gyrase subunit A, a hypothetical protein, an asparagine synthase, a sugar transamine/perosamine synthetase and others. Structural modeling performed on these proteins provides insights into the potential function of these proteins and how they might contribute to Cipro resistance mechanisms.
Affiliation(s)
- Crystal J. Jaing
  - Physical Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, 94550, United States of America
- Kevin S. McLoughlin
  - Computations Directorate, Lawrence Livermore National Laboratory, Livermore, CA, 94550, United States of America
- James B. Thissen
  - Physical Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, 94550, United States of America
- Adam Zemla
  - Computations Directorate, Lawrence Livermore National Laboratory, Livermore, CA, 94550, United States of America
- Shea N. Gardner
  - Computations Directorate, Lawrence Livermore National Laboratory, Livermore, CA, 94550, United States of America
- Lisa M. Vergez
  - Physical Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, 94550, United States of America
- Feliza Bourguet
  - Physical Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, 94550, United States of America
- Shalini Mabery
  - Physical Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, 94550, United States of America
- Paul J. Jackson
  - Physical Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, 94550, United States of America
|
12
|
Adusumalli S, Mohd Omar MF, Soong R, Benoukraf T. Methodological aspects of whole-genome bisulfite sequencing analysis. Brief Bioinform 2014; 16:369-79. [DOI: 10.1093/bib/bbu016] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.7] [Received: 02/24/2014] [Accepted: 04/17/2014] [Indexed: 12/17/2022]
|
13
|
|
14
|
Worthey EA. Analysis and annotation of whole-genome or whole-exome sequencing-derived variants for clinical diagnosis. Current Protocols in Human Genetics 2013; 79:9.24.1-9.24.24. [PMID: 24510652 DOI: 10.1002/0471142905.hg0924s79] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 01/07/2023]
Abstract
Over the last several years, next-generation sequencing (NGS) has transformed genomic research through substantial advances in technology and reduction in the cost of sequencing, and also in the systems required for analysis of these large volumes of data. This technology is now being used as a standard molecular diagnostic test under particular circumstances in some clinical settings. The advances in sequencing have come so rapidly that the major bottleneck in identification of causal variants is no longer the sequencing but rather the analysis and interpretation. Interpretation of genetic findings in a clinical setting is scarcely a new challenge, but the task is increasingly complex in clinical genome-wide sequencing given the dramatic increase in dataset size and complexity. This increase requires the development of novel or repositioned analysis tools, methodologies, and processes. This unit provides an overview of these items. Specific challenges related to implementation in a clinical setting are discussed.
Affiliation(s)
- Elizabeth A Worthey
  - Department of Pediatrics, Medical College of Wisconsin, Milwaukee, Wisconsin
  - The Human and Molecular Genetics Center, Medical College of Wisconsin, Milwaukee, Wisconsin
  - Department of Computer Science, University of Wisconsin, Milwaukee, Wisconsin
|
15
|
Affiliation(s)
- Cadhla Firth
  - Center for Infection and Immunity, Mailman School of Public Health, Columbia University, New York, NY 10032
- W. Ian Lipkin
  - Center for Infection and Immunity, Mailman School of Public Health, Columbia University, New York, NY 10032
|
16
|
Yang X, Liu D, Liu F, Wu J, Zou J, Xiao X, Zhao F, Zhu B. HTQC: a fast quality control toolkit for Illumina sequencing data. BMC Bioinformatics 2013; 14:33. [PMID: 23363224 PMCID: PMC3571943 DOI: 10.1186/1471-2105-14-33] [Citation(s) in RCA: 112] [Impact Index Per Article: 10.2] [Received: 09/07/2012] [Accepted: 01/27/2013] [Indexed: 11/21/2022]
Abstract
Background Illumina sequencing platform is widely used in genome research. Sequence reads quality assessment and control are needed for downstream analysis. However, software that provides efficient quality assessment and versatile filtration methods is still lacking. Results We have developed a toolkit named HTQC – abbreviation of High-Throughput Quality Control – for sequence reads quality control, which consists of six programs for reads quality assessment, reads filtration and generation of graphic reports. Conclusions The HTQC toolkit can generate reads quality assessment faster than existing tools, providing guidance for reads filtration utilities that allow users to choose different strategies to remove low quality reads.
Affiliation(s)
- Xi Yang
  - CAS Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, Chinese Academy of Sciences, No. 1 West Beichen Road, Chaoyang District, Beijing, China
|
17
|
Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform 2013; 15:256-78. [PMID: 23341494 PMCID: PMC3956068 DOI: 10.1093/bib/bbs086] [Citation(s) in RCA: 335] [Impact Index Per Article: 30.5] [Indexed: 01/11/2023]
Abstract
Recent advances in genome sequencing technologies provide unprecedented opportunities to characterize individual genomic landscapes and identify mutations relevant for diagnosis and therapy. Specifically, whole-exome sequencing using next-generation sequencing (NGS) technologies is gaining popularity in the human genetics community due to the moderate costs, manageable data amounts and straightforward interpretation of analysis results. While whole-exome and, in the near future, whole-genome sequencing are becoming commodities, data analysis still poses significant challenges and led to the development of a plethora of tools supporting specific parts of the analysis workflow or providing a complete solution. Here, we surveyed 205 tools for whole-genome/whole-exome sequencing data analysis supporting five distinct analytical steps: quality assessment, alignment, variant identification, variant annotation and visualization. We report an overview of the functionality, features and specific requirements of the individual tools. We then selected 32 programs for variant identification, variant annotation and visualization, which were subjected to hands-on evaluation using four data sets: one set of exome data from two patients with a rare disease for testing identification of germline mutations, two cancer data sets for testing variant callers for somatic mutations, copy number variations and structural variations, and one semi-synthetic data set for testing identification of copy number variations. Our comprehensive survey and evaluation of NGS tools provides a valuable guideline for human geneticists working on Mendelian disorders, complex diseases and cancers.
Affiliation(s)
- Stephan Pabinger
- Division for Bioinformatics, Innsbruck Medical University, Innrain 80, 6020 Innsbruck, Austria. Tel.: +43-512-9003-71401; Fax: +43-512-9003-73100;
|
18
|
Pipes L, Li S, Bozinoski M, Palermo R, Peng X, Blood P, Kelly S, Weiss JM, Thierry-Mieg J, Thierry-Mieg D, Zumbo P, Chen R, Schroth GP, Mason CE, Katze MG. The non-human primate reference transcriptome resource (NHPRTR) for comparative functional genomics. Nucleic Acids Res 2012. [PMID: 23203872 PMCID: PMC3531109 DOI: 10.1093/nar/gks1268] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Indexed: 02/07/2023] Open
Abstract
RNA-based next-generation sequencing (RNA-Seq) provides a tremendous amount of new information regarding gene and transcript structure, expression and regulation. This is particularly true for non-coding RNAs where whole transcriptome analyses have revealed that much of the genome is transcribed and that many non-coding transcripts have widespread functionality. However, uniform resources for raw, cleaned and processed RNA-Seq data are sparse for most organisms and this is especially true for non-human primates (NHPs). Here, we describe a large-scale RNA-Seq data and analysis infrastructure, the NHP reference transcriptome resource (http://nhprtr.org); it presently hosts data from 12 species of primates, to be expanded to 15 species/subspecies spanning great apes, old world monkeys, new world monkeys and prosimians. Data are collected for each species using pools of RNA from comparable tissues. We provide data access in advance of its deposition at NCBI, as well as browsable tracks of alignments against the human genome using the UCSC genome browser. This resource will continue to host additional RNA-Seq data, alignments and assemblies as they are generated over the coming years and provide a key resource for the annotation of NHP genomes as well as informing primate studies on evolution, reproduction, infection, immunity and pharmacology.
Affiliation(s)
- Lenore Pipes
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, NY 10065, USA
|
19
|
Saxer G, Havlak P, Fox SA, Quance MA, Gupta S, Fofanov Y, Strassmann JE, Queller DC. Whole genome sequencing of mutation accumulation lines reveals a low mutation rate in the social amoeba Dictyostelium discoideum. PLoS One 2012; 7:e46759. [PMID: 23056439 PMCID: PMC3466296 DOI: 10.1371/journal.pone.0046759] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Received: 06/18/2012] [Accepted: 09/03/2012] [Indexed: 12/18/2022] Open
Abstract
Spontaneous mutations play a central role in evolution. Despite their importance, mutation rates are some of the most elusive parameters to measure in evolutionary biology. The combination of mutation accumulation (MA) experiments and whole-genome sequencing now makes it possible to estimate mutation rates by directly observing new mutations at the molecular level across the whole genome. We performed an MA experiment with the social amoeba Dictyostelium discoideum and sequenced the genomes of three randomly chosen lines using high-throughput sequencing to estimate the spontaneous mutation rate in this model organism. The mitochondrial mutation rate of 6.76×10⁻⁹, with a Poisson confidence interval of 4.1×10⁻⁹ to 9.5×10⁻⁹, per nucleotide per generation is slightly lower than estimates for other taxa. The mutation rate estimate for the nuclear DNA of 2.9×10⁻¹¹, with a Poisson confidence interval ranging from 7.4×10⁻¹³ to 1.6×10⁻¹⁰, is the lowest reported for any eukaryote. These results are consistent with low microsatellite mutation rates previously observed in D. discoideum and low levels of genetic variation observed in wild D. discoideum populations. In addition, D. discoideum has been shown to be quite resistant to DNA damage, which suggests an efficient DNA-repair mechanism that could be an adaptation to life in soil and frequent exposure to intracellular and extracellular mutagenic compounds. The social aspect of the life cycle of D. discoideum and a large portion of the genome under relaxed selection during vegetative growth could also select for a low mutation rate. This hypothesis is supported by a significantly lower mutation rate per cell division in multicellular eukaryotes compared with unicellular eukaryotes.
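The abstract above reports mutation rates with Poisson confidence intervals. As a rough illustration of how such an interval can be obtained, the sketch below computes an exact two-sided interval for a Poisson count by bisection on the Poisson CDF and divides it by an exposure (sites × generations). The function names, the bisection approach, and the `exposure` parameter are assumptions made for illustration; the abstract does not say which method or which mutation counts the authors used.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def _bisect_decreasing(f, lo, hi, iters=200):
    """Find the root of a decreasing function f with f(lo) > 0 > f(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_exact_ci(k, alpha=0.05):
    """Exact two-sided confidence interval for a Poisson mean, given count k."""
    # Upper limit solves P(X <= k | mu) = alpha / 2.
    search_max = k + 10 * math.sqrt(k + 1) + 20  # generous upper search bound
    upper = _bisect_decreasing(
        lambda mu: poisson_cdf(k, mu) - alpha / 2, 0.0, search_max)
    if k == 0:
        return 0.0, upper
    # Lower limit solves P(X <= k - 1 | mu) = 1 - alpha / 2.
    lower = _bisect_decreasing(
        lambda mu: poisson_cdf(k - 1, mu) - (1 - alpha / 2), 0.0, float(k))
    return lower, upper

def rate_ci(k, exposure, alpha=0.05):
    """Point estimate and interval for a per-site, per-generation rate.

    `exposure` is a hypothetical quantity: callable sites x generations.
    """
    lower, upper = poisson_exact_ci(k, alpha)
    return k / exposure, lower / exposure, upper / exposure
```

For a single observed event (`k = 1`) the 95% limits are roughly 0.025 and 5.57 times the point estimate, which shows why intervals based on very few observed mutations span orders of magnitude.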
Affiliation(s)
- Gerda Saxer
- Department of Ecology and Evolutionary Biology, Rice University, Houston, Texas, United States of America.
|
20
|
Fischer M, Snajder R, Pabinger S, Dander A, Schossig A, Zschocke J, Trajanoski Z, Stocker G. SIMPLEX: cloud-enabled pipeline for the comprehensive analysis of exome sequencing data. PLoS One 2012; 7:e41948. [PMID: 22870267 PMCID: PMC3411592 DOI: 10.1371/journal.pone.0041948] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Received: 02/23/2012] [Accepted: 06/28/2012] [Indexed: 01/24/2023] Open
Abstract
In recent studies, exome sequencing has proven to be a successful screening tool for the identification of candidate genes causing rare genetic diseases. Although underlying targeted sequencing methods are well established, necessary data handling and focused, structured analysis still remain demanding tasks. Here, we present a cloud-enabled autonomous analysis pipeline, which comprises the complete exome analysis workflow. The pipeline combines several in-house developed and published applications to perform the following steps: (a) initial quality control, (b) intelligent data filtering and pre-processing, (c) sequence alignment to a reference genome, (d) SNP and DIP detection, (e) functional annotation of variants using different approaches, and (f) detailed report generation during various stages of the workflow. The pipeline connects the selected analysis steps, exposes all available parameters for customized usage, performs required data handling, and distributes computationally expensive tasks either on a dedicated high-performance computing infrastructure or on the Amazon cloud environment (EC2). The presented application has already been used in several research projects including studies to elucidate the role of rare genetic diseases. The pipeline is continuously tested and is publicly available under the GPL as a VirtualBox or Cloud image at http://simplex.i-med.ac.at; additional supplementary data is provided at http://www.icbi.at/exome.
Affiliation(s)
- Maria Fischer
- Division for Bioinformatics, Biocenter, Innsbruck Medical University, Innsbruck, Austria
- Rene Snajder
- Division for Bioinformatics, Biocenter, Innsbruck Medical University, Innsbruck, Austria
- Oncotyrol, Center for Personalized Cancer Medicine, Innsbruck, Austria
- Stephan Pabinger
- Division for Bioinformatics, Biocenter, Innsbruck Medical University, Innsbruck, Austria
- Andreas Dander
- Division for Bioinformatics, Biocenter, Innsbruck Medical University, Innsbruck, Austria
- Oncotyrol, Center for Personalized Cancer Medicine, Innsbruck, Austria
- Anna Schossig
- Division of Human Genetics, Biocenter, Innsbruck Medical University, Innsbruck, Austria
- Johannes Zschocke
- Division of Human Genetics, Biocenter, Innsbruck Medical University, Innsbruck, Austria
- Zlatko Trajanoski
- Division for Bioinformatics, Biocenter, Innsbruck Medical University, Innsbruck, Austria
- Gernot Stocker
- Division for Bioinformatics, Biocenter, Innsbruck Medical University, Innsbruck, Austria
|
21
|
Golovko G, Khanipov K, Rojas M, Martinez-Alcántara A, Howard JJ, Ballesteros E, Gupta S, Widger W, Fofanov Y. Slim-filter: an interactive Windows-based application for Illumina Genome Analyzer data assessment and manipulation. BMC Bioinformatics 2012; 13:166. [PMID: 22800377 PMCID: PMC3505481 DOI: 10.1186/1471-2105-13-166] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 10/16/2011] [Accepted: 06/18/2012] [Indexed: 11/10/2022] Open
Abstract
Background The emergence of Next Generation Sequencing technologies has made it possible for individual investigators to generate gigabases of sequencing data per week. Effective analysis and manipulation of these data is limited due to large file sizes, so even simple tasks such as data filtration and quality assessment have to be performed in several steps. This requires (potentially problematic) interaction between the investigator and a bioinformatics/computational service provider. Furthermore, such services are often performed using specialized computational facilities. Results We present a Windows-based application, Slim-Filter, designed to interactively examine the statistical properties of sequencing reads produced by Illumina Genome Analyzer and to perform a broad spectrum of data manipulation tasks including: filtration of low quality and low complexity reads; filtration of reads containing undesired subsequences (such as parts of adapters and PCR primers used during the sample and sequencing libraries preparation steps); excluding duplicated reads (while keeping each read’s copy number information in a specialized data format); and sorting reads by copy numbers allowing for easy access and manual editing of the resulting files. Slim-Filter is organized as a sequence of windows summarizing the statistical properties of the reads. Each data manipulation step has roll-back abilities, allowing for return to previous steps of the data analysis process. Slim-Filter is written in C++ and is compatible with fasta, fastq, and specialized AS file formats presented in this manuscript. Setup files and a user’s manual are available for download at the supplementary web site (https://www.bioinfo.uh.edu/Slim_Filter/). Conclusion The presented Windows-based application has been developed with the goal of providing individual investigators with integrated sequencing reads analysis, curation, and manipulation capabilities.
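Two of the manipulation tasks listed in the abstract above (dropping reads that contain undesired subsequences, and collapsing duplicates while keeping copy numbers) can be sketched in a few lines. This is only an illustration in Python, not Slim-Filter's C++ implementation; the function names and the `(sequence, copy_number)` output are assumptions, and the tool's specialized AS format is not reproduced here.

```python
from collections import Counter

def drop_reads_with_subsequence(reads, unwanted):
    """Keep only reads that contain none of the unwanted subsequences
    (for example, adapter or primer fragments)."""
    return [r for r in reads if not any(u in r for u in unwanted)]

def collapse_duplicates(reads):
    """Collapse identical reads into (sequence, copy_number) pairs,
    sorted by copy number in descending order.

    Counter.most_common() keeps first-seen order among ties.
    """
    return Counter(reads).most_common()

if __name__ == "__main__":
    raw = ["ACGT", "TTTT", "ACGT", "GGCC", "ACGT", "TTTT", "AGATCGGA"]
    kept = drop_reads_with_subsequence(raw, unwanted=["AGATCGG"])
    for seq, copies in collapse_duplicates(kept):
        print(seq, copies)
```

Storing each distinct read once with its copy number keeps the information needed for later analysis while shrinking the file, which is the motivation the abstract gives for the duplicate-handling step.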
Affiliation(s)
- Georgiy Golovko
- Center for BioMedical and Environmental Genomics, University of Houston, Houston, TX, USA.
|
22
|
Zhang T, Luo Y, Liu K, Pan L, Zhang B, Yu J, Hu S. BIGpre: a quality assessment package for next-generation sequencing data. Genomics Proteomics Bioinformatics 2012; 9:238-44. [PMID: 22289480 PMCID: PMC5054156 DOI: 10.1016/s1672-0229(11)60027-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 09/04/2011] [Accepted: 11/23/2011] [Indexed: 11/25/2022]
Abstract
The emergence of next-generation sequencing (NGS) technologies has significantly improved sequencing throughput and reduced costs. However, the short read length, duplicate reads and massive volume of data make the data processing much more difficult and complicated than the first-generation sequencing technology. Although there are some software packages developed to assess the data quality, those packages either are not easily available to users or require bioinformatics skills and computer resources. Moreover, almost all the quality assessment software currently available did not take into account the sequencing errors when dealing with the duplicate assessment in NGS data. Here, we present a new user-friendly quality assessment software package called BIGpre, which works for both Illumina and 454 platforms. BIGpre contains all the functions of other quality assessment software, such as the correlation between forward and reverse reads, read GC-content distribution, and base Ns quality. More importantly, BIGpre incorporates associated programs to detect and remove duplicate reads after taking sequencing errors into account and trimming low quality reads from raw data as well. BIGpre is primarily written in Perl and integrates graphical capability from the statistics package R. This package produces both tabular and graphical summaries of data quality for sequencing datasets from Illumina and 454 platforms. Processing hundreds of millions of reads within minutes, this package provides immediate diagnostic information for users to manipulate sequencing data for downstream analyses. BIGpre is freely available at http://bigpre.sourceforge.net/.
Affiliation(s)
- Tongwu Zhang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China
- James D. Watson Institute of Genome Sciences, Zhejiang University, Hangzhou 31007, China
- Yingfeng Luo
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China
- Kan Liu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China
- Linlin Pan
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China
- Bing Zhang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China
- Jun Yu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China
- Songnian Hu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China
- Corresponding author.
|
23
|
Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 2012; 7:e30619. [PMID: 22312429 PMCID: PMC3270013 DOI: 10.1371/journal.pone.0030619] [Citation(s) in RCA: 2054] [Impact Index Per Article: 171.2] [Received: 09/08/2011] [Accepted: 12/21/2011] [Indexed: 11/18/2022] Open
Abstract
Next generation sequencing (NGS) technologies provide a high-throughput means to generate large amounts of sequence data. However, quality control (QC) of sequence data generated from these technologies is extremely important for meaningful downstream analysis. Further, highly efficient and fast processing tools are required to handle the large volume of datasets. Here, we have developed an application, NGS QC Toolkit, for quality check and filtering of high-quality data. This toolkit is a standalone and open source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html. All the tools in the application have been implemented in the Perl programming language. The toolkit is comprised of user-friendly tools for QC of sequencing data generated using Roche 454 and Illumina platforms, and additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools). A variety of options have been provided to facilitate the QC at user-defined parameters. The toolkit is expected to be very useful for the QC of NGS data to facilitate better downstream analysis.
Affiliation(s)
- Ravi K. Patel
- Functional Genomics and Bioinformatics Laboratory, National Institute of Plant Genome Research (NIPGR), New Delhi, India
- Mukesh Jain
- Functional Genomics and Bioinformatics Laboratory, National Institute of Plant Genome Research (NIPGR), New Delhi, India
|
24
|
Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 2012. [PMID: 22312429 DOI: 10.1371/journal.0030619] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 03/14/2023] Open
Affiliation(s)
- Ravi K Patel
- Functional Genomics and Bioinformatics Laboratory, National Institute of Plant Genome Research (NIPGR), New Delhi, India
|
26
|
Su Z, Ning B, Fang H, Hong H, Perkins R, Tong W, Shi L. Next-generation sequencing and its applications in molecular diagnostics. Expert Rev Mol Diagn 2011; 11:333-43. [PMID: 21463242 DOI: 10.1586/erm.11.3] [Citation(s) in RCA: 125] [Impact Index Per Article: 9.6] [Indexed: 12/13/2022]
Abstract
DNA sequencing is a powerful approach for decoding a number of human diseases, including cancers. The advent of next-generation sequencing (NGS) technologies has reduced sequencing cost by orders of magnitude and significantly increased the throughput, making whole-genome sequencing a possible way for obtaining global genomic information about patients on whom clinical actions may be taken. However, the benefits offered by NGS technologies come with a number of challenges that must be adequately addressed before they can be transformed from research tools to routine clinical practices. This article provides an overview of four commonly used NGS technologies from Roche Applied Science/454 Life Sciences, Illumina, Life Technologies and Helicos Biosciences. The challenges in the analysis of NGS data and their potential applications in clinical diagnosis are also discussed.
Affiliation(s)
- Zhenqiang Su
- Z-Tech, an ICF International Company at US FDA's National Center for Toxicological Research, 3900 NCTR Road, Jefferson, AR 72079, USA
|
27
|
Hummel M, Bonnin S, Lowy E, Roma G. TEQC: an R package for quality control in target capture experiments. Bioinformatics 2011; 27:1316-7. [PMID: 21398674 DOI: 10.1093/bioinformatics/btr122] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 01/08/2023] Open
Abstract
TEQC is an R/Bioconductor package for quality assessment of target enrichment experiments. Quality measures comprise specificity and sensitivity of the capture, enrichment, per-target read coverage and its relation to hybridization probe characteristics, coverage uniformity and reproducibility, and read duplicate analysis. Several diagnostic plots allow visual inspection of the data quality. Availability and implementation: TEQC is implemented in the R language (version >2.12.0) and is available as a Bioconductor package for Linux, Windows and MacOS from www.bioconductor.org.
Affiliation(s)
- Manuela Hummel
- Core Facilities, Centre for Genomic Regulation, UPF, Barcelona, Spain.
|
28
|
Stolzenberg D, Grant PA, Bekiranov S. Epigenetic methodologies for behavioral scientists. Horm Behav 2011; 59:407-16. [PMID: 20955712 PMCID: PMC3093106 DOI: 10.1016/j.yhbeh.2010.10.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 05/13/2010] [Revised: 10/05/2010] [Accepted: 10/05/2010] [Indexed: 01/12/2023]
Abstract
Hormones are essential regulators of many behaviors. Steroids bind either to nuclear or membrane receptors while peptides primarily act via membrane receptors. After a ligand binds, the conformational change in the receptor initiates changes in cell signaling cascades (membrane receptors) or direct alterations in DNA transcription (steroid receptors). Changes in gene transcription that result are responsible for protein production and ultimately behavioral modifications. A significant part of how hormones affect DNA transcription is via epigenetic modifications of DNA and/or the chromatin in which it is entwined. These alterations lead to transcriptional changes that ultimately define the phenotype and function of a given cell. Importantly, we now know that environmental stimuli influence epigenetic marks, which in the context of neuroendocrinology can lead to behavioral changes. Importantly, tracking epigenetic states and profiling the epigenome within cells require the use of epigenetic methodologies and subsequent data analysis. Here we describe the techniques of particular importance in the mapping of DNA methylation, histone modifications and occupancy of chromatin bound effector proteins that regulate gene expression. For researchers wanting to move into these levels of analysis we discuss the application of modern sequencing technologies applied in assays such as chromatin immunoprecipitation and the bioinformatics analysis involved in the rich datasets generated.
Affiliation(s)
- Stefan Bekiranov
- Corresponding author: Dr. Stefan Bekiranov, PO Box 800733, University of Virginia School of Medicine, Charlottesville VA 22908
|
29
|
Dai M, Thompson RC, Maher C, Contreras-Galindo R, Kaplan MH, Markovitz DM, Omenn G, Meng F. NGSQC: cross-platform quality analysis pipeline for deep sequencing data. BMC Genomics 2010; 11 Suppl 4:S7. [PMID: 21143816 PMCID: PMC3005923 DOI: 10.1186/1471-2164-11-s4-s7] [Citation(s) in RCA: 84] [Impact Index Per Article: 6.0] [Indexed: 11/24/2022] Open
Abstract
Background While the accuracy and precision of deep sequencing data are significantly better than those obtained by the earlier generation of hybridization-based high throughput technologies, the digital nature of deep sequencing output often leads to unwarranted confidence in their reliability. Results The NGSQC (Next Generation Sequencing Quality Control) pipeline provides a set of novel quality control measures for quickly detecting a wide variety of quality issues in deep sequencing data derived from two dimensional surfaces, regardless of the assay technology used. It also enables researchers to determine whether sequencing data related to their most interesting biological discoveries are caused by sequencing quality issues. Conclusions Next generation sequencing platforms have their own share of quality issues and there can be significant lab-to-lab, batch-to-batch and even within chip/slide variations. NGSQC can help to ensure that biological conclusions, in particular those based on relatively rare sequence alterations, are not caused by low quality sequencing.
Affiliation(s)
- Manhong Dai
- Department of Psychiatry, University of Michigan, Ann Arbor, MI 48109, USA.
|
30
|
SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010; 11:485. [PMID: 20875133 PMCID: PMC2956736 DOI: 10.1186/1471-2105-11-485] [Citation(s) in RCA: 975] [Impact Index Per Article: 69.6] [Received: 04/21/2010] [Accepted: 09/27/2010] [Indexed: 01/09/2023] Open
Abstract
Background Illumina's second-generation sequencing platform is playing an increasingly prominent role in modern DNA and RNA sequencing efforts. However, rapid, simple, standardized and independent measures of run quality are currently lacking, as are tools to process sequences for use in downstream applications based on read-level quality data. Results We present SolexaQA, a user-friendly software package designed to generate detailed statistics and at-a-glance graphics of sequence data quality both quickly and in an automated fashion. This package contains associated software to trim sequences dynamically using the quality scores of bases within individual reads. Conclusion The SolexaQA package produces standardized outputs within minutes, thus facilitating ready comparison between flow cell lanes and machine runs, as well as providing immediate diagnostic information to guide the manipulation of sequence data for downstream analyses.
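The abstract above mentions trimming sequences dynamically using the quality scores of bases within individual reads. One plausible reading of that idea is to keep, for each read, the longest contiguous stretch of bases whose quality meets a cutoff. The sketch below implements that rule in Python; the function names, the default cutoff of 20, and the Phred+33 encoding are assumptions chosen for illustration and are not taken from the abstract (older Illumina files used a +64 offset).

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(c) - offset for c in quality_string]

def longest_high_quality_segment(scores, cutoff=20):
    """Return (start, end) of the longest run with every score >= cutoff.

    The range is half-open; (0, 0) means no base passed the cutoff.
    """
    best_start, best_len = 0, 0
    run_start = None
    for i, q in enumerate(scores):
        if q >= cutoff:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None  # a low-quality base ends the current run
    return best_start, best_start + best_len

def trim_read(sequence, quality_string, cutoff=20, offset=33):
    """Trim a read and its quality string to the best segment."""
    scores = phred_scores(quality_string, offset)
    start, end = longest_high_quality_segment(scores, cutoff)
    return sequence[start:end], quality_string[start:end]
```

Because the cut points are chosen per read rather than at a fixed position, reads that stay good for longer keep more of their bases, which is the practical appeal of dynamic over fixed-length trimming.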
|
31
|
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010. [PMID: 20644199 DOI: 10.1101/gr.107524.110.20] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.8] [Indexed: 04/14/2023]
Abstract
Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
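The abstract above describes analysis tools written in a MapReduce style, with a coverage calculator as one example. The sketch below shows the general shape of that pattern in Python: a traversal yields loci, a map step turns each locus into a small value, and a reduce step folds the values into a summary. It is not the GATK API (GATK itself is a Java framework); every name here, and the representation of reads as half-open `(start, end)` intervals, is a hypothetical choice made for illustration.

```python
from functools import reduce

def pileup_depths(reads, region_start, region_end):
    """Traversal: yield (position, depth) for each locus in the region.

    Reads are half-open (start, end) intervals; this naive loop stands in
    for the data-access layer a real framework would provide.
    """
    for pos in range(region_start, region_end):
        depth = sum(1 for start, end in reads if start <= pos < end)
        yield pos, depth

def map_locus(locus):
    """Map: turn one locus into a (loci_seen, total_depth) contribution."""
    _pos, depth = locus
    return 1, depth

def reduce_coverage(acc, value):
    """Reduce: add one locus contribution to the running totals."""
    return acc[0] + value[0], acc[1] + value[1]

def mean_coverage(reads, region_start, region_end):
    """Mean depth over a region, computed as map followed by reduce."""
    mapped = map(map_locus, pileup_depths(reads, region_start, region_end))
    n_loci, total_depth = reduce(reduce_coverage, mapped, (0, 0))
    return total_depth / n_loci if n_loci else 0.0
```

Keeping the per-locus calculation separate from the traversal is the point of the pattern: the map and reduce functions stay short, while the traversal can be optimized or parallelized without changing them.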
Affiliation(s)
- Aaron McKenna
- Program in Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA
|
32
|
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010. [PMID: 20644199 DOI: 10.1101/gr.107524.110.] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Affiliation(s)
- Aaron McKenna
- Program in Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA
|
33
|
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010; 20:1297-303. [PMID: 20644199 DOI: 10.1101/gr.107524.110] [Citation(s) in RCA: 17296] [Impact Index Per Article: 1235.4] [Indexed: 02/07/2023]
Affiliation(s)
- Aaron McKenna
- Program in Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA
|